*Article* **ELCT-YOLO: An Efficient One-Stage Model for Automatic Lung Tumor Detection Based on CT Images**

**Zhanlin Ji 1,2, Jianyong Zhao 1, Jinyun Liu 1, Xinyi Zeng 1, Haiyang Zhang 3, Xueji Zhang 4,\* and Ivan Ganchev 2,5,6,\***


**Abstract:** Research on automatic lung cancer detection using deep learning algorithms has achieved good results but, due to the complexity of tumor edge features and possible changes in tumor positions, diagnosing patients with lung tumors based on computed tomography (CT) images remains a great challenge. In order to address the multi-scale problem and meet the requirements of real-time detection, an efficient one-stage model for automatic lung tumor detection in CT images, called ELCT-YOLO, is presented in this paper. Instead of deepening the backbone or relying on a complex feature fusion network, ELCT-YOLO uses a specially designed neck structure, which is suitable for enhancing the multi-scale representation ability of the entire feature layer. At the same time, in order to solve the problem of the receptive field lost after decoupling, the proposed model uses a novel Cascaded Refinement Scheme (CRS), composed of two different types of receptive field enhancement modules (RFEMs), which enables expanding the effective receptive field and aggregating multi-scale context information, thus improving the tumor detection performance of the model. The experimental results show that the proposed ELCT-YOLO model has a strong ability to express multi-scale information and good robustness in detecting lung tumors of various sizes.

**Keywords:** lung cancer; tumor; CT image; one-stage detector; YOLO; multi-scale; receptive field

**MSC:** 68W11; 9404

#### **1. Introduction**

Lung cancer is a common disease with a higher mortality rate than other cancers; it is the main cause of cancer death [1]. According to the American Cancer Society, the number of new lung cancer cases in the United States is expected to reach 238,340 this year, with 127,070 deaths resulting from the disease. Computed tomography (CT) imaging is the most commonly employed method for detecting lung diseases [2,3]. Regular CT screening of people at high risk of developing lung cancer can reduce the risk of dying from this disease. Professional doctors can diagnose lung cancer according to the morphological characteristics of the lesions in CT images. However, CT scans produce huge amounts of image data, which increases the difficulty of performing a proper disease diagnosis. Furthermore, doctors may make a wrong diagnosis due to long work shifts and monotonous work. In addition, even experienced doctors and experts can easily miss small potential lesions. Therefore, automatic detection of lung tumors based on CT images needs to be further advanced to improve the quality of diagnosis.

**Citation:** Ji, Z.; Zhao, J.; Liu, J.; Zeng, X.; Zhang, H.; Zhang, X.; Ganchev, I. ELCT-YOLO: An Efficient One-Stage Model for Automatic Lung Tumor Detection Based on CT Images. *Mathematics* **2023**, *11*, 2344. https:// doi.org/10.3390/math11102344

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 9 April 2023 Revised: 6 May 2023 Accepted: 15 May 2023 Published: 17 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

Accurate detection of lung cancer is a challenging task. On the one hand, tumors have complex edge features and may change their position [4]. As illustrated in Figure 1a, showing the CT chest images of patients with lung cancers, the texture, gray scale, and shape of tumors are important for clinical staging and pathological classification [5]. On the other hand, redundant image information causes difficulties in the detection task. For example, the images of abundant blood vessels, bronchi, and tiny nodules in the lung interfere with the unique features of tumors. In addition, tumors have different sizes (Figure 1b) and different types of tumors have different growth rates; for example, the multiplication rate of lung squamous cell carcinoma is lower than that of lung adenocarcinoma. Moreover, tumors of the same type have different sizes at different stages of their development [6], and a single tumor naturally appears at different sizes across multiple CT scanning slices. The challenge brought by the difference in tumor sizes seriously limits the accuracy of existing tumor detection methods.

**Figure 1.** (**a**) Sample CT chest images of four patients with lung cancer showing round or irregular masses of different size with uniform or nonuniform density; (**b**) tumor size distribution in the dataset, used in the experiments presented further in this paper (the tumor sizes are highly variable, making it difficult to accurately locate and classify tumors).

To date, a lot of work has been done on the automatic detection of lung lesions. Early computer-aided lung cancer detection methods mainly relied on an artificially designed feature extractor, which obtains the gray scale, texture, and other morphological features of a tumor in an image; these features are subsequently fed into a Support Vector Machine (SVM) or AdaBoost for classification. However, artificially designed features cannot cope well with the highly variable tumor size, position, and edge, thus limiting the detection ability of these methods [7]. Recently, as deep learning has been increasingly applied in various medical and health care fields, many researchers have devoted themselves to the study of lung tumor detection based on deep neural networks (DNNs) [8,9]. Unlike traditional methods relying on artificial design, DNNs have a large number of parameters and can fit semantic features better.

Gong et al. [10] used a deep residual network to identify lung adenocarcinoma in CT images, and obtained comparable or even superior outcomes compared to radiologists. Mei et al. [11] conducted experiments on the PN9 dataset to detect lung nodules in CT scans using a slice-aware network. The results showed that the proposed SANet outperformed other 2D and 3D convolutional neural network (CNN) methods and significantly reduced the false positive rate (FPR) for lung nodules. Xu et al. [12] designed a slice-grouped domain attention (SGDA) module that can be easily embedded into existing backbone networks to improve the detection network's generalization ability. Su et al. [13] used the Bag of Visual Words (BoVW) and a convolutional recurrent neural network (CRNN) to detect lung tumors in CT images. The model first segments the CT images into smaller nano-segments using biocompatibility techniques, and then classifies the nano-segments using deep learning techniques. Mousavi et al. [14] introduced a detection approach based on a deep neural network for identifying COVID-19 and other lung infections. More specifically, their method involves using a deep neural network to extract features from chest X-ray images, employing an LSTM network for sequence modeling, and utilizing a SoftMax classifier for image classification. This method shows excellent performance in detecting COVID-19 and can help radiologists make diagnoses quickly. In [15], Mei et al. utilized a depth-wise over-parameterized convolutional layer to construct a residual unit in the backbone network, leading to improved feature representation ability of the network. Moreover, the study also implemented enhancements in the confidence loss function and focal loss to handle the significant imbalance between positive and negative samples during training. It is noteworthy that this method focuses on the efficiency and practicability of the detector. Version 4 of You Only Look Once (YOLO), i.e., YOLOv4, was used as a benchmark for this method, but there have been few studies using YOLO to detect lung tumors so far.

Although many processing methods exist for automated detection of lung tumors in CT images, the variability of tumor size is rarely considered. As indicated above, the size of lung tumors exhibits variability, thus posing challenges for precise tumor detection. As the multi-scale issue constrains the efficacy of prevalent detection methods, some researchers have paid attention to this issue and proposed improvements to the existing methods. Causey et al. [16] utilized 3D convolution in combination with Spatial Pyramid Pooling (SPP) to develop a lung cancer detection algorithm, which enabled reducing the FPR on the National Lung Screening Trial (NLST) data cohort used for testing, whereby the area under the curve (AUC) value reached 0.892, proving that the detection performance is better than that of using only 3D convolution. Compared with detecting 2D slices one by one, 3D convolution can be used to obtain rich spatial and volumetric information from adjacent slices, and such models can be generalized to sparsely annotated datasets. However, 3D convolution consumes more computer memory than conventional convolution. Other studies have proposed feature pyramid networks (FPNs), whereby the recognition of small-size tumors depends on features from the shallow network, while the top-level network has more abundant semantic information, which is important for the accurate classification of tumors. The purpose of an FPN is to connect the feature maps spanning different layers, so as to restore the low-resolution information of the deep feature map and enhance the semantic information of the shallow feature map. In order to effectively integrate multi-scale information, Guo et al. [17] fused feature maps at different layers. In [18], Guo and Bai constructed an FPN to detect multi-scale lung nodules, thus significantly improving the accuracy of small lung nodule detection.
In [19], by applying a bi-directional FPN (BiFPN), the feature fusion structure of YOLOv5 was improved, and a fusion path was added between features at the same layer. Other improvements of feature fusion networks have also achieved good results in other tasks [20]. The original FPN structure and its variants adopt complex cross-scale connections to obtain a stronger multi-scale representation ability. Although helpful for improving multi-scale tumor detection, this operation requires more parameters and increases computational expenses, which runs contrary to the general expectation of a highly efficient detector.

Taking inspiration from the previous work, we have considered a real-world hospital scenario where a large volume of CT data is available but hardware resources are limited. To reduce hardware costs while maintaining the speed of tumor detection, we have selected YOLOv7 as the underlying framework, as it can achieve a good balance between accuracy and tumor detection speed without requiring the generation of candidate boxes, as opposed to the two-stage detection models. In this paper, we propose a novel one-stage detection model, called ELCT-YOLO, based on the popular YOLOv7-tiny network architecture [21], for solving the problem of multi-scale lung tumor detection in CT scan slices. For ELCT-YOLO, firstly, we designed a Decoupled Neck (DENeck) structure to improve the multi-scale feature representation ability of the model. Different from the previous design schemes of feature fusion structures [22,23], we do not stack a large number of basic structures, nor build a complex topology. We propose the idea of decoupling the feature layer into a high-semantic region and a low-semantic region, so as to reduce semantic conflict in the fusion process. Secondly, we propose a Cascaded Refinement Scheme (CRS), which includes a group of Receptive Field Enhancement Modules (RFEMs) to explore rich context information. Using atrous convolution, we constructed two multi-scale sensing structures, namely a Series RFEM (SRFEM) and a Parallel RFEM (PRFEM). In order to expand the effective receptive field, the serial structure uses a series of atrous convolutions with different sampling rates; at the same time, a residual connection is applied to alleviate the grid artifacts, as per [24]. The parallel structure constructs complementary receptive fields, in which each branch matches the amount of information based on its own receptive field. In addition, we studied the performance of different cascaded schemes through experiments.

The main contributions of this paper can be summarized as follows:


#### **2. Related Work**

Deep learning-based object detection methods have significant value in medical applications, such as breast cancer detection [26], retinal lesion detection [27], rectal cancer detection [28], and lung nodule detection [29]. Many of the methods listed above rely on the YOLO family as their foundation, demonstrating a very fast processing speed. Although the YOLO family offers a good speed–precision balance, it is not effective in detecting lesions with scale changes in CT images. In the next subsections, we first introduce the YOLO principles and then introduce the currently popular methods of dealing with multi-scale problems, namely feature pyramids and exploring multi-scale context information [30].

#### *2.1. The YOLO Family*

Two major types of deep learning models are currently employed for object detection. The first kind pertains to object detection models that rely on region proposals, such as Regions with CNN (R-CNN), Fast R-CNN, Faster R-CNN, and Mask R-CNN. The second type comprises object detection models that use regression analysis for detection, such as the YOLO and SSD series. While the accuracy of two-stage object detection models has improved significantly over time, their detection speed is limited by their structure [31]. The YOLO model [32], proposed by Redmon et al. in 2015, was the pioneering one-stage detector in the field of deep learning. The main dissimilarity between one-stage and two-stage object detectors is that the former have no candidate region recommendation stage, which enables them to directly determine the object category and obtain the position of detection boxes in one stage. Due to YOLO's good speed–precision balance, YOLO-related research has always received much attention. With the introduction of subsequent versions of YOLO, its performance continues to improve.

The second version of YOLO, YOLOv2 [33], uses Darknet-19 as a backbone network, removes the full connection layer, and uses a pooling method to obtain fixed-size feature vectors. A 13 × 13 feature map is obtained after down-sampling of a 416 × 416 input image 5 times. YOLOv2 uses the ImageNet dataset and the Common Objects in COntext (COCO) dataset to train the detector and locate the position of objects in the detection dataset, and utilizes the classification dataset to increase the categories of objects recognized by the detector. This joint training method overcomes the limitation of object detection tasks in terms of categories.

To enhance multi-scale prediction accuracy, YOLOv3 [34] introduces an FPN, combining the feature maps of C3, C4, and C5 in Darknet-53 through lateral connections. Finally, the model generates prediction maps for three different scales, enabling it to detect objects of various sizes, including large, medium, and small ones. By using a K-means clustering algorithm, YOLOv3 analyzes the information in the ground truth boxes of the training dataset to obtain nine types of prior bounding boxes, which can cover objects of multiple scales in the dataset. Each prediction branch uses anchors to generate three kinds of prediction boxes for the objects falling into the region, and finally uses a non-maximum suppression algorithm to filter the prediction box set. Compared with the previous two versions, YOLOv3 improves the detection ability and positioning accuracy for small objects.
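The anchor-clustering step described above can be sketched as follows. The 1 − IoU distance (treating each (w, h) pair as a box anchored at the origin) follows the choice made in YOLOv2/v3; the initialization and stopping criteria here are illustrative assumptions, not the exact YOLO implementation:

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, both treated as boxes anchored at the origin."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """Cluster ground-truth (w, h) boxes with the 1 - IoU distance to get k priors."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # assign each box to the anchor with the highest IoU (= smallest 1 - IoU)
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]  # sorted by area
```

Running this on the training set's box widths and heights yields the nine priors, which are then split three per prediction scale (largest anchors to the coarsest grid).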

YOLOv4 [35] uses CSPDarknet-53 as a backbone network, which combines Darknet-53 with a Cross-Stage Partial Network (CSPNet). The neck of YOLOv4 uses SPP and Path Aggregation Network (PANet) modules. The fundamental concept of the SPP module is to leverage max pooling operations with different kernel sizes to extract features, which helps obtain rich context information. PANet shortens the transmission path of information by propagating positional information from lower to higher levels. Unlike the original PANet, YOLOv4 replaces the original shortcut connection with the tensor *concat*. In addition, YOLOv4 also uses Mosaic and other data augmentation methods.

Jocher et al. [25] introduced YOLOv5, the first version of YOLO implemented in PyTorch. Due to the mature ecosystem of PyTorch, YOLOv5 deployment is simpler. YOLOv5 adds adaptive anchor box calculation: when the best possible recall is less than 0.98, the K-means clustering algorithm is utilized to determine the most suitable sizes for the anchor boxes. YOLOv5 uses an SPPF module to replace the SPP module. As shown in Figure 2, the SPPF module employs several cascaded pooling kernels of small size instead of the single large pooling kernel used in the SPP module, which further improves the operational speed. In the subsequent neck module, YOLOv5 replaces the ordinary convolution with a CSP\_2X structure to enhance feature fusion.

**Figure 2.** The SPPF module structure.
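A minimal PyTorch sketch of the SPPF idea follows. The hidden channel width (`c_in // 2`) and the plain 1 × 1 convolutions are assumptions modeled on common YOLOv5-style implementations, not the exact YOLOv5 code; the key point is that three cascaded 5 × 5 max pools reproduce the 5/9/13 receptive fields of the parallel SPP pools:

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """SPPF sketch: one 5x5 max pool applied three times in a row; the three
    intermediate results match the 5x5, 9x9, and 13x13 pools of the SPP module."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2                      # assumed hidden width
        self.cv1 = nn.Conv2d(c_in, c_mid, 1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = nn.Conv2d(c_mid * 4, c_out, 1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)          # effective 5x5 pool
        y2 = self.pool(y1)         # effective 9x9 pool
        y3 = self.pool(y2)         # effective 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```

The speed-up comes from reusing each pooling result: two stride-1 5 × 5 max pools over a feature map give exactly the same output as a single 9 × 9 pool, so no large kernel is ever evaluated.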

YOLOv6 [36] also focuses on detection accuracy and inference efficiency. YOLOv6-s can achieve an average precision (AP) of 0.431 on COCO and an inference speed of 520 frames per second (FPS) on Tesla T4 graphics cards. Based on the Re-parameterization VGG (RepVGG) style, YOLOv6 uses re-parameterized and more efficient networks, namely EfficientRep in the backbone and a Re-parameterization Path Aggregation Network (Rep-PAN) in the neck. The Decoupled Head is optimized, which reduces the additional latency overhead brought by the decoupled head method while maintaining good accuracy. In terms of training strategy, YOLOv6 adopts the anchor-free paradigm, supplemented by a simplified optimal transport assignment (SimOTA) label allocation strategy and a SIoU [37] bounding box regression loss in order to further improve the detection accuracy.

YOLOv7 [21] enhances the network's learning ability without breaking the original gradient flow by utilizing an extended efficient layer aggregation network (E-ELAN) module (Figure 3). In addition, YOLOv7 utilizes architecture optimization methods to enhance object detection accuracy without increasing the inference costs, redesigns the re-parameterized convolution by analyzing the gradient flow propagation path, introduces an auxiliary head to improve its performance, and employs a new deep supervision label allocation strategy. The ELCT-YOLO model, proposed in this paper, is based on improvements of the popular YOLOv7-tiny network architecture [21], as described further in Section 3.

**Figure 3.** The E-ELAN module structure.

YOLOv8, the latest version of YOLO at the time of writing, has achieved a significant improvement in both detection accuracy and speed, lifting object detection to a new level. YOLOv8 is not only compatible with all previous YOLO versions, but also adopts the latest anchor-free paradigm, which reduces the computational load and breaks away from the width and height limits of fixed anchor boxes. However, the authors of YOLOv8 have not yet published a paper explaining its advantages in detail.

#### *2.2. Multi-Scale Challenge and FPN*

Both the one-stage and the two-stage object detectors face the challenge of multi-scale detection. As mentioned in the Introduction, tumors in different CT images, as well as the focus area in different slices of the same tumor, differ in scale. Existing CNNs have a limited ability to extract multi-scale features, because continuous pooling operations, or convolution operations with a stride greater than 1, reduce the resolution of the feature map, resulting in a conflict between semantic information and spatial information [38]. An approach commonly used to address the challenge of detecting objects at multiple scales is to create an FPN by combining features from different layers.

An FPN architecture that combines the deep feature map with the shallow feature map was proposed by Lin et al. in [39]. They believed that the network's deep features contain strong semantic information, while the shallow features contain strong spatial information. The combination is achieved through multiple up-sampling layers. By utilizing the inherent feature layer of ConvNet, FPN constructs a feature pyramid structure that can greatly enhance the detection network's ability to handle objects of various scales, with minimal additional cost. The use of this network structure has become prevalent for addressing multi-scale problems in the realm of object detection due to its efficacy and versatility.

By integrating a bottom-up pathway with the FPN architecture, PANet [40] can effectively enhance the spatial information within the feature pyramid structure. NAS-FPN [22] uses the Neural Architecture Search (NAS) algorithm to find the optimal cross-scale connection architecture, based on the belief that artificially designed feature pyramid structures have limited representation ability. In addition, BiFPN [23] and Recursive-FPN [41] propose weighted feature fusion and a detector backbone based on looking and thinking twice, respectively, to obtain a strong feature representation. Generally speaking, these methods focus on introducing additional optimization modules to obtain a better multi-scale representation. In the ELCT-YOLO model proposed in this paper, we use a decoupling method to aggregate multi-scale features, which allows detecting lung tumors more accurately without increasing the complexity of the model.

#### *2.3. Exploring Context Information by Using Enlarged Receptive Field*

Rich context information is helpful for detecting objects with scale changes [42]. Many studies have explored context information using an enlarged receptive field, realized mostly through pooling operations or atrous convolution.

PoolNet [43] proposes a global guidance module, which first uses adaptive average pooling to capture picture context information, and then fuses the information flow into feature maps of different scales to highlight objects in complex scenarios. ThunderNet [44] applies average pooling to obtain global contextual features from the highest level of the backbone network, which are then aggregated with features at other layers to increase the receptive field of the model. Generally speaking, in order to obtain abstract information, a CNN needs to repeat pooling operations, which results in focusing only on local regions. The lack of position information is detrimental to (intensive) detection tasks. Atrous convolution is a popular solution to this problem.

ACFN [45] and Deeplab-v2 [38] use atrous convolutions with various dilation rates instead of the repeated pooling operations in CNNs. This strategy can enlarge the receptive field while preserving complete positional information. Liu et al. [46] built a multi-sensor feature extraction module, which aggregates multi-scale context information by using atrous convolutions with the same convolution kernel size but different dilation rates. However, while improving the receptive field of the network, atrous convolution also brings challenges, because discrete sampling may lose some information and make the weight matrix discontinuous. Furthermore, irregularly arranged atrous convolutions with different dilation rates can aggravate this problem. This situation, called the gridding effect, is analyzed in [24]. Inspired by the above methods, we have designed two types of RFEMs (an SRFEM, using a serial combination, and a PRFEM, using a parallel combination) to sense multi-scale context information; both apply an appropriate dilation rate combination and a residual connection to reduce the gridding effect.

#### **3. Proposed Model**

#### *3.1. Overview*

The proposed ELCT-YOLO model, shown in Figure 4, is based on YOLOv7-tiny, which is a popular and efficient object detector. With an input image size of 512 × 512 × 3, where 512 represents the image's width and height and 3 represents the number of channels, features are efficiently extracted through a backbone network, which is mainly based on E-ELAN modules. An SPPF module is added at the top of the backbone network to extract important context features. By concatenating feature maps of various scales, the SPPF module enhances the network's receptive field and boosts both the efficiency of detecting objects at multiple scales and the inference speed. The output feature maps of *C*3, *C*4, and *C*5, obtained in the backbone at three different scales corresponding, respectively, to 8, 16, and 32 times down-sampling, are input to the neck for feature aggregation. As described in Section 2, the neck structure has an important impact on the accurate detection of lung tumors with scale changes. Therefore, we redesigned the original neck structure of YOLOv7-tiny, decoupling the feature pyramid into a high-semantic region and a low-semantic region, and called the result DENeck. Further, we propose a CRS structure to enhance the multi-scale feature representation capability by expanding the receptive field of the low-level semantic region. The three detection heads are used for anchor box classification and regression of large, medium, and small objects, respectively. The network performs detection on the feature maps output by the three detection heads P3, P4, and P5, whose corresponding scales are (16, 3, 80, 80, 7), (16, 3, 40, 40, 7), and (16, 3, 20, 20, 7), respectively. The first-dimension value (i.e., 16) in the output of the ELCT-YOLO detection head indicates that the model processes 16 images at once. The second-dimension value (i.e., 3) represents the use of k-means clustering to obtain three prior boxes of different sizes. The values 80, 40, and 20 in the third and fourth dimensions represent the detection of images at different granularities, corresponding to receptive fields of 8 × 8, 16 × 16, and 32 × 32, respectively. The fifth-dimension value represents the model's prediction information, including the predicted box information, the confidence in the presence of a tumor, and the classification information for adenocarcinoma and small cell carcinoma.

**Figure 4.** A high-level structure of the proposed ELCT-YOLO model (SRFEM is used for all CRSs as it enables achieving the best results, cf. Section 4.5).

#### *3.2. Decoupled Neck (DENeck)*

#### 3.2.1. Motivation

In YOLOv7-tiny, the neck adopts a structure similar to PANet to cope with the difficulty of performing multi-scale object detection. Its core idea is to use the multi-scale representation built into the CNN, which is generated by repeated down-sampling or pooling operations. As described in Section 2, PANet first fuses feature information from the top level to the bottom level, and then constructs a bottom-up secondary fusion path to generate enhanced semantic and detail information. However, this design may not be suitable in every situation.

First of all, this fusion method ignores the semantic differences between features of different scales [47]. In a linear combination of feature layers, adjacent feature layers are closer in semantic information, while feature layers that are far apart not only bring detailed semantic or spatial information, but also introduce confusing information in the process of transmission. Further, we believe that this conflict is more obvious when processing CT images. Unlike natural images, CT images are reconstructed by a specific algorithm based on the X-ray attenuation coefficient, and their quality is limited by the specific medical scenario. Compared with the datasets commonly used in computer vision tasks, CT images have a single background and low contrast [48]. These characteristics mean that the low-level features of CT images, such as the tumor edge and shape, require particular attention in the process of tumor detection, and semantic confusion destroys such details of the lesions. Based on this reasoning, our neck network reduces semantic conflicts through a decoupling method, which enhances the model's ability to detect tumors at different scales and emphasizes the tumor region in the CT image.

#### 3.2.2. Structure

We use the backbone of YOLOv7-tiny as a benchmark, where {*C*3, *C*4, *C*5} denote the corresponding feature layers generated by the backbone. The corresponding output feature maps of the same spatial size are denoted by $P_3^{out}$, $P_4^{out}$, and $P_5^{out}$, and the strides of these feature maps relative to the input image are {8, 16, 32} pixels, respectively.

As shown in Figure 4, the *P*3 branch in the blue area corresponds to low-level semantic information, including details of the tumor edge and shape. At the same time, discarding the information from *P*4 and *P*5 leads to an insufficient receptive field of the low-level semantic branch, so we propose a CRS structure to increase the receptive field of the low-level semantic region and improve the multi-scale feature representation ability. The *P*4 and *P*5 branches in the yellow area in Figure 4 correspond to high-level semantic information, which is crucial for determining the tumor type. We maintain cross-scale feature fusion between the higher levels because there is less conflict between them.

The designed DENeck feature aggregation method is as follows:

$$P_3^{out} = RFEM(C_3) \tag{1}$$

$$P_4^{out} = E\text{-}ELAN(concat[B_4, resize(C_5)]) + C_4 \tag{2}$$

$$P_5^{out} = E\text{-}ELAN\left(concat\left[B_5, down\left(P_4^{td}\right)\right]\right) + C_5 \tag{3}$$

where RFEM can be either a Series RFEM (SRFEM) or a Parallel RFEM (PRFEM), both of which were tried in different cascaded combinations for use in the proposed model (cf. Section 4.5); "+" denotes element-wise addition; *B*4 and *B*5 correspond to *C*4 and *C*5 processed by 1 × 1 convolution, respectively (we use independent 1 × 1 convolutional layers at different levels to reduce the differences in features between levels); $P_4^{td}$ denotes the feature obtained by fusing *C*4 and *C*5 after the *resize* operation, which includes up-sampling for resolution alignment and a 1 × 1 convolution for dimension adjustment; *concat* and *down* denote the tensor splicing operation and the down-sampling operation, respectively. E-ELAN is used after *concat* to reduce the aliasing caused by fusion. Batch normalization (BN) and the Sigmoid Weighted Linear Unit (SiLU) activation function are used behind all the convolutional layers in the DENeck structure.
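To make the dataflow of Equations (1)–(3) concrete, here is a heavily hedged PyTorch sketch: the RFEM and E-ELAN blocks are replaced by plain Conv + BN + SiLU stand-ins, *resize* is modeled as a 1 × 1 channel adjustment followed by nearest-neighbour up-sampling, and the Equation (2) output is fed to Equation (3) in place of the top-down feature (one plausible reading of the equations). None of this reproduces the exact ELCT-YOLO modules; only the wiring between levels is the point:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNSiLU(nn.Module):
    """Conv + BN + SiLU, the combination stated to follow every DENeck convolution."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())
    def forward(self, x):
        return self.block(x)

class DENeckSketch(nn.Module):
    """Wiring of Eqs. (1)-(3) with placeholder blocks for RFEM and E-ELAN."""
    def __init__(self, c3, c4, c5):
        super().__init__()
        self.rfem = ConvBNSiLU(c3, c3, 3)          # stand-in for the real RFEM
        self.b4, self.b5 = ConvBNSiLU(c4, c4), ConvBNSiLU(c5, c5)  # 1x1 per level
        self.resize5 = ConvBNSiLU(c5, c4)          # 1x1 channel adjust before upsample
        self.elan4 = ConvBNSiLU(c4 + c4, c4, 3)    # stand-in for E-ELAN
        self.down4 = nn.Sequential(ConvBNSiLU(c4, c5), nn.MaxPool2d(2))
        self.elan5 = ConvBNSiLU(c5 + c5, c5, 3)    # stand-in for E-ELAN

    def forward(self, C3, C4, C5):
        P3 = self.rfem(C3)                                          # Eq. (1)
        up5 = F.interpolate(self.resize5(C5), scale_factor=2)       # resize(C5)
        P4 = self.elan4(torch.cat([self.b4(C4), up5], 1)) + C4      # Eq. (2)
        P5 = self.elan5(torch.cat([self.b5(C5), self.down4(P4)], 1)) + C5  # Eq. (3)
        return P3, P4, P5
```

The shapes confirm the decoupling: the *P*3 branch never mixes with *P*4/*P*5, while the two high-semantic levels fuse with each other only.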

#### *3.3. Cascaded Refinement Scheme (CRS)*

While the Decoupled Neck paradigm helps improve detection performance, it leads to a loss of receptive field. Low-level semantic features lack receptive fields large enough to capture global context information, causing the detector to confuse tumor regions with the surrounding normal tissues. In addition, tumors of different sizes in CT images should be matched with receptive fields of different scales.

In response to this, we propose a CRS structure to further enlarge the effective receptive field. The CRS consists of two types of modules: an SRFEM, shown in Figure 5, and a PRFEM, shown in Figure 6. Both use dilated convolutions with different dilation rates to adjust the receptive fields.

**Figure 5.** The SRFEM structure.

**Figure 6.** The PRFEM structure.

Different from normal convolution, in dilated convolution the convolution kernel values are separated by fixed intervals, which increases the size of the perceived area without increasing the number of parameters [45]. If *x*(*m*, *n*) is the input of the dilated convolution, then its output *y*(*m*, *n*) is defined as follows:

$$y(m, n) = \sum\_{i=1}^{M} \sum\_{j=1}^{N} x(m + r \times i, n + r \times j) w(i, j) \tag{4}$$

where *M* and *N* denote the size of the convolution kernel (the normal convolution kernel is of size *M* = 3, *N* = 3), *w*(*i*, *j*) is a specific parameter of the convolution kernel, and *r* denotes the dilated convolution sampling rate (i.e., the number of zeros between non-zero values in the convolution kernel). Different values of *r* can be set to obtain corresponding receptive fields. When *r* = 1, the receptive field is 3 × 3, and when *r* = 2 and *r* = 3, the receptive field is expanded to 5 × 5 and 7 × 7, respectively. The amount of computation is always the same as the normal convolution of *M* = 3, *N* = 3. This operation is often used to expand the receptive fields of the network while preserving the spatial resolution of the feature map.

For the dilated convolution of *k* × *k*, the formulae for calculation of the equivalent receptive field (*RF*) and resolution (*H*) of the output feature map are the following:

$$RF = (r-1)(k-1) + k \tag{5}$$

$$H = \frac{h + 2p - RF}{s} + 1\tag{6}$$

where *p*, *h*, and *s* denote the padding size, the input feature map resolution, and the convolution stride, respectively.
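Formulas (5) and (6) translate directly into code. The following is an illustrative Python sketch (the paper provides no reference implementation, and the function names are ours):

```python
def receptive_field(k: int, r: int) -> int:
    """Equivalent receptive field of a k x k convolution with dilation rate r (formula (5))."""
    return (r - 1) * (k - 1) + k

def output_resolution(h: int, p: int, s: int, rf: int) -> int:
    """Output resolution of the feature map (formula (6)), using integer division."""
    return (h + 2 * p - rf) // s + 1

# A 3 x 3 kernel with dilation rates 1, 2, and 3 yields 3 x 3, 5 x 5, and 7 x 7 receptive fields.
for r, expected in [(1, 3), (2, 5), (3, 7)]:
    assert receptive_field(3, r) == expected

# With padding p = r and stride 1, the spatial resolution is preserved.
for r in (1, 3, 5):
    assert output_resolution(512, r, 1, receptive_field(3, r)) == 512
```

This also confirms the padding choice *P* = (1, 3, 5) used later for the PRFEM branches: setting the padding equal to the dilation rate keeps the feature map resolution unchanged.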

#### 3.3.1. Series Receptive Field Enhancement Module (SRFEM)

In CT images, tumor detection is prone to the interference of surrounding normal tissues, especially for tumors with small differences in gray levels [49]. The objective of SRFEM is to enlarge the effective receptive field, which helps mitigate the influence of non-lesion regions and emphasize the tumor targets. We also took into account the issue of losing details due to sparsely sampled dilated convolutions, which is more prominent when multiple consecutive dilated convolutions are applied [24].

As shown in Figure 5, SRFEM uses three dilated convolutions with a 3 × 3 convolution kernel and a shortcut connection to form a residual structure, where the dilation rates of the dilated convolutions are 1, 3, and 5, respectively.

Let the given input feature be *x*. Then, the SRFEM output is expressed as follows:

$$y = \mathrm{SiLU}\left(x + Conv\_3^5\left(Conv\_3^3\left(Conv\_3^1(x)\right)\right)\right) \tag{7}$$

where *Conv*<sub>3</sub><sup>1</sup>, *Conv*<sub>3</sub><sup>3</sup>, and *Conv*<sub>3</sub><sup>5</sup> denote 3 × 3 convolutions with dilation rates of 1, 3, and 5, respectively. *Conv*<sub>3</sub><sup>5</sup> applies BN, while both *Conv*<sub>3</sub><sup>1</sup> and *Conv*<sub>3</sub><sup>3</sup> apply BN and SiLU. *Conv*<sub>3</sub><sup>5</sup>(*Conv*<sub>3</sub><sup>3</sup>(*Conv*<sub>3</sub><sup>1</sup>(*x*))) aims to obtain a feature map with a sufficiently large receptive field, which is added to the shortcut connection to allow stacking deeper networks. The fused features pass through a SiLU activation function to produce *y*. Compared to the input features, the number of channels of *y* remains unchanged.
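As a rough check of the receptive-field growth, the effective receptive field of the SRFEM cascade can be estimated by composing formula (5) over the three stride-1 layers. This is a back-of-the-envelope sketch under our own naming; it ignores BN, SiLU, and the shortcut connection, which do not change the receptive field:

```python
def receptive_field(k: int, r: int) -> int:
    # Equivalent receptive field of a single dilated convolution (formula (5)).
    return (r - 1) * (k - 1) + k

def stacked_receptive_field(kernel: int, rates) -> int:
    # With stride 1, each additional layer enlarges the overall
    # receptive field by (RF_i - 1).
    rf = 1
    for r in rates:
        rf += receptive_field(kernel, r) - 1
    return rf

# The SRFEM cascade of 3 x 3 dilated convolutions with rates (1, 3, 5):
assert stacked_receptive_field(3, (1, 3, 5)) == 19
```

So the three cascaded layers see a 19 × 19 region of the input feature map, far larger than the 7 × 7 region a single *r* = 3 convolution would cover.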

#### 3.3.2. Parallel Receptive Field Enhancement Module (PRFEM)

PRFEM aims to construct a multi-branch structure that extracts spatial features under different receptive fields and then stitches these features together to obtain a complete expression of the image. Chen et al. [38] first used dilated convolution to build a spatial pyramid module, called ASPP, in DeeplabV2. ASPP is a multi-branch structure consisting of four parallel 3 × 3 convolution branches, whose corresponding dilation rates are 6, 12, 18, and 24, respectively, allowing it to capture richer semantic information. However, when the dilation rate is too large, the acquisition of local semantic information is compromised [50].

The inspiration for PRFEM comes from DeeplabV2. The difference lies in that PRFEM is used to generate receptive fields that adapt to tumors of different sizes, as shown in Figure 7. More specifically, PRFEM consists of three parallel 3 × 3 convolution branches with different dilation rates, one 1 × 1 convolution branch, and one identity branch. First, in each dilated convolution branch, a 1 × 1 convolution reduces the channel number to a quarter of that of the input feature map, ensuring an even distribution of information across the scales. Then, the 1 × 1 convolution branch captures associations between image details and enhances position information. For example, for dilated convolutions with a 3 × 3 kernel and dilation rates *R* = (1, 3, 5), the corresponding padding is set to *P* = (1, 3, 5), so that the resolution of the feature map remains unchanged, as per formula (6). The sampling results from the different branches are concatenated along the channel dimension to obtain a multi-scale information representation. Finally, the identity connection is used to improve gradient propagation and reduce the training difficulty. After each convolutional layer, BN and SiLU are applied.

**Figure 7.** Tumor targets under different receptive fields. Matching small receptive fields with large tumors may lead to inaccurate results for network classification (**bottom-left image**), and matching large receptive fields with small tumors may cause the network to focus more on background information and ignore small-sized tumors (**top-right image**).
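The channel bookkeeping implied by this description can be sketched as follows. This is our reading of the branch widths (not stated explicitly by the authors): the three dilated branches and the 1 × 1 branch each carry a quarter of the input channels, so their concatenation matches the identity branch for the element-wise addition:

```python
def prfem_output_channels(c_in: int) -> int:
    # Three dilated 3x3 branches and one 1x1 branch, each reduced to
    # c_in // 4 channels, are concatenated along the channel axis; the
    # identity branch is then added element-wise, which requires the
    # concatenated width to equal the input width.
    branch = c_in // 4
    concat = 3 * branch + branch
    assert concat == c_in, "channel counts must line up for the identity addition"
    return concat

assert prfem_output_channels(256) == 256
```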

#### **4. Experiments and Results**

#### *4.1. Dataset and Evaluation Metrics*

We randomly sampled 2324 CT images (1137 containing adenocarcinoma tumors and 1187 containing small cell carcinoma tumors) from the CT data provided by Lung-PET-CT-Dx [51] for the training, validation, and testing of the proposed model. These images were collected retrospectively from suspected lung cancer patients. The category and location information for each tumor were annotated by five experienced radiologists using the LabelImg tool. The CT images provided in the dataset are in the Digital Imaging and Communications in Medicine (DICOM) format. We pre-processed the DICOM CT images to enable their use in the proposed ELCT-YOLO model. The image pre-processing flow is illustrated in Figure 8.

**Figure 8.** The image pre-processing operation flow.

First, we read the DICOM files to obtain the image data. Next, we used a Reshape operation to adjust the coordinate order and generate a new matrix. Then, we normalized the pixel values by subtracting the lower window bound from the original pixel values and dividing by the window width to improve contrast and brightness. After that, we used pixel mapping to map the gray values to unsigned integer pixel values between 0 and 255. Through these steps, we adjusted the pixel values and generated the corresponding PNG images.
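The normalization and pixel-mapping steps can be sketched in Python as follows. This is an illustrative sketch: the window level and width shown are common lung-window values, not ones reported by the authors, and the function name is ours:

```python
def window_to_uint8(pixels, level: float, width: float):
    """Map raw CT intensities to 0-255 via a window level/width
    (illustrative sketch of the normalization and pixel-mapping steps)."""
    low = level - width / 2.0
    out = []
    for v in pixels:
        x = (v - low) / width            # normalize into [0, 1] within the window
        x = min(max(x, 0.0), 1.0)        # clip values outside the window
        out.append(int(round(x * 255)))  # map to an unsigned 8-bit pixel value
    return out

# A lung window of level -600 HU and width 1500 HU is a common choice.
mapped = window_to_uint8([-1350, -600, 150], level=-600, width=1500)
assert mapped == [0, 128, 255]
```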

The 2324 CT images were split into training, validation, and test sets at a ratio of 6:2:2. The choice of this ratio is based on the size of the utilized dataset, taking into account the experience of previous researchers. Another common ratio is 8:1:1. However, using an 8:1:1 ratio in our case would result in an insufficient number of samples in the validation and testing sets, which may not fully reflect the model's generalization ability on real-world data. Additionally, if the number of samples in the testing set is too small, the evaluation results of the model may be affected by randomness, leading to unstable evaluation results. Therefore, we chose the 6:2:2 ratio.
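A 6:2:2 split of the 2324 images can be sketched as follows (the authors do not specify their exact splitting code; the seed and function are our illustrative choices):

```python
import random

def split_indices(n: int, ratios=(0.6, 0.2, 0.2), seed: int = 0):
    """Randomly split n sample indices into train/val/test at the given ratios."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(2324)
assert len(train) == 1394 and len(val) == 464 and len(test) == 466
```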

For performance evaluation of the models compared with respect to tumor detection, we used common evaluation metrics, such as *mAP*@0.5: the mean average precision (IoU = 0.5), *precision* (*P*), and *recall* (*R*), defined as follows:

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

$$AP = \int\_0^1 PdR\tag{10}$$

$$mAP = \frac{\sum\_{i=1}^{N} AP\_i}{N} \tag{11}$$

where *TP*, *FP*, and *FN* denote the numbers of correctly detected, incorrectly detected, and missed tumor cases present in the images, respectively. *mAP* is obtained by averaging the corresponding *AP* values over the categories (in our case, *N* = 2 represents the two considered tumor categories of adenocarcinoma and small cell carcinoma).
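Formulas (8), (9), and (11) translate directly into code. The following is a minimal sketch with hypothetical counts (the *AP* integral of formula (10) is omitted, as it is computed over the full precision-recall curve):

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)               # formula (8)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)               # formula (9)

def mean_ap(ap_values) -> float:
    # formula (11): average the per-category AP (here N = 2 categories).
    return sum(ap_values) / len(ap_values)

# Hypothetical counts for illustration:
assert precision(90, 10) == 0.9
assert recall(90, 30) == 0.75
assert abs(mean_ap([0.96, 0.98]) - 0.97) < 1e-9
```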

#### *4.2. Model Training*

ELCT-YOLO is based on the implementation of open-source YOLOv7. Training of the model was conducted on a Linux system running Ubuntu 20.04.1, utilizing an RTX2080Ti GPU for accelerated computation. The model training utilized a batch size of 16, and an input image size of 512 × 512 pixels was specified. The stochastic gradient descent (SGD) optimizer was adopted in the model training, with the initial learning rate and momentum default value being 0.01 and 0.937, respectively. The following adjustment strategy was used for the learning rate of each training round:

$$lf = 1 - \frac{1}{2}(1 - lrf) \times \left(1 - \cos\frac{i \times \pi}{epoch}\right) \tag{12}$$

where *i* denotes the *i*-th round, *lrf* denotes the final OneCycleLR learning rate multiplication factor, which is set to 0.1, *lf* denotes the multiplier used to adjust the learning rate, and *epoch* denotes the total number of training rounds. We used mosaic augmentation to load images and the corresponding labels. In addition, we did not load the weights trained by YOLOv7 on the MS COCO dataset, because of the large domain gap between ordinary natural images and medical images; such transfer did not yield the desired results [52]. To minimize the impact of randomness on the evaluation results, we divided the dataset into five equally sized and mutually exclusive parts, repeated the experiments five times, and selected the highest peak of the average value as the final result. As Figure 9 illustrates, the proposed ELCT-YOLO model achieved stable convergence after 120 rounds of training.

**Figure 9.** The mAP variation curves during the ELCT-YOLO training.
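The cosine schedule of formula (12) can be sketched as follows (assuming *epoch* is the total number of training rounds; the multiplier then decays from 1 at the start of training to *lrf* = 0.1 at the final round):

```python
import math

def lf(i: int, epochs: int, lrf: float = 0.1) -> float:
    """Cosine learning-rate multiplier of formula (12)."""
    return 1 - 0.5 * (1 - lrf) * (1 - math.cos(i * math.pi / epochs))

assert lf(0, 120) == 1.0                     # full learning rate at round 0
assert abs(lf(120, 120) - 0.1) < 1e-9        # decays to lrf at the final round
assert abs(lf(60, 120) - 0.55) < 1e-9        # halfway point: (1 + lrf) / 2
```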

The loss function (*L*) of the ELCT-YOLO model, used to calculate the loss between the predicted boxes and the labels of the matched grids, was the following:

$$L = \lambda\_1 \times L\_{obj} + \lambda\_2 \times L\_{cls} + \lambda\_3 \times L\_{box} \tag{13}$$

where *L*<sub>obj</sub>, *L*<sub>cls</sub>, and *L*<sub>box</sub> denote the objectness loss, classification loss, and coordinate (bounding-box regression) loss, respectively. The values of *λ*<sub>1</sub>, *λ*<sub>2</sub>, and *λ*<sub>3</sub> are 0.7, 0.3, and 0.05, respectively. *L*<sub>obj</sub> and *L*<sub>cls</sub> use binary cross-entropy to calculate the objectness and classification probability losses, while *L*<sub>box</sub> uses the Complete Intersection over Union (CIoU) to calculate the regression loss of the bounding boxes [53]. The CIoU loss function considers the overlap area, the center distance, and the aspect ratio, the last through a penalty term, and is expressed as follows:

$$L\_{CIoU} = 1 - IoU + \frac{\rho^2\left(b, b^{gt}\right)}{c^2} + \alpha\nu \tag{14}$$

where *IoU* represents the intersection over union between the predicted box *b* and the ground truth box *b*<sup>gt</sup>, *ρ*<sup>2</sup>(*b*, *b*<sup>gt</sup>) represents the squared Euclidean distance between the center points of the predicted box and the ground truth box [54], *c* denotes the diagonal length of the smallest box enclosing both boxes, and *αν* is the penalty term that ensures that the width and height of the predicted box quickly approach those of the ground truth box. The values of *α* and *ν* are calculated as follows:

$$\alpha = \frac{\nu}{(1 - IoU) + \nu} \tag{15}$$

$$\nu = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{16}$$

where *w* and *h* denote the width and height of the predicted bounding box, respectively, while *w*<sup>gt</sup> and *h*<sup>gt</sup> denote the width and height of the ground truth bounding box, respectively.
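Formulas (14)–(16) can be sketched for axis-aligned boxes as follows. This is an illustrative sketch: the (x1, y1, x2, y2) box format and the guard for identical boxes are our assumptions, and *c* is taken as the diagonal of the smallest enclosing box:

```python
import math

def ciou_loss(b, bgt):
    """CIoU loss for axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection over union.
    ix1, iy1 = max(b[0], bgt[0]), max(b[1], bgt[1])
    ix2, iy2 = min(b[2], bgt[2]), min(b[3], bgt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    iou = inter / (area(b) + area(bgt) - inter)
    # Squared center distance over squared diagonal of the smallest enclosing box.
    cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    gx, gy = (bgt[0] + bgt[2]) / 2, (bgt[1] + bgt[3]) / 2
    rho2 = (cx - gx) ** 2 + (cy - gy) ** 2
    c2 = ((max(b[2], bgt[2]) - min(b[0], bgt[0])) ** 2
          + (max(b[3], bgt[3]) - min(b[1], bgt[1])) ** 2)
    # Aspect-ratio penalty term (formulas (15)-(16)).
    v = (4 / math.pi ** 2) * (math.atan((bgt[2] - bgt[0]) / (bgt[3] - bgt[1]))
                              - math.atan((b[2] - b[0]) / (b[3] - b[1]))) ** 2
    alpha = v / ((1 - iou) + v) if iou < 1 else 0.0
    return 1 - iou + rho2 / c2 + alpha * v

# Identical boxes give zero loss; disjoint or shifted boxes give a positive loss.
assert ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)) == 0.0
assert ciou_loss((0, 0, 10, 10), (5, 5, 15, 15)) > 0
```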

The convergence curves of the confidence loss, classification loss, and regression loss during the process of training and validation of the ELCT-YOLO model are presented in Figures 10 and 11. It can be observed that with more training iterations, the losses of ELCT-YOLO on the training set continuously decrease, indicating the model's ability to fit the distribution of the training data. On the other hand, the losses on the validation set reflect the detection model's good generalization ability on unknown data.

**Figure 10.** The loss curves during the ELCT-YOLO training.

**Figure 11.** The loss curves during the ELCT-YOLO validation.

*4.3. Comparison of ELCT-YOLO with State-of-the-Art Models*

We compared the proposed ELCT-YOLO model with six state-of-the-art models, namely YOLOv3, YOLOv5, YOLOv7-tiny, YOLOv8, SSD, and Faster R-CNN. The obtained results are shown in Table 1.

**Table 1.** Performance comparison of models used for lung tumor detection in CT images (the best value achieved among the models for a particular metric is shown in **bold**).


As evident from Table 1, the proposed ELCT-YOLO model is the winner based on *recall*, while also having the smallest size among the models. According to *mAP*, it takes second place by closely following the winner (YOLOv8, which was also trained for 120 epochs with a batch size of 16) and scoring only 0.003 points less, but its size is almost half the YOLOv8 size. Regarding the achieved *precision*, ELCT-YOLO does not perform so well; it occupies fourth place by scoring 0.044 points less than the winner (YOLOv8). Regarding the FPS, the proposed ELCT-YOLO model takes second place, closely following the winner (YOLOv7-tiny) by processing only 2 frames less per second.

As mentioned in the Introduction, devices used in real medical scenarios are often resource-constrained, so smaller-size models are more appropriate for use. ELCT-YOLO achieves a good balance between accuracy and efficiency in tumor detection. In addition, in terms of *recall*, ELCT-YOLO achieved best results among the models compared, which is conducive to tumor screening, as the higher the *recall* value, the fewer tumors will be missed, and the detector will find every potential lesion location.

Figure 12 illustrates sample images from the test set along with their corresponding detection results. It is evident from the figure that ELCT-YOLO performs well in detecting tumors of varying scales in CT images.

**Figure 12.** Sample images, extracted from the test set, with lung tumor detection results (label A represents adenocarcinoma, whereas label B represents small cell carcinoma). For irregular tumors with patchy heightened shadows or tumors with obvious pleural traction, ELCT-YOLO can effectively reduce the interference of background information and distinguish the tumor target from the background.

#### *4.4. Ablation Study of ELCT-YOLO*

To further evaluate the impact of using the designed DENeck and CRS structures, and of integrating the SPPF module [25] into YOLOv7-tiny, we performed an ablation study on these performance improvement components, the results of which are shown in Table 2.


**Table 2.** Results of the ablation study performed on YOLOv7-tiny performance improvement components, used by ELCT-YOLO (the best mAP and size values achieved are shown in **bold**).

First of all, the SPPF module of YOLOv5 [25], which we introduced at the top of the original YOLOv7-tiny backbone network, did not lead to a significant improvement in the mean average precision (*mAP*), but it reduced the model size by 7%. Then, using the designed DENeck structure alone improved the *mAP* from 0.955 to 0.968, with a model size equal to that obtained when using the SPPF module alone. This confirmed our belief that, in the case of medical images, and particularly CT images, the precision of lesion detection can be improved by reducing confusing details. Using the designed CRS structure alone did not provide better results than using DENeck alone, but it improved the *mAP* from 0.955 to 0.966 compared to the original YOLOv7-tiny model, at the cost of an increased model size. This is because the shallow prediction branch needs effective global information to distinguish tumor regions from the background. When we integrated all three components, the *mAP* value exceeded that of applying any one of these components alone, while the model size remained very close to the minimum reached when using only SPPF or DENeck, which confirms the rationality of the design of the proposed ELCT-YOLO model.

#### *4.5. CRS Study*

The designed CRS structure, described in Section 3, consists of two modules: SRFEM and PRFEM. In order to achieve a more effective receptive field, we studied the use of different cascade schemes, the results of which are shown in Table 3, where SRFEM and PRFEM are denoted as S and P, respectively. The CRS study was based on *precision*, *recall*, and the *mAP*.

**Table 3.** Performance comparison of different cascade schemes (the best value achieved among the schemes for a particular metric is shown in **bold**).


As can be seen from Table 3, using different cascade schemes (different S-P combinations) led to different values of the evaluation metrics used for the comparison. The PPP scheme performed worst according to all metrics. Although PRFEM can capture multi-scale information from CT images to improve the ability to detect tumors, this scheme may lack sufficiently large receptive fields in low-contrast scenes, which are key to improving the detection performance. Overall, SSS is the best-performing scheme based on two of the evaluation metrics, i.e., *recall* and the *mAP*, reaching 0.957 and 0.974, respectively. The use of SSS can effectively enhance the receptive field of the shallow branches, thereby improving the detection performance. Thus, this scheme was utilized by the proposed ELCT-YOLO model in the performance comparison with the state-of-the-art models (cf. Table 1).

In addition, we verified the effect of using different dilation rates in the SSS cascade scheme, in order to further improve the feature map quality. We considered three series of dilation rates: a Natural numbered Series (NS), an Odd numbered Series (OS), and an Even numbered Series (ES). In Table 4, *RNS* = (1, 2, 3), *ROS* = (1, 3, 5), and *RES* = (2, 4, 6) represent the values of these three series, respectively. As mentioned in Section 2, the sparse sampling of dilated convolutions can easily lead to a loss of details; choosing an appropriate sampling rate is therefore a way to alleviate the gridding effect. The comparison results in Table 4 show that *ROS* = (1, 3, 5) outperforms the other two schemes according to all three evaluation metrics.

**Table 4.** Performance comparison of using different dilation rates in the SSS cascade scheme (the best value achieved among the dilation rates for a particular metric is shown in **bold**).


The sampling positions of the three consecutive dilated convolutions are visualized in Figure 13. It can be seen intuitively that when the dilation rate is *RNS* = (1, 2, 3), the SSS module obtains only a small receptive field and cannot capture global information; when the dilation rate is *RES* = (2, 4, 6), the receptive field increases, but the sampled feature information is not contiguous, which leads to a loss of details. The combination *ROS* = (1, 3, 5) covers a larger receptive field without losing edge information. This is consistent with our experimental results.

**Figure 13.** The effective receptive fields generated by different dilation rates in the SSS cascade scheme: (**a**) *RNS* = (1, 2, 3); (**b**) *RES* = (2, 4, 6); (**c**) *ROS* = (1, 3, 5). The colors depicted at distinct positions within the graphs indicate the frequency at which each location was utilized in computing the receptive field center.
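The sampling-gap behavior visualized in Figure 13 can also be checked with a simple 1-D simulation (our own sketch: we enumerate the input offsets reachable by cascading three dilated 3-tap convolutions and test whether the covered span contains holes, i.e., gridding):

```python
def sampled_offsets(rates, k: int = 3):
    """1-D input offsets reachable by cascading k-tap dilated convolutions
    with the given dilation rates."""
    taps = range(-(k // 2), k // 2 + 1)
    offsets = {0}
    for r in rates:
        offsets = {o + t * r for o in offsets for t in taps}
    return offsets

def has_gaps(offsets) -> bool:
    # Gaps inside the covered span mean some positions are never sampled.
    lo, hi = min(offsets), max(offsets)
    return any(p not in offsets for p in range(lo, hi + 1))

assert not has_gaps(sampled_offsets((1, 2, 3)))   # dense, but only a 13-wide span
assert has_gaps(sampled_offsets((2, 4, 6)))       # wider span, but gridding gaps
assert not has_gaps(sampled_offsets((1, 3, 5)))   # 19-wide span with no gaps
```

The even series (2, 4, 6) only ever samples even offsets, so half the positions are never seen, matching the "not contiguous" behavior described above; (1, 3, 5) covers the widest span without holes.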

#### *4.6. DENeck Study*

To verify the effectiveness of DENeck, we applied it to the YOLOv7-tiny model along with traditional feature fusion methods (each applied separately from the rest). The comparison results are shown in Table 5. The main focus of this experiment was to compare the impact of various feature fusion methods, based on different topological structures, on the detection performance. The proposed DENeck module achieved the best detection performance among the compared methods. Comparing FPN with PANet and BiFPN, we found that the latter two outperform FPN. This is because the feature fusion in FPN is insufficient, making it difficult to extract precise localization information of tumors.

Furthermore, in order to demonstrate the generalization ability of the designed DENeck structure under networks of different scales, we evaluated its tumor detection performance in models with different depths. The obtained comparison results are shown in Table 6. We used three basic YOLOv7 networks: YOLOv7-tiny, YOLOv7, and YOLOv7x. The depth of these networks increases in the stated order.


**Table 5.** Comparisons between DENeck and traditional feature fusion methods (the best value achieved among the methods for a particular metric is shown in **bold**).

**Table 6.** Performance comparison of combining different scale networks with the designed DENeck structure (the best value achieved among the networks for a particular metric is shown in **bold**).


Table 6 shows that increasing the model scale improves the *mAP*, but only by 0.006 points. This shows that, while the DENeck structure can be utilized with deepened backbones, it is more effective on lightweight networks, where it helps reduce the model size.

#### **5. Conclusions and Future Work**

This paper has proposed an efficient one-stage ELCT-YOLO model based on improvements introduced into the YOLOv7-tiny model, for lung tumor detection in CT images. Unlike existing neck structures, the proposed model aims to obtain multi-scale tumor information from the images. Firstly, a novel Decoupled Neck (DENeck) structure has been described for use in ELCT-YOLO to reduce semantic conflicts. More specifically, the model's neck was divided into high-semantic layers and low-semantic layers, in order to generate clearer feature representations by decoupling the fusion between these two semantic types. The conducted experiments proved that DENeck can be integrated well into backbone networks of different depths, while also showing outstanding robustness. Secondly, a novel Cascaded Refinement Scheme (CRS), configured at the lowest layer of the decoupling network, has been described for use in ELCT-YOLO in order to capture tumor features under different receptive fields. The optimal CRS structure was determined through another set of experiments. In addition, the problem of sparse sampling caused by dilated convolution has been considered and the effect of different receptive field combinations on the cascaded modules has been compared by means of experiments. Thirdly, it has been proposed to integrate the SPPF module of YOLOv5 at the top of the original YOLOv7-tiny backbone network in order to extract important context features, further improve the model's operational speed, and enrich the representation ability of feature maps. Extensive experiments, conducted on CT data provided by Lung-PET-CT-Dx, demonstrated the effectiveness and robustness of the proposed ELCT-YOLO model for lung tumor detection.

The presented study has focused on addressing the multi-scale issue of tumor detection using a lightweight model. The model still needs further optimization to reduce both the number of parameters and the computational complexity. As a next step of the future research, we will use network distillation techniques and existing lightweight convolutional modules to construct a simpler model, aimed at reducing the inference latency and the number of parameters. In addition, the study presented in this paper has focused only on tumor detection tasks based on CT images. In fact, some emerging technologies, such as super-wideband microwave reflection measurement, are more user-friendly and cost-effective than traditional detection techniques such as the CT-based ones [55]. In the future, we will also study such emerging technologies for lung cancer detection in more depth.

**Author Contributions:** Conceptualization, J.Z. and Z.J.; methodology, X.Z. (Xueji Zhang); validation, I.G. and H.Z.; formal analysis, J.Z. and X.Z. (Xinyi Zeng); writing—original draft preparation, J.Z.; writing—review and editing, I.G.; supervision, J.L.; project administration, X.Z. (Xueji Zhang) and Z.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This publication has emanated from joint research conducted with the financial support of the S&T Major Project of the Science and Technology Ministry of China under the Grant No. 2017YFE0135700 and the Bulgarian National Science Fund (BNSF) under the Grant No. KΠ-06-ИΠ-KИTAЙ/1 (KP-06-IP-CHINA/1).

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

**Qiyan Wang 1,† and Yuanyuan Jiang 2,\*,†**


**Abstract:** Leisure time is crucial for personal development and leisure consumption. Accurate prediction of leisure time and analysis of its influencing factors can help increase personal leisure time. We predict leisure time and analyze its key influencing factors according to survey data of Beijing residents' time allocation in 2011, 2016, and 2021, with an effective sample size of 3356. A Light Gradient Boosting Machine (LightGBM) model is utilized to classify and predict leisure time, and the SHapley Additive exPlanation (SHAP) approach is utilized to conduct feature importance analysis and influence mechanism analysis of influencing factors from four perspectives: time allocation, demographics, occupation, and family characteristics. The results verify that LightGBM effectively predicts personal leisure time, with the test set's accuracy, recall, and F1 values all being 0.85 and the AUC value reaching 0.91. The results of SHAP highlight that work/study time within the system is the main constraint on leisure time. Demographic factors, such as gender and age, are also of great significance for leisure time. Occupational and family heterogeneity exist in leisure time as well. The results contribute to the government improving the public holiday system, companies designing personalized leisure products for users with different leisure characteristics, and residents understanding and independently increasing their leisure time.

**Keywords:** data analysis; classification; decision trees; LightGBM; SHAP; leisure time; influencing factors; time allocation

**MSC:** 68T09

#### **1. Introduction**

Individuals have an increasing desire for a better quality of life as the economy develops and tangible prosperity grows. Leisure has steadily become a common way of life and an integral component of individuals' aspirations for a fulfilling existence [1]. Leisure time is an indispensable prerequisite for achieving people's independence and all-round development [2,3]. People are eager for a higher level of material cultural life, a richer leisure and entertainment life, and realizing personal ideals in the fields of politics, economics, culture, and so on [4]. As a result, it is imperative to ensure that all residents have enough leisure time to enrich themselves [5] and succeed in their endeavors. In addition, ample leisure time is required for people to participate in leisure activities and enjoy leisure life [6]. Leisure activities provide opportunities for individuals to engage in leisure consumption, which, in turn, can stimulate innovation in consumption patterns and drive economic growth. Encouraging leisure consumption necessitates the provision of leisure time, as leisure time is a prerequisite for leisure consumption activities [7,8].

However, it appears that "money" and "leisure" have become diametrically opposed existences [9]. While material life has been gradually enriched, "time poverty" exists [10], which is particularly noticeable in China. The "996 working system" sparked a heated debate on the Internet in 2019 [11], with some related topics on the Sina Weibo platform being viewed more than 100 million times. In March of this year, Visual Capitalist released a survey report indicating that China is among the ten countries with the lowest total number of paid leave days, which includes public holidays and paid vacation days [12]. China's current vacation system not only offers a limited number of paid vacation days but also suffers from a low implementation rate. A survey conducted by the Ministry of Human Resources and Social Security of China in 2015 revealed that the implementation rate of paid leave in China is approximately 50%. Furthermore, there exists a certain dissatisfaction among the public towards the current vacation system. In 2019, the Leisure Economics Research Center of Renmin University of China conducted a survey on the satisfaction of Beijing residents with the vacation system. The results showed that 46% of respondents expressed dissatisfaction with the leave-in-lieu policy, and over 50% were unhappy with the current vacation system. These findings suggest that the current vacation system in China is at odds with people's exponentially growing desire for leisure. To address this problem, it is crucial to delve into the various factors that impact leisure time and identify ways to enhance the vacation system to better fulfill people's leisure requirements. Achieving a harmonious balance between work and leisure can yield benefits for both individuals and society, such as improved productivity [3] and overall well-being [13].

**Citation:** Wang, Q.; Jiang, Y. Leisure Time Prediction and Influencing Factors Analysis Based on LightGBM and SHAP. *Mathematics* **2023**, *11*, 2371. https://doi.org/10.3390/math11102371

Academic Editors: Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina

Received: 3 April 2023; Revised: 8 May 2023; Accepted: 10 May 2023; Published: 19 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Based on this situation, we should concentrate on social reality, explore the causes of the conflict between material abundance and time poverty, and analyze how to provide residents with as much leisure and freedom as possible while maintaining the smooth progress of the comprehensive pursuit of a prosperous society and the construction of a better life. Thus, it is highly important to examine the factors that impact changes in residents' leisure time.

This paper sheds light on the dynamics of leisure time and highlights key factors that affect individuals' leisure time from the perspective of machine learning. Understanding these factors can benefit not only individuals in making informed decisions about achieving a healthy work–life balance, but it can also provide valuable insights for markets to gain insight into consumer needs and for governments to develop policies that support the development of individuals, businesses, and the economy. The main contributions of this paper are as follows. The first is to apply a Light Gradient Boosting Machine (LightGBM) model and the SHapley Additive exPlanation (SHAP) approach based on game theory to analyze the factors influencing leisure time from the standpoint of the nonlinear relationship. The extant literature is primarily based on linear models [14] and lacks the exploration of nonlinear relationships. Second, we conduct thorough data analysis on primary survey data collected in 2011, 2016, and 2021, while most of the previous studies used secondary data. Third, as far as we know, this paper is the first to study the changes in leisure time based on time allocation, demographics, occupation, and family characteristics, as compared to previous research that explored the correlation between a specific factor and leisure time [3,14,15]. Last but not least, based on the conclusions of this paper, we discuss feasible measures to increase and make full use of personal leisure time from three aspects: government policy system, market product supply, and personal leisure demand.

The remainder of this paper is structured as follows. Section 2 provides an overview of relevant research on defining leisure time as well as its macro- and micro-influencing factors. Section 3 describes the LightGBM model and the SHAP approach in summary. Section 4 introduces the data sources. Section 5 demonstrates the LightGBM model construction process and evaluation metrics. Section 6 presents the empirical results of SHAP and delves into the effects of various factors on the changes in leisure time, as well as the interaction effects between the factors. Finally, Section 7 concludes and discusses some measures for increasing personal leisure time.

#### **2. Literature Review**

#### *2.1. Definition of Leisure Time*

Existing research has not provided a distinct and consistent definition of leisure time. Some scholars consider leisure time as the time remaining after subtracting work time, housework time, and personal care time from total time [16–18]. Some emphasize the importance of "free choice" in leisure time [19]. Leisure time, according to Leitner M.J. and Leitner S.F. (2004), is free or unrestricted time that excludes the time required for work and personal life (such as sleep) [20]. Žumárová (2015) defines leisure time as a period in which individuals can freely choose activities and have a pleasant experience [21]. Some scholars place greater emphasis on personal subjective emotions and believe that leisure time is beneficial to physical and mental health [22–24], so leisure time differs for everyone [25]. According to Seibel et al. (2020), leisure time is time spent away from employment and is subject to personal subjective decision-making [26]. Finally, some scholars define leisure time in terms of the activities undertaken, believing that leisure time is the time to engage in activities that we can choose whether or not to do, and that no one else can do for us [27,28].

According to existing studies, leisure time has the following attributes. First, it excludes time spent on work, housework, and personal care. Second, individuals are free to engage in leisure pursuits during leisure time. Third, leisure time can bring pleasant experiences. This paper defines leisure time based on four categories of total time; that is, daily time is divided into four categories based on the attributes of activities: work/study time (including round-trip time), the essential time for personal life (including the time for sleep, meals, personal hygiene, healthcare, and other necessary activities), housework time, and leisure time. Leisure time consists primarily of time spent acquiring cultural and scientific knowledge, reading newspapers, watching television, and engaging in a variety of other leisure activities. The definition in this paper conforms to the three attributes of leisure time mentioned above.

#### *2.2. Micro-Influencing Factors of Leisure Time*

With regards to the **time allocation characteristics,** since the total amount of time remains unchanged, leisure time is squeezed out by work/study time within the system (excluding commuting time and overtime), commuting time, essential time for personal life, and housework time [29–32].

With regards to the **demographic characteristics**, the issue of gender inequality in leisure time has received much attention [33], and whether such inequality exists remains debatable. While some scholars initially argued that gender equality in leisure could exist [34,35], later studies have shown significant gender inequality: a descriptive statistical analysis of survey data on employed men and women in Lithuania verified that men have significantly more leisure time than women [36], and a multilevel regression of data from the International Social Survey Programme found a noteworthy disparity in the quality of leisure time, with women reporting significantly lower levels [37]. In-depth interviews with 12 representative mothers show that women adjust their leisure time based on the preferences of their partners and children [38]. A comparative analysis of survey data from Germany and the UK indicates that, compared with men, women tend to undertake more housework [39]. Age and marital status have also been identified as factors that affect leisure time [40]. Using a leisure participation function, adolescents and retirees are found to have the most leisure time [41], while statistical tests elsewhere suggest that leisure time increases with age in adults [42]. Previous studies have also discovered significant age differences in leisure activity participation [2,43], with the participation rate of leisure activities decreasing with age [44]. Furthermore, educational level has been linked to leisure activities [45,46]; thus, educational level should also be considered in relation to leisure time.

With regards to **occupational characteristics**, occupational characteristics have been found to correlate with leisure activity participation through a review of a series of representative literature [47]. Occupational characteristics can be considered from multiple perspectives. Occupational category, for example, represents the basic characteristics of the occupation according to occupational classification standards [48]. Enterprise ownership and company size (number of employees), as part of the company's own organizational characteristics [49], reflect the environmental characteristics of the occupation. Furthermore, the weekly rest situation reflects the overtime characteristics of the occupation [50]. Consequently, the aforementioned four features may be correlated with leisure time. For instance, it has been found that individuals in different occupational categories show noticeable differences in both leisure time and leisure activities [47,51,52]. However, as far as we are aware, there have been limited studies looking into the effects of all the aforementioned occupational characteristics on leisure time.

With regards to **family characteristics**, household income has always been a factor of concern to scholars [53]. The exact effect of household income on leisure time is a matter of debate. Lee and Bhargava (2004) argue that household income is a determinant of leisure time [40]. A linear regression of survey data from college students shows that leisure time is positively influenced by household income [54]. However, multiple regression results based on Australian Time Use Survey data indicate that household income has no significant effect on leisure time [6]. Additionally, having someone to care for in the home tends to affect leisure time as well [55]. When the number of children in need of care at home increases, women's time to care for children will increase accordingly [56].

#### *2.3. Macro-Influencing Factors of Leisure Time*

Everyone in one area is subject to the same external environment, which comprises macro-influencing factors. Hence, this paper focuses on studying the endogenous influencing factors of leisure time from a micro perspective, that is, concentrating on the effects of residents' personal characteristics on leisure time, with only qualitative analysis of the macro-influencing factors conducted.

**Holiday system**: The holiday system is an institutional precondition that sets the bounds of leisure time. Since China switched from a single day off to a double day off per week on 1 May 1995, Chinese residents' leisure time has grown significantly [57]. In general, the more legal holidays there are, the more leisure time people have. York et al. (2018) find that China's Golden Week holiday system has become an important source of leisure time [58]. The continuous and gradual implementation of paid leave could contribute to further increasing the leisure time of the entire population [59].

**Productivity**: It has been found that scientific and technological progress and productivity development can increase leisure time within a certain period of time [60]. Dridea and Sztruten (2010) believe that leisure time can serve as an indicator reflecting the productivity of a developed society, and the increase in labor productivity will lead to an increase in leisure time [61]. Min and Jin (2010) claim that the remarkable progress in productivity has liberated people from heavy work and resulted in more leisure time [62].

In summary, despite the extensive literature on leisure time, there are still several limitations. First, studies on changes in Chinese residents' leisure time and factors affecting leisure time at the micro level are relatively scarce [63,64]. Second, the current literature lacks primary data and mainly relies on secondary cross-sectional data. Third, most of the literature is based on descriptive statistics or linear models, which limits the ability to explore the nonlinear relationships between features. To address these issues, this paper utilizes real and effective primary survey data gathered from the Beijing Residents' Time Allocation Survey, which was carried out by the Leisure Economy Research Center of Renmin University of China in 2011, 2016, and 2021. In light of the survey data, we explore the changes in the leisure time of Chinese residents and its main influencing factors from multiple perspectives, including time allocation characteristics, demographic characteristics, occupational characteristics, and family characteristics. To evaluate and describe the nonlinear relationship between leisure time and these factors, a LightGBM model and the SHAP approach are employed.

#### **3. Methods**

We utilize LightGBM (Light Gradient Boosting Machine) to classify leisure time into two categories and employ the SHAP (SHapley Additive exPlanation) approach to quantify the effects of the factors influencing leisure time. LightGBM is known for its high efficiency and performance across a range of scenarios [65], including classification, regression, ranking, time-series prediction, and so on [66,67]. SHAP is a game-theory-based method for decomposing model predictions into the aggregate of the SHAP values and a fixed baseline value [68,69], and it is widely used in the explanation of various models [70,71].

#### *3.1. Light Gradient Boosting Machine (LightGBM)*

The Gradient Boosting Decision Tree (GBDT) has long been popular in academia and industry [72]. Based on the GBDT algorithm framework, Microsoft created LightGBM (Light Gradient Boosting Machine), a more efficient and precise gradient boosting algorithm framework [73].

Let $\mathcal{X}^p$ be the input space and $\mathcal{G}$ be the gradient space. Suppose we have a training set consisting of $n$ i.i.d. instances $(x\_1, y\_1), \dots, (x\_n, y\_n)$, where each $x\_i = (x\_{i1}, \dots, x\_{ip})^T$ is a vector and $p$ is the number of features. The predicted value $f(x)$ of GBDT for each instance is the sum of $K$ decision tree models $t\_k(x)$:

$$f(\mathbf{x}) = \sum\_{k=1}^{K} t\_k(\mathbf{x}) \tag{1}$$

GBDT learns an approximation $\hat{f}$ of $f$ by minimizing the expected value of the loss function $L$:

$$\hat{f} = \underset{f}{\text{argmin}} \; E\_{y,x}[L(y, f(x))] \tag{2}$$

When splitting internal nodes, LightGBM employs the Gradient-based One-Side Sampling (GOSS) algorithm instead of the information gain used in GBDT. Denote the negative gradients of the loss function in each iteration as $\{d\_1, \dots, d\_n\}$. First, the data are sorted according to the absolute values of the gradients. The top $a \times 100\%$ of instances constitute set $A$, and a random sample of the remaining instances forms set $B$, whose size is $b \times (1-a) \times n$. Then, the node is split at point $m$ of feature $j$ on the set $A \cup B$ according to the variance gain $\tilde{V}\_j(m)$:

$$\tilde{V}\_{j}(m) = \frac{1}{n} \left( \frac{(\sum\_{x\_i \in A\_l} d\_i + \frac{1-a}{b} \sum\_{x\_i \in B\_l} d\_i)^2}{n\_l^{j}(m)} + \frac{(\sum\_{x\_i \in A\_r} d\_i + \frac{1-a}{b} \sum\_{x\_i \in B\_r} d\_i)^2}{n\_r^{j}(m)} \right) \tag{3}$$

where $A\_l = \{x\_i \in A : x\_{ij} \le m\}$, $A\_r = \{x\_i \in A : x\_{ij} > m\}$, $B\_l = \{x\_i \in B : x\_{ij} \le m\}$, and $B\_r = \{x\_i \in B : x\_{ij} > m\}$.
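As an illustration, the GOSS sampling step and the variance gain of Equation (3) can be sketched in a few lines of NumPy. This is our own didactic sketch, not LightGBM's internal implementation; the function and parameter names are invented for clarity:

```python
import numpy as np

def goss_variance_gain(x_j, d, m, a=0.2, b=0.1, seed=None):
    """Sketch of Eq. (3): estimated variance gain of splitting feature j at
    point m under Gradient-based One-Side Sampling. `x_j` holds one feature's
    values and `d` the negative gradients; names are illustrative only."""
    rng = np.random.default_rng(seed)
    n = len(d)
    order = np.argsort(-np.abs(d))              # sort by |gradient|, descending
    A = order[: int(a * n)]                     # top-a fraction kept exactly
    rest = order[int(a * n):]
    B = rng.choice(rest, size=int(b * (1 - a) * n), replace=False)

    w = (1 - a) / b                             # re-weighting factor for set B
    def split(idx, left):
        mask = (x_j[idx] <= m) if left else (x_j[idx] > m)
        return idx[mask]

    A_l, A_r = split(A, True), split(A, False)
    B_l, B_r = split(B, True), split(B, False)
    n_l, n_r = len(A_l) + len(B_l), len(A_r) + len(B_r)
    if n_l == 0 or n_r == 0:
        return 0.0                              # degenerate split
    left = (d[A_l].sum() + w * d[B_l].sum()) ** 2 / n_l
    right = (d[A_r].sum() + w * d[B_r].sum()) ** 2 / n_r
    return (left + right) / n
```

The weight $(1-a)/b$ compensates for the under-sampling of small-gradient instances so that the gain estimate remains approximately unbiased.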

Through the GOSS algorithm, LightGBM only needs to perform computations on small samples, which greatly saves computation time. In addition, LightGBM further improves computational efficiency through the Histogram algorithm and the Exclusive Feature Bundling (EFB) algorithm. In comparison to eXtreme Gradient Boosting (XGBoost), which computes all objective function gains of each feature at all possible split points based on all instances and then selects the feature and split point with the largest gain [74], LightGBM optimizes computation from three aspects: reducing the quantity of possible split points by the Histogram algorithm, decreasing the sample size by means of the GOSS algorithm, and trimming the feature set through the EFB algorithm [73]. The higher efficiency of LightGBM has been verified by experiments [65,75].
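The three optimizations are exposed as ordinary LightGBM parameters. A minimal configuration that switches them all on might look like the following; the values are illustrative examples, not the tuned settings used in this paper:

```python
# Illustrative LightGBM parameter set covering the three optimizations
# discussed above; values are examples, not this paper's tuned settings.
params = {
    "objective": "binary",
    "boosting_type": "goss",  # Gradient-based One-Side Sampling
    "top_rate": 0.2,          # the `a` fraction kept by |gradient|
    "other_rate": 0.1,        # the `b` fraction sampled from the rest
    "max_bin": 255,           # Histogram algorithm: number of feature bins
    "enable_bundle": True,    # Exclusive Feature Bundling (EFB)
}
```

Such a dictionary would typically be passed to `lightgbm.train` or unpacked into `LGBMClassifier`.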

#### *3.2. SHapley Additive exPlanations (SHAP)*

Although LightGBM has significant benefits in terms of computing efficiency and prediction accuracy, it is essentially a black-box model that can only show the order of importance of features but cannot output the specific impact of features on prediction results. As a consequence, we employ SHapley Additive exPlanations (SHAP) for analysis of the LightGBM results. The SHAP approach is an algorithm framework for the post-hoc explanation of complex black-box models. It can quantify the effects of each feature in shaping the projected outcome [69].

Let $f$ represent the original black-box model to be explained, $g$ the explanation model based on SHAP, $x$ an instance, and $x'$ the simplified input of $x$. There is a mapping relationship between them such that:

$$h\_{\mathbf{x}}(\mathbf{x}') = \mathbf{x} \tag{4}$$

Local methods should ensure that $g(z') \approx f(h\_x(z'))$ whenever $z' \approx x'$. On this basis, an additive model can be used to attribute the original model's prediction:

$$g(z') = \Phi\_0 + \sum\_{j=1}^{P} \Phi\_j z'\_j \tag{5}$$

where $z' \in \{0, 1\}^P$, $P$ is the number of simplified input variables, and $\Phi\_j \in \mathbb{R}$. It can be seen from Equation (5) that SHAP attributes the contribution of feature $j$ to $\Phi\_j$, which is the Shapley value of feature $j$, while $\Phi\_0$ is a constant term.

It should be noted that when using LightGBM for classification, Equation (5) cannot be employed directly; instead, a log-odds transformation must be applied to $g(z')$:

$$\ln \frac{g(z')}{1 - g(z')} = \Phi\_0 + \sum\_{j=1}^{P} \Phi\_j z'\_j \tag{6}$$
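Concretely, Equation (6) says that the base value and the SHAP values sum in log-odds space; the predicted probability is then recovered by inverting the logit. A minimal sketch (the function name is our own):

```python
import math

def shap_logodds_to_prob(phi0, phis):
    """Invert Eq. (6): sum the base value Phi_0 and the SHAP values Phi_j
    in log-odds space, then apply the sigmoid to recover g(z')."""
    log_odds = phi0 + sum(phis)
    return 1.0 / (1.0 + math.exp(-log_odds))
```

With no feature contributions, the prediction is simply the sigmoid of the base value; each positive $\Phi\_j$ pushes the probability above that baseline, each negative one below it.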

Let $F$ denote the set of all features and $S \subseteq F$ a subset. To calculate $\Phi\_j$, the models $f\_{S \cup \{j\}}(x\_{S \cup \{j\}})$ and $f\_S(x\_S)$ should be trained, where $x\_{S \cup \{j\}}$ are the values of the features in $S \cup \{j\}$ and $x\_S$ are the values of the features in $S$. $\Phi\_j$ is then computed as [76,77]:

$$\Phi\_{j} = \sum\_{S \subseteq F \backslash \{j\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!} \left[ f\_{S \cup \{j\}}(\mathbf{x}\_{S \cup \{j\}}) - f\_{S}(\mathbf{x}\_{S}) \right] \tag{7}$$

The complexity of calculating $\Phi\_j$ by Equation (7) is $O(KN2^{|F|})$. To improve computational efficiency, the TreeExplainer method for explaining decision tree models was proposed [78,79], which reduces the complexity to $O(KND^2)$, where $K$ is the number of trees, $N$ is the maximum number of nodes among the trees, and $D$ is the maximum depth of the trees.
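For intuition, Equation (7) can be evaluated exactly by brute force when the feature set is tiny. The sketch below treats the retrained models $f\_S(x\_S)$ as a precomputed table of coalition values, a toy stand-in for actual retraining:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, v):
    """Exact Shapley values via Eq. (7): Phi_j averages the marginal
    contribution v(S ∪ {j}) - v(S) over all subsets S of F \ {j}.
    `v` maps frozenset coalitions to model outputs; illustrative only."""
    F = len(features)
    phi = {}
    for j in features:
        others = [f for f in features if f != j]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                S = frozenset(S)
                # weight |S|! (|F| - |S| - 1)! / |F|!
                weight = (factorial(len(S)) * factorial(F - len(S) - 1)
                          / factorial(F))
                total += weight * (v[S | {j}] - v[S])
        phi[j] = total
    return phi
```

The exponential cost is visible directly: every feature requires a sweep over all $2^{|F|-1}$ subsets, which is exactly what TreeExplainer avoids.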

Lundberg et al. (2018) [77] calculate pairwise interaction effects by extending SHAP values on the basis of the Shapley interaction index in game theory:

$$\Phi\_{i,j} = \sum\_{S \subseteq F \backslash \{i,j\}} \frac{|S|!\,(|F|-|S|-2)!}{2(|F|-1)!} \left[ f\_{S \cup \{i,j\}}(\mathbf{x}\_{S \cup \{i,j\}}) - f\_{S \cup \{i\}}(\mathbf{x}\_{S \cup \{i\}}) - f\_{S \cup \{j\}}(\mathbf{x}\_{S \cup \{j\}}) + f\_{S}(\mathbf{x}\_{S}) \right] \tag{8}$$

#### **4. Data Preparation**

#### *4.1. Data Source and Processing*

The data we analyzed are from the Beijing Residents' Time Allocation Survey, which was conducted in 2011, 2016, and 2021 by the Leisure Economy Research Center of Renmin University of China, with effective sample sizes of 1106, 830, and 1597, respectively. We were in charge of questionnaire design, investigation, and data analysis. The sampling method was multi-stage random sampling. The self-administered questionnaire consists of three parts. The first part covers the respondents' basic characteristics, such as gender, age, and educational level. The second part comprises two daily time allocation tables, one for weekdays and one for weekends. Each table treats every 10 min as a unit, so a day is divided into 144 time periods, and the respondents are required to record their activity in each period. The third part collects information about the respondents' involvement in physical exercise, cultural and recreational activities, hobbies, amateur learning, public welfare activities, and tourism in the previous year, including the frequency, companions, and so on. Since the questionnaire is filled in by the respondents themselves and thus expresses their real thoughts, the survey data can be regarded as true, objective, and accurate.

The time allocation features selected in this paper are average daily times calculated as $\frac{5 \times \text{time(weekdays)} + 2 \times \text{time(weekends)}}{7}$. Leisure time is a numerical variable in the questionnaire, with a median of 237.14 min per day. The survey data from the three years are combined, and a "year" feature is introduced to distinguish instances from different years. The effective sample contains no missing values, and abnormal values are deleted to prevent them from affecting the model's accuracy. First, we eliminate apparent outliers, such as an observation with an extreme age value of 159. Second, we apply the $3\sigma$ principle (i.e., outliers are defined as observations whose standardized score exceeds three in absolute value) to eliminate outliers in the numerical features other than leisure time. Outliers of leisure time are not processed by the $3\sigma$ principle because leisure time is binarized during modeling, which itself limits the influence of outliers. The sample size after outlier processing is 3356, which is utilized for further analysis in this paper. Taking the median leisure time mentioned above as the boundary, observations below the median are regarded as negative examples and the others as positive examples for binary classification, totaling 1639 negative cases and 1717 positive cases.
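The preprocessing steps above, i.e., the weighted daily average, obvious-outlier removal, the 3σ rule, and median binarization, can be sketched in pandas on toy data. Column names and values below are hypothetical, not the actual survey variables:

```python
import pandas as pd

# Toy stand-in for the survey records; values are hypothetical.
df = pd.DataFrame({
    "leisure_weekday": [200, 250, 300, 150, 400],
    "leisure_weekend": [300, 350, 500, 200, 600],
    "age":             [25, 40, 159, 33, 60],   # 159 is an apparent outlier
})

# Average daily time: (5 * weekday + 2 * weekend) / 7
df["leisure"] = (5 * df["leisure_weekday"] + 2 * df["leisure_weekend"]) / 7

# Step 1: drop obviously impossible values
df = df[df["age"] < 120].copy()

# Step 2: 3-sigma rule on numerical features other than leisure time
z = (df["age"] - df["age"].mean()) / df["age"].std()
df = df[z.abs() <= 3].copy()

# Step 3: binarize leisure time at the sample median (1 = above median)
df["label"] = (df["leisure"] >= df["leisure"].median()).astype(int)
```

Binarizing at the median also yields roughly balanced classes, which simplifies the downstream classification task.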

#### *4.2. Variable Description*

The dependent variable is leisure time. Following existing research, leisure time in this paper mainly denotes the time that individuals have at their disposal to engage in activities of their own choosing that bring pleasant experiences, excluding work/study time, essential time for personal life, and housework time. Moreover, leisure time in this paper consists of time for participating in recreational pursuits, including learning about culture and science; reading various forms of written media, such as news, books, and magazines; watching TV, movies, and dramas; garden walks; and other leisure activities. The work/study time within the system in this paper refers to the time specified by the company/school [80], which excludes overtime and commuting time. The influencing factors of residents' leisure time are considered mainly from four aspects: time allocation characteristics, demographic characteristics, occupational characteristics, and family characteristics. Our sample comprises students, current employees, and retirees. Therefore, in terms of educational level, we divide people into five categories: (1) those who are neither working nor students; (2) those who are still in school (students); and, among current employees, (3) those educated for 9 years or less, (4) those educated for 9–12 years, and (5) those educated for more than 12 years. Students and retirees are classified as "not working" in each occupational characteristic because they are not presently employed. Further details are shown in Table 1.


**Table 1.** Factors influencing residents' leisure time.



#### **5. LightGBM Model Construction and Evaluation**

*5.1. Model Construction*

The models utilized in this paper are run in a Python 3.7 environment. The LightGBM package [73], Scikit-learn package [81], and shap package [79] are applied for model training, evaluation, and explanation. The modeling process involves the following steps:


#### *5.2. Evaluation Metrics*

To evaluate the model's effectiveness, we apply five commonly utilized evaluation metrics for classification models: accuracy, precision, recall, F1-score, and AUC (Area Under Curve). First, we construct the binary classification confusion matrix as presented in Table 2 according to the model prediction results.


Then, we calculate the following five metrics. Larger values of these metrics indicate better model performance.

$$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN} \tag{9}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}, \quad F1 = \frac{2}{\frac{1}{Precision} + \frac{1}{Recall}} \tag{10}$$

To calculate AUC, we first calculate the TPR (True Positive Rate) and FPR (False Positive Rate).

$$TPR = Recall = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN} \tag{11}$$

The ROC (Receiver Operating Characteristic) curve is then drawn with FPR as the horizontal axis and TPR as the vertical axis, and AUC is calculated as the area under the ROC curve.
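The five metrics can be computed directly with scikit-learn. The labels and scores below are toy values for illustration only:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy labels and scores standing in for real model outputs.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.3, 0.8])
y_pred  = (y_score >= 0.5).astype(int)    # threshold the scores at 0.5

acc  = accuracy_score(y_true, y_pred)     # (TP + TN) / all
prec = precision_score(y_true, y_pred)    # TP / (TP + FP)
rec  = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1   = f1_score(y_true, y_pred)           # harmonic mean of precision, recall
auc  = roc_auc_score(y_true, y_score)     # area under the ROC curve
```

Note that AUC is computed from the continuous scores, not the thresholded predictions, since it integrates over all possible thresholds.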

#### *5.3. Model Evaluations*

The confusion matrix of LightGBM on the test set is shown in Figure 1. The horizontal axis of Figure 1 represents the predicted value of 0 or 1, the vertical axis represents the actual value of 0 or 1, and the black area represents the number of misclassified instances. The ROC curve is shown in Figure 2, and AUC on the test set can be calculated as 0.91.

**Figure 1.** Confusion matrix for LightGBM on the test set.

**Figure 2.** ROC curve for LightGBM on the train set and test set.

In addition, we choose several classic models to compare their performance with LightGBM, including logistic regression (LR), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), Decision Trees (DT), Random Forests (RF), and Deep Neural Networks (DNN). The DNN model is trained by the PyTorch-tabular package [82], and to avoid overfitting, we designed a simple DNN with only two hidden layers containing 64 and 32 neurons, respectively. For the other compared algorithms, we utilize the Scikit-learn package with default parameters. It should be noted that the processing of categorical and numerical features varies when using different models. In particular, when using Decision Trees, Random Forests, and LightGBM models, there is no need to standardize numerical features. However, when using other comparison models, numerical features must be standardized. Additionally, when using LightGBM and DNN models, categorical features only require the label encoder, as the models are capable of processing them on their own. However, for the other comparison models mentioned, categorical features must be converted into dummy variables.
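The per-model preprocessing described above can be encoded as two scikit-learn `ColumnTransformer`s; the feature names below are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder

num_cols = ["age", "work_time"]          # hypothetical numerical features
cat_cols = ["gender", "occupation"]      # hypothetical categorical features

# For LR / SVM / KNN / DNN: standardized numerics, dummy-coded categoricals.
linear_prep = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# For DT / RF / LightGBM: raw numerics, label-encoded categoricals suffice.
tree_prep = ColumnTransformer([
    ("num", "passthrough", num_cols),
    ("cat", OrdinalEncoder(), cat_cols),
])

X = pd.DataFrame({"age": [25, 40], "work_time": [480, 520],
                  "gender": ["m", "f"], "occupation": ["a", "b"]})
X_linear = linear_prep.fit_transform(X)  # 2 scaled + 4 dummy columns
X_tree = tree_prep.fit_transform(X)      # 2 raw + 2 ordinal columns
```

Tree-based models split on thresholds, so they are invariant to monotonic rescaling of numeric features, which is why standardization is unnecessary for them.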

According to the confusion matrix, the average value of accuracy, precision, recall, and F1-score corresponding to all categories on the test set can be calculated, and the results are shown in Table 3. The models in Table 3 are ranked based on their performance from best to worst.


**Table 3.** Evaluation metrics for LightGBM and competitors on the test set.

Table 3 shows that LightGBM performs the best in terms of accuracy, precision, recall, and F1-score, followed by logistic regression and support vector machines. Although logistic regression is not a black-box model and directly yields the coefficient of each independent variable, it rests on a strong assumption (a linear relationship between the features and the log-odds of the outcome), and its estimates may be inaccurate when this assumption is violated. Moreover, logistic regression restricts each feature's relationship with the outcome to be monotonic, interpreted through the direction and magnitude of its coefficient, whereas the actual relationships are often more complex. As we find in the conclusion section, there is a U-shaped relationship between age and leisure time and an inverted U-shaped relationship between annual household income and leisure time, which logistic regression cannot capture. Furthermore, many statistical theories assume that variables are independent of each other, which is difficult to achieve in real-world data. In this regard, tree models may be more suitable than other statistical models. To sum up, the results demonstrate that LightGBM performs well, with favorable scores across multiple metrics, indicating that the selected factors can explain the changes in residents' leisure time well. Accordingly, SHAP is a better choice for analyzing the factors influencing leisure time in Section 6 than explanation approaches such as logistic regression coefficients.

#### **6. Analysis of the Changes and Influencing Factors of Leisure Time by Using SHAP**

The model built in this paper captures the relationships between the features and leisure time very well, as verified by the excellent evaluation metrics. Therefore, based on the train set, the marginal contribution of each feature to the determination of positive and negative cases is calculated according to its SHAP values, so as to find out how each feature affects the dependent variable.

Beeswarm plots are introduced as a tool to analyze the factors. It should be noted that each point in a beeswarm plot represents a single instance. The different colors signify the various values of the current feature, with blue corresponding to small values and red corresponding to large values. The horizontal axis of the plot indicates SHAP values associated with the features. The magnitude of the SHAP value reflects the feature's effects on the outcome. A positive SHAP value indicates that the feature leads to a positive impact on the instance's leisure time, while a negative SHAP value indicates the opposite. Additionally, the higher the SHAP value, the more likely the instance's leisure time exceeds the median.

#### *6.1. Changes in Beijing Residents' Leisure Time over the Last 30 Years*

From Figure 3, we find that from 2011 (blue) to 2021 (red) the SHAP value decreased, indicating that the leisure time of Beijing residents has decreased considerably over the past 10 years.

**Figure 3.** Beeswarm plot of the effects of "year".

In fact, according to the data from the Beijing Residents' Time Allocation Surveys of 1986, 1996, 2001, 2006, 2011, 2016, and 2021, there have been noticeable changes in leisure time. As shown in Figure 4, leisure time first grew and then declined over the past 30 years. On 1 May 1995, China started implementing the two-day weekend system, which greatly increased people's leisure time. Specifically, the average daily leisure time of Beijing residents increased by 1 h and 4 min between 1986 and 1996 thanks to this institutional factor. However, under the influence of market factors, residents' leisure time then began to decrease gradually. The average daily leisure time of Beijing residents in 2021 was only 13 min longer than in 1986 and 51 min shorter than in 1996.

**Figure 4.** Changes in leisure time over the past 30 years (leisure time has been normalized by Min–Max).

#### *6.2. Analysis on the Influencing Factors of Beijing Residents' Leisure Time*

#### 6.2.1. Primary Factors Restricting Leisure Time: Work/Study and Housework

The features in Figure 5 are ordered by their significance as calculated by their mean absolute SHAP values. They reflect how the factors influence leisure time in the process of modeling, resulting in the final prediction result.

As shown in Figure 5, all the time allocation characteristics, including work/study time within the system, housework time, essential time for personal life, and commuting time, substantially decrease the amount of leisure time. Work/study time within the system stands out as the most influential factor. It reveals that there are still deficiencies in the current national holiday system and that longer working hours within the system have put a significant strain on leisure time. According to the data from the "China Labor Statistical Yearbook", the average weekly working hours of Chinese urban employees have been more than 44 h from 2001 to 2020 [83].

**Figure 5.** Beeswarm plot of feature importance ranking.

Aside from time allocation characteristics, the second most significant influencing factor is age, which is one of the demographic characteristics. Additionally, occupational characteristics play a crucial role in shaping leisure time as well. Despite ranking lower in the order of feature importance, family characteristics do have an effect on leisure time.

Below is a thorough analysis of how demographic, occupational, and family factors influence leisure time.

#### 6.2.2. Age Differences and Gender Inequality in Leisure Time

Age and gender are the most important influencing factors of demographic characteristics, as depicted in Figure 6.

**Figure 6.** Beeswarm plot of the effects of demographic factors.

Because the elderly are typically retirees, they tend to have more leisure time. Instances in different gender groups are clearly distributed on the two sides of the vertical axis, with males (blue) having more leisure time. There is thus evident gender inequality in leisure time, which is consistent with the findings of previous studies [35]. The SHAP values of years of education primarily range between −0.5 and 0.25, implying that education has some effect on leisure time. As the SHAP values of marital status fluctuate around zero, its effect on leisure time is negligible.

#### 6.2.3. Occupational Heterogeneity in Leisure Time

As illustrated in Figure 7, among occupational characteristics, "enterprise ownership" (i.e., the ownership of the work unit) has the largest impact on leisure time. The color from blue to red denotes "not working, enterprises owned by the whole people, collectively owned enterprises, individual industrial and commercial households, joint ventures, wholly-owned enterprises, joint-stock enterprises and others". Persons working in enterprises owned by the whole people or in collectively owned enterprises have considerably more leisure time than those in other enterprises, which has been given little emphasis in the previous research. Regarding company size, employees of small companies (blue and purple) have more leisure time than those of large companies (red). From the perspective of occupational category, SHAP values of occupational category vary between −0.6 and 0.6, showing that different occupational categories have varying impacts on leisure time. For instance, management positions and personal occupations (red) are associated with a detrimental effect on leisure time. Additionally, having fewer than two days off per week (also shown in red) significantly reduces leisure time.

**Figure 7.** Beeswarm plot of the effects of occupational factors.

#### 6.2.4. High Income and Caring for Others Squeezing Leisure Time

Annual household income is the most important factor among family characteristics. Figure 8 depicts its major effects. The horizontal axis represents the values of the feature, and the left vertical axis represents its SHAP values, which quantify the feature's influence on the LightGBM model's outcome. The color has the same meaning as the horizontal axis, i.e., it corresponds to the feature values.

An inverted U-shaped curve characterizes the relationship between annual household income and leisure time, as shown in Figure 8. The lowest (blue) and highest (red) incomes have a significant negative impact on leisure time, i.e., both extremely low and extremely high incomes are accompanied by a lack of leisure time. When the income is less than CNY 30,000, it has a negative impact on leisure time. When the income is between CNY 30,000 and 100,000, it has a positive impact on leisure time, and this positive impact increases with income. However, when the income exceeds CNY 100,000, it begins to have a negative impact on leisure time again, and this influence grows with income.

**Figure 8.** Scatter plot of the effects of annual household income.

As for the factor "care or not", Figure 9 shows that persons without family members to care for (blue) have significantly more leisure time than others. Obviously, when there are family members in need of care, caregiving takes up a great deal of time. This is consistent with the conclusions of an earlier study, which found that when children with chronic diseases need home care, parents' leisure time is reduced accordingly [55].

**Figure 9.** Beeswarm plot of the effects of "care or not".

#### *6.3. Interaction Effects of the Factors Influencing Beijing Residents' Leisure Time*

To capture the interaction effects between features, we utilize SHAP's dependence plot for analysis. It can display both the main and joint effects of features simultaneously. The interaction effects indicate how two features jointly affect the model's prediction, and they are displayed through the color-coded vertical distribution of SHAP values. The horizontal axis in the dependence plot represents the values of the main feature; the left vertical axis represents the SHAP values of the main feature, which describe the contributions of the main feature to the outcome of LightGBM; the right vertical axis describes the interaction effects, illustrating the SHAP values of the interaction feature, and the hue transitions from blue to red as the value of the interaction feature changes from small to large.

#### 6.3.1. Gender Inequality Shifts over a Decade

It is evident in Figure 10 that leisure time decreased over the ten years from 2011 to 2021. In 2011, women's SHAP values (red) were lower than men's (blue), showing that gender inequality in leisure time was severe. In 2016, women's SHAP values (red) were spread across the entire vertical axis, indicating no significant difference in leisure time between men and women. In 2021, women's SHAP values (red) began to be distributed in the upper part of the vertical axis, showing that women's leisure time had slightly surpassed men's. This shift from gender inequality toward gender equality reflects sustained efforts to promote gender equality in all fields of society [84].

**Figure 10.** Interaction effects of "year" and "gender".

#### 6.3.2. Gender Inequality Shifts across Educational Levels

It can be seen from Figure 11 that individuals who have graduated but are not currently employed have the most leisure time, while students and current employees have comparatively less. Judging from the fluctuation range of the SHAP values for years of education, the impact of education level on leisure time weakens as the education level rises, and its direction shifts from negative to positive.

As shown in Figure 11, when the number of years of education is less than nine, women (red) are distributed in the lower half of the vertical axis, indicating that women in this group have less leisure time. When the number of years of education is 9–12, women (red) are uniformly distributed on the vertical axis, showing a tendency toward gender equality. When the number of years of education exceeds 12, most women (red) are distributed in the upper half of the vertical axis, indicating that among highly educated groups, women have more leisure time. This highlights that women have more leisure time as their education level rises. To sum up, education level has a significant moderating effect on gender inequality in leisure time.

**Figure 11.** Interaction effects of "education" and "gender".

#### 6.3.3. Leisure Time Changes for Family Caregivers over a Decade

It can be seen from Figure 12 that in 2011, individuals who needed to care for family members (red) were distributed in the lower half of the vertical axis, implying that they had less leisure time. However, these individuals were distributed in the upper part of the vertical axis in 2016 and 2021, indicating that even with family members in need of care, people have begun to have more leisure time. This improvement may be attributed to economic development, technological progress, and an increase in annual household income, which have provided more advanced methods of assisting people in caring for others, such as hiring professional caregivers [85], utilizing AI intelligent nursing systems [86], etc.

**Figure 12.** Interaction effects of "year" and "care or not".

#### 6.3.4. Positive and Negative Effects of Weekly Rest Days

Overall, Figure 13 shows that the impact of age on leisure time follows a U-shaped pattern. Individuals between the ages of 30 and 40 are in the golden stage of striving for their dreams and have the least leisure time; as they get older, their leisure time steadily increases.

**Figure 13.** Interaction effects of "age" and "weekly rest days".

From the color distribution along the vertical axis in Figure 13, there is a significant interaction effect between age and weekly rest days. The common assumption is that fewer rest days mean less leisure time, and in general this inference is correct. However, in some cases, fewer rest days may contribute to a positive change in leisure time. The results in Figure 13 indicate that people between the ages of 20 and 30 who cannot guarantee two days off weekly (red) are distributed on the upper part of the vertical axis, meaning that they actually have more leisure time. The reason may be that when they must work overtime on weekends, they look for leisure compensation at other times, for example by seeking "retaliatory leisure" at the cost of other activities, which increases their leisure time. People aged 30–60 who are unable to take two days off weekly (red) are distributed at the bottom of the vertical axis, suggesting that being unable to take weekend breaks has a major negative impact on their leisure time. In general, the failure to implement two days off weekly exerts a marked detrimental impact on leisure time.

#### **7. Conclusions and Discussions**

#### *7.1. Main Conclusions*

This paper analyzes the changes in residents' leisure time and its major influencing factors from a machine learning perspective, based on survey data on Beijing residents' daily time allocation. In general, the time allocation characteristics are the most significant influencing factors. Work time within the system and housework time are the primary drivers of the substantial reduction in leisure time.

In terms of **demographic factors**, there is age heterogeneity and gender inequality in leisure time. A U-shaped relationship exists between age and leisure time. In the initial stages of life, individuals sacrifice more and more leisure time as they grow older in order to accumulate capital. As they reach their 40s and beyond, capital accumulation increases, working hours begin to decline after reaching a peak, and leisure activities become more feasible. People's pursuit of leisure time becomes more urgent as they get older, and as work and life pressures mount, they may consciously increase their leisure time. Gender inequality is evident in leisure time, with men enjoying more leisure time than women. Women may shoulder more housework and caregiving responsibilities, resulting in a continuous erosion of their personal leisure time. Gender inequality has, however, improved considerably over time, and by 2021 there was a trend toward gender equality. Education can moderate gender inequality, and higher education can serve to promote gender equality.

In terms of **occupational factors**, they also significantly influence leisure time, especially enterprise ownership. Employees of enterprises owned by the whole people or of collectively owned enterprises have more leisure time than others. This reflects differences in overtime culture under various enterprise systems, such as the long-standing "996" work schedule in the Internet industry, in which leisure time is severely constrained by the high-intensity work mode. This is consistent with the conclusion that different occupational categories are associated with different amounts of leisure time. The impact of company size is also notable, with employees of large companies having less leisure time. Interestingly, people aged 20–30 may actively create more leisure time when they have fewer than two days off per week, possibly due to a "retaliatory leisure" mentality, i.e., the active creation of leisure time at the cost of other time. However, in general, taking fewer than two days off per week reduces leisure time.

In terms of **family factors**, annual household income exhibits an inverted U-shaped relationship with leisure time, whereby individuals with lower incomes (<CNY 30,000) and higher incomes (>CNY 100,000) experience a decrease in leisure time. In contrast, those with annual household incomes between CNY 30,000 and 100,000 experience an increase in leisure time, and this positive impact grows as the income rises. In addition, when there is someone at home who needs to be cared for, the caregiver's leisure time is consumed. The interaction analysis with "year" shows that, with the development of science and technology, the crowding-out effect of caring for others on leisure time has started to diminish.

#### *7.2. Discussion*

Leisure time not only facilitates personal self-development but also stimulates leisure consumption and promotes economic growth. In light of the conclusions of this paper, we put forward the following potential measures to increase personal leisure time.

Reform the current vacation system to ensure an adequate supply of leisure time. The conclusions indicate that leisure time is primarily influenced by working hours. At the national level, the vacation system determines working hours within the system and serves as the main constraint on the supply of leisure time. The government can reform the current vacation system to increase the overall availability of leisure time. It is important for a country's vacation system to be in sync with its economic progress; thus, on the basis of a certain increase in labor productivity, the length of legal holidays could be appropriately extended. Additionally, to prevent the occurrence of leave-in-lieu and alleviate worker fatigue, one potential solution in developed cities is to implement a four-day-week system, with possible adjustments based on the actual situation of different enterprises or regions. Furthermore, the results show that having fewer than two days off weekly significantly reduces leisure time, highlighting the challenge of implementing the existing vacation system and ensuring an adequate supply of leisure time. The "996 work system" has even become the standard configuration of Internet companies, making it difficult to implement both paid leave and two days off per week. To address this problem, relevant reward and punishment policies should be issued to encourage the realization of legal holidays.

Provide personalized leisure products to promote the upgrading of leisure consumption. The results of this paper show that demographic characteristics such as age and gender have a significant impact on leisure time. Accordingly, enterprises can perform user clustering based on the characteristics of various groups and provide personalized leisure products to satisfy different consumer demands. At the market level, material guarantees are necessary to meet people's leisure consumption needs. For example, the conclusions of this paper demonstrate that women have less leisure time due to increased family obligations. For these groups with special needs, enterprises should create and develop innovative leisure products by leveraging cutting-edge technologies such as 5G and artificial intelligence, which help pave the way for the transformation and enhancement of the leisure industry. This can drive the evolution of the digital economy and travel consumption as well as provide an extensive variety of online services. The supply of online products such as "cloud music" and "cloud exhibition" can also be increased, enabling people to conveniently engage in leisure activities at any time and from any location. In particular, for elderly adults, community-based programs that provide leisure activities at home can create a fulfilling and enjoyable lifestyle for them in their twilight years.

Advocate a proper perception of leisure and stimulate potential leisure needs. The findings suggest notable differences in leisure time among people with different occupational characteristics. To redress this occupational imbalance in leisure time, the government should first take the lead in enforcing labor regulations and penalizing violations, providing channels for employees to report violations, and safeguarding employees' rights and interests. Second, we must promote a sound perception of leisure across society to encourage employees to actively pursue reasonable leisure time. From a demand standpoint, it is essential to provide a basic guarantee to strengthen people's leisure needs. Leisure time should be highlighted as the guarantee of a happy life, and "leisure" and "labor" should not be placed in opposition. The purpose of "labor" is to free up more time and money for leisure activities, which not only relax the body and mind but also help people achieve self-reflection and self-improvement, allowing them to dedicate themselves to "labor" more effectively. We should enhance public awareness; promote diverse and healthy leisure options; create a favorable environment for high-end leisure, cultural, and tourism activities; and raise awareness of the importance of leisure by organizing leisure conferences and publishing relevant leisure tourism manuals.

Despite conducting a comprehensive analysis of the effects of time allocation factors, demographic factors, occupational factors, and family factors on leisure time from a micro perspective, this paper has several limitations. Firstly, due to technical constraints, the SHAP approach utilized in this paper provides insight into how features affect LightGBM model predictions, but it may not reveal the true causal relationships between features and outcomes in the real world. Although this does not undermine the conclusions drawn in this paper, we intend to utilize other causal inference methods, such as Double Machine Learning, to quantify causal effects and evaluate intervention effects by making counterfactual predictions in future studies. Secondly, owing to data unavailability, this paper only considers Beijing as a case study, yet leisure time varies across regions. Therefore, in the future, we aim to incorporate more regions for comparative analysis.

**Author Contributions:** Conceptualization, Q.W.; Software, Y.J.; Validation, Q.W. and Y.J.; Formal Analysis, Y.J.; Investigation, Q.W. and Y.J.; Data Curation, Y.J.; Writing—Original Draft Preparation, Q.W. and Y.J.; Writing—Review and Editing, Q.W. and Y.J.; supervision, Q.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Social Science Fund of China (grant number: 21ATJ004).

**Data Availability Statement:** The data underpinning the presented findings are in Chinese and are available by contacting the corresponding author.

**Acknowledgments:** We would like to express our appreciation for the data assistance provided by the Leisure Economy Research Center of Renmin University of China.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
