1. Introduction
Lung cancer is a common disease with a higher mortality rate than other cancers and is the leading cause of cancer death [1]. According to the American Cancer Society, the number of new lung cancer cases in the United States was expected to reach 238,340 in 2023, with 127,070 deaths. Computed tomography (CT) imaging is the most commonly employed method for detecting lung diseases [2,3]. Regular CT screening of people at high risk of developing lung cancer can reduce the risk of dying from the disease. Experienced doctors can diagnose lung cancer based on the morphological characteristics of the lesions in CT images. However, CT scans produce huge amounts of image data, which increases the difficulty of proper diagnosis. Furthermore, doctors may misdiagnose due to long work shifts and monotonous work, and even experienced doctors and experts can easily miss small potential lesions. Therefore, automatic detection of lung tumors in CT images needs to be further advanced to improve the quality of diagnosis.
Accurate detection of lung cancer is a challenging task. On the one hand, tumors have complex edge features and variable positions [4]. As illustrated in Figure 1a, which shows chest CT images of patients with lung cancer, the texture, gray scale, and shape of tumors are important for clinical staging and pathological classification [5]. On the other hand, redundant image information hampers the detection task. For example, the abundant blood vessels, bronchi, and tiny nodules in the lung interfere with the distinctive features of tumors. In addition, tumors have different sizes (Figure 1b), and different types of tumors have different growth rates; for example, the proliferation rate of lung squamous cell carcinoma is lower than that of lung adenocarcinoma. Moreover, tumors of the same type have different sizes at different stages of their development [6], and a single tumor naturally appears at different sizes across multiple CT slices. The challenge posed by this variation in tumor size seriously limits the accuracy of existing tumor detection methods.
To date, much work has been done on automatic detection of lung lesions. Early computer-aided lung cancer detection methods mainly relied on artificially designed feature extractors, which capture the gray scale, texture, and other morphological features of a tumor in an image; these features are subsequently fed into a Support Vector Machine (SVM) or AdaBoost for classification. However, artificially designed features cannot adequately capture the highly variable tumor size, position, and edge characteristics, which limits the detection ability of these methods [7]. Recently, as deep learning has been increasingly applied in various medical and health care fields, many researchers have devoted themselves to lung tumor detection based on deep neural networks (DNNs) [8,9]. Unlike traditional methods relying on artificial design, DNNs have a large number of parameters and can fit semantic features better.
Gong et al. [10] used a deep residual network to identify lung adenocarcinoma in CT images and obtained outcomes comparable or even superior to those of radiologists. Mei et al. [11] conducted experiments on the PN9 dataset to detect lung nodules in CT scans using a slice-aware network. The results showed that the proposed SANet outperformed other 2D and 3D convolutional neural network (CNN) methods and significantly reduced the false positive rate (FPR) for lung nodules. Xu et al. [12] designed a slice-grouped domain attention (SGDA) module that can be easily embedded into existing backbone networks to improve the detection network's generalization ability. Su et al. [13] used the Bag of Visual Words (BoVW) and a convolutional recurrent neural network (CRNN) to detect lung tumors in CT images; their model first segments the CT images into smaller nano-segments using biocompatibility techniques and then classifies these segments using deep learning techniques. Mousavi et al. [14] introduced a deep-neural-network-based approach for identifying COVID-19 and other lung infections. More specifically, their method uses a deep neural network to extract features from chest X-ray images, an LSTM network for sequence modeling, and a SoftMax classifier for image classification. This method shows excellent performance in detecting COVID-19 and can help radiologists make diagnoses quickly. In [15], Mei et al. utilized a depth-wise over-parameterized convolutional layer to construct a residual unit in the backbone network, improving the feature representation ability of the network. The study also enhanced the confidence loss function and focal loss to handle the significant imbalance between positive and negative samples during training. Notably, this method focuses on the efficiency and practicability of the detector; version 4 of You Only Look Once (YOLO), i.e., YOLOv4, served as its baseline, yet few studies so far have used YOLO to detect lung tumors.
Although many processing methods exist for automated detection of lung tumors in CT images, the variability of tumor size has received little attention. As indicated above, lung tumor size varies considerably, posing challenges for precise tumor detection. Since this multi-scale issue constrains the efficacy of prevalent detection methods, some researchers have addressed it and proposed improvements to existing methods. Causey et al. [16] combined 3D convolution with Spatial Pyramid Pooling (SPP) to develop a lung cancer detection algorithm that reduces the FPR; on the National Lung Screening Trial (NLST) data cohort used for testing, the area under the curve (AUC) reached 0.892, showing better detection performance than using 3D convolution alone. Compared with detecting 2D slices one by one, 3D convolution can capture rich spatial and volumetric information in adjacent slices, and such models can generalize to sparsely annotated datasets. However, 3D convolution consumes more computer memory than conventional convolution. Other studies have proposed feature pyramid networks (FPNs), in which the recognition of small tumors depends on features from the shallow layers, while the top-level layers carry richer semantic information that is important for accurate tumor classification. The purpose of an FPN is to connect feature maps across different layers, so as to combine the low-resolution but semantically strong deep feature maps with the high-resolution shallow feature maps, enhancing the semantic information of the latter. In order to effectively integrate multi-scale information, Guo et al. [17] fused feature maps at different layers. In [18], Guo and Bai constructed an FPN to detect multi-scale lung nodules, significantly improving the accuracy of small lung nodule detection. In [19], a bi-directional FPN (BiFPN) was applied to improve the feature fusion structure of YOLOv5, adding a fusion path between features at the same layer. Other improvements to feature fusion networks have also achieved good results in other tasks [20]. The original FPN structure and its variants adopt complex cross-scale connections to obtain stronger multi-scale representation ability. Although helpful for improving multi-scale tumor detection, this operation requires more parameters and increases computational expense, which runs contrary to the general expectation of a highly efficient detector.
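The top-down fusion idea behind FPNs can be illustrated with a minimal 1-D sketch. This toy version, with nearest-neighbor upsampling and elementwise addition on plain lists, is only illustrative; real FPNs operate on 2-D feature maps with learned 1x1 lateral convolutions, and the layer lengths and values below are arbitrary assumptions.

```python
# Minimal 1-D sketch of FPN-style top-down fusion (illustrative only;
# real FPNs use 2-D feature maps and learned lateral convolutions).

def upsample2(feat):
    """Nearest-neighbor upsampling by a factor of 2."""
    return [v for v in feat for _ in range(2)]

def fpn_top_down(c_layers):
    """Fuse backbone features, ordered shallow -> deep; each deeper
    map is half the length of the previous one."""
    p = [None] * len(c_layers)
    p[-1] = c_layers[-1]                 # deepest level passes through
    for i in range(len(c_layers) - 2, -1, -1):
        up = upsample2(p[i + 1])         # bring deep semantics to this scale
        p[i] = [a + b for a, b in zip(c_layers[i], up)]
    return p

# Hypothetical 3-level pyramid: shallow (length 8), mid (4), deep (2).
c3, c4, c5 = [1.0] * 8, [2.0] * 4, [4.0] * 2
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p5)  # [4.0, 4.0]                     (unchanged deepest level)
print(p4)  # [6.0, 6.0, 6.0, 6.0]           (2 + upsampled 4)
print(p3)  # [7.0, 7.0, ...]                (1 + upsampled 6)
```

The sketch shows why every shallow level ends up carrying deep semantic content: each fused map is re-upsampled and added into the next shallower one, which is also the source of the semantic conflicts that cross-scale fusion designs try to mitigate.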
Taking inspiration from previous work, we consider a real-world hospital scenario where a large volume of CT data is available but hardware resources are limited. To reduce hardware costs while maintaining the speed of tumor detection, we selected YOLOv7 as the underlying framework, since it achieves a good balance between accuracy and detection speed without requiring the generation of candidate boxes, as opposed to two-stage detection models. In this paper, we propose a novel one-stage detection model, called ELCT-YOLO, based on the popular YOLOv7-tiny network architecture [21], for solving the problem of multi-scale lung tumor detection in CT scan slices. For ELCT-YOLO, we first designed a Decoupled Neck (DENeck) structure to improve the multi-scale feature representation ability of the model. Unlike previous feature fusion designs [22,23], we neither stack a large number of basic structures nor build a complex topology. Instead, we propose decoupling the feature layers into a high-semantic region and a low-semantic region, so as to reduce semantic conflict during fusion. Secondly, we propose a Cascaded Refinement Scheme (CRS), which includes a group of Receptive Field Enhancement Modules (RFEMs) to explore rich context information. Using atrous convolution, we constructed two multi-scale sensing structures, namely a Series RFEM (SRFEM) and a Parallel RFEM (PRFEM). To expand the effective receptive field, the series structure uses a sequence of atrous convolutions with different sampling rates, while a residual connection is applied to alleviate grid artifacts as per [24]. The parallel structure constructs complementary receptive fields, in which each branch matches the amount of information to its own receptive field. In addition, we studied the performance of different cascaded schemes through experiments.
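How a series of atrous convolutions expands the effective receptive field can be checked with standard receptive-field arithmetic. The sketch below uses pure Python; the 3x3 kernels and the dilation rates (1, 2, 5) are illustrative assumptions, not the paper's actual SRFEM configuration.

```python
# Receptive-field arithmetic for a stack of stride-1 dilated (atrous)
# convolutions, as used in series structures such as SRFEM.
# Kernel sizes and dilation rates below are illustrative assumptions.

def effective_kernel(k, d):
    """Effective kernel size of a kxk convolution with dilation d."""
    return k + (k - 1) * (d - 1)

def stacked_receptive_field(layers):
    """Receptive field of stride-1 convolutions applied in series;
    each layer is a (kernel_size, dilation) pair."""
    rf = 1
    for k, d in layers:
        rf += effective_kernel(k, d) - 1
    return rf

# Three cascaded 3x3 convolutions with dilation rates 1, 2, 5:
print(stacked_receptive_field([(3, 1), (3, 2), (3, 5)]))  # 17
# The same three layers without dilation cover only:
print(stacked_receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
```

Note that rate sequences without a common factor (such as 1, 2, 5) are the usual remedy for the gridding problem of dilated convolutions, which is the same sparse-sampling artifact that the residual connection in the series structure is meant to alleviate.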
The main contributions of this paper can be summarized as follows:
- In order to solve the problem of multi-scale detection, a novel neck structure, called DENeck, is designed to effectively model the dependency between feature layers and improve detection performance by using complementary features with similar semantic information. In addition, compared with the original FPN structure, the design of DENeck is more efficient in terms of the number of parameters used.
- A novel CRS structure is designed to improve the robustness of variable-size tumor detection by collecting rich context information. At the same time, an effective receptive field is constructed to refine the tumor features.
- The spatial pyramid pooling-fast (SPPF) module of YOLOv5 [25] is integrated at the top of the original YOLOv7-tiny backbone network to extract important context features with fewer parameters, using multiple small cascaded pooling kernels, thereby further increasing the model's operational speed and enriching the representation ability of the feature maps.
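The reason SPPF's cascaded small kernels are equivalent to SPP's parallel large ones can be demonstrated with a 1-D toy version: chaining stride-1 max-pools of size 5 reproduces the neighborhoods of parallel pools of size 5, 9, and 13. This is a sketch only; the real module pools 2-D feature maps, and the input values below are arbitrary.

```python
# 1-D sketch of the SPPF trick: three chained stride-1 max-pools of
# size 5 see the same neighborhoods as parallel pools of size 5, 9, 13.

def maxpool1d(x, k):
    """Stride-1 max-pool with window k, window clipped at the borders."""
    r = k // 2
    return [max(x[max(0, i - r):i + r + 1]) for i in range(len(x))]

x = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

p1 = maxpool1d(x, 5)    # equivalent to one size-5 pool
p2 = maxpool1d(p1, 5)   # equivalent to one size-9 pool of x
p3 = maxpool1d(p2, 5)   # equivalent to one size-13 pool of x

assert p2 == maxpool1d(x, 9)
assert p3 == maxpool1d(x, 13)
# SPPF concatenates x, p1, p2, p3 -- the same scales SPP obtains with
# separate 5/9/13 pools, but computed with cheaper small kernels.
```

Because each intermediate result is reused, the cascaded form covers the same set of receptive-field scales with fewer pooling operations per output, which is where the speed advantage over the original SPP comes from.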
5. Conclusions and Future Work
This paper has proposed an efficient one-stage ELCT-YOLO model, based on improvements introduced into the YOLOv7-tiny model, for lung tumor detection in CT images. Unlike models built on existing neck structures, the proposed model aims to capture multi-scale tumor information from the images. Firstly, a novel Decoupled Neck (DENeck) structure has been introduced in ELCT-YOLO to reduce semantic conflicts. More specifically, the model's neck is divided into high-semantic layers and low-semantic layers, generating clearer feature representations by decoupling the fusion between these two semantic types. The conducted experiments proved that DENeck can be integrated well into backbone networks of different depths, while also showing outstanding robustness. Secondly, a novel Cascaded Refinement Scheme (CRS), configured at the lowest layer of the decoupling network, has been introduced in order to capture tumor features under different receptive fields. The optimal CRS structure was determined through another set of experiments. In addition, the problem of sparse sampling caused by dilated convolution has been considered, and the effect of different receptive-field combinations on the cascaded modules has been compared experimentally. Thirdly, the SPPF module of YOLOv5 has been integrated at the top of the original YOLOv7-tiny backbone network to extract important context features, further improve the model's operational speed, and enrich the representation ability of feature maps. Extensive experiments, conducted on CT data provided by Lung-PET-CT-Dx, demonstrated the effectiveness and robustness of the proposed ELCT-YOLO model for lung tumor detection.
The presented study has focused on addressing the multi-scale issue of tumor detection using a lightweight model. The model still needs further optimization to reduce both the number of parameters and the computational complexity. As a next step of the future research, we will use network distillation techniques and existing lightweight convolutional modules to construct a simpler model, aimed at reducing the inference latency and the number of parameters. In addition, the study presented in this paper has focused only on tumor detection based on CT images. In fact, some emerging technologies, such as super-wideband microwave reflection measurement, are more user friendly and cost effective than traditional detection techniques such as CT [55]. In the future, we will also study emerging technologies for lung cancer detection in more depth.