1. Introduction
In 2018, more than 1.7 million people died from lung cancer, accounting for the highest proportion of all cancer deaths, at 18.4%. Lung cancer also led to the most new cancer cases in 2018, at more than 2 million [1]. Lung cancer has the highest death rate (26%) among the top four major cancers, with a low five-year survival rate of 18%, according to the 2019 report of the American Cancer Society [2]. The survival rate is low because the clinical symptoms of lung cancer usually present only at an advanced stage [3]. Hence, early diagnosis is essential: it can raise the five-year survival rate above 90% and greatly improve the chance of a cure.
A lung nodule is an abnormal growth in the lung that is smaller than 30 mm [4]. A growth larger than 30 mm is called a lung mass and has a higher chance of being cancerous [5]. Lung nodules can be benign or malignant. Benign nodules are noncancerous; they generally have more regular shapes, smoother contours, smaller sizes, and more fat-like tissue density than cancerous nodules [4], as shown in Figure 1a. Benign nodules are usually asymptomatic and carry no risk of spreading to other parts of the body. Malignant nodules are cancerous; they have more irregular morphologies, higher density, and larger sizes than benign nodules, and often show spiculation [4], as shown in Figure 1c. Malignant nodules can spread to other organs or tissues and proliferate quickly, causing physical discomfort and even threatening life. Some nodules are also hard to diagnose consistently among radiologists with different levels of experience, as shown in Figure 1b; they may share morphological features with both benign and malignant nodules. In addition, patients may present with a solitary pulmonary nodule or with multiple nodules. Accurate diagnosis of each nodule therefore allows doctors to provide an accurate prognosis and appropriate treatment.
The diagnosis of lung cancer is commonly assisted by computed tomography (CT). Physicians give interpretations and inferences based on their experience; however, they are inevitably affected by subjectivity, or even fatigue, when reading many CT images in a day [6]. Computer-aided diagnosis (CAD) systems have therefore become an important means of easing this workload, especially those based on deep learning [7]. Deep learning is popular and performs impressively across numerous diagnostic modalities, such as medical imaging [8], EEG, and ECG [8]. Owing to the extensive and in-depth development of deep learning in computer vision, medical imaging has benefited from this trend. Many techniques are used in medical image processing, such as convolutional neural networks (CNN) [9], transfer learning [8], attention mechanisms [10,11], generative adversarial networks (GAN) [12], and unsupervised learning [13], for disease classification, detection, segmentation, reconstruction, and so forth [14,15].
Likewise, lung cancer diagnosis and classification has benefited from CNNs [14]. For binary classification, Shen et al. [16] proposed a multi-scale CNN (MCNN) trained on three scales of input images. These images are fed into a weight-sharing network that can be regarded as a single standard shallow CNN, so the classification performance is limited. To improve it, they then proposed a new structure named multi-crop CNN (MC-CNN) [17]. This structure is mainly based on the multi-crop pooling method, which twice crops the center of the feature maps to a quarter of their size and applies max-pooling a different number of times to each branch to produce multi-crop features. However, the repeated downsampling loses useful features. They achieved 87.14% accuracy for binary classification. Nóbrega et al. [18] studied the performance of transfer learning for lung nodule classification based on 11 existing and widely used models, such as VGG, ResNet, DenseNet, and NASNetLarge. They then classified the extracted deep features using six classifiers, e.g., support vector machine (SVM) and random forest (RF). Their best result, an accuracy of 88.41% and an AUC of 93.19%, was obtained with ResNet50 features and an SVM with an RBF kernel. Dey et al. [19] proposed four two-pathway networks based on 3D CNNs: a basic 3D CNN, a 3D multi-output CNN, a 3D DenseNet, and a 3D multi-output DenseNet (MoDenseNet). All the networks are pre-trained on ImageNet, and the two-pathway networks are trained on two views of the inputs. They obtained the best accuracy of 90.47% with the 3D MoDenseNet. However, they used only 686 nodule samples for both training and testing.
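For concreteness, the multi-crop pooling of MC-CNN [17] described above can be sketched as follows. The 8 × 8 feature map, the half-side center crop, and the 2 × 2 max-pooling are illustrative assumptions of ours, not details taken from the original paper; the point is only that the three branches reach the same spatial size through different crop/pool combinations.

```python
import numpy as np

def center_crop_half(x):
    """Crop the central region with half the side length (a quarter of the area)."""
    h, w = x.shape
    top, left = h // 4, w // 4
    return x[top:top + h // 2, left:left + w // 2]

def max_pool2(x):
    """2x2 max-pooling with stride 2."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def multi_crop_pool(f):
    """Three branches that end at the same spatial size:
    full map pooled twice, half-size center crop pooled once,
    quarter-size center crop left as-is."""
    r0 = max_pool2(max_pool2(f))
    r1 = max_pool2(center_crop_half(f))
    r2 = center_crop_half(center_crop_half(f))
    return np.stack([r0, r1, r2])

f = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 feature map
crops = multi_crop_pool(f)                     # shape (3, 2, 2)
```

The repeated downsampling criticized in the text is visible here: the `r0` branch discards three quarters of the activations at each pooling step.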
In 2019, a deep local–global network was proposed for lung nodule classification [20]. In this structure, residual blocks and non-local blocks are connected alternately, followed by a global average pooling layer and a sigmoid classifier. However, the feature maps keep the same spatial size as the input throughout the network, and the size is then reduced abruptly to 1 × 1 by global average pooling. As a result, many redundant features are produced during the convolutions and information is lost in the global average pooling, which is not well suited to extracting useful features for classification. El-Regaily et al. [21] proposed a multi-view CNN to reduce the false positive rate in lung nodule binary classification. They extracted three views from the nodule’s 3D model: axial, coronal, and sagittal. These three views are fed into three complete CNNs whose softmax classifiers are trained individually, and the three outputs are fused into the final classification result with a logical OR operation. However, the CNNs they used are standard three-layer networks; even though the logical OR selects the best classifier, the classification ability of each individual classifier is limited. Hussein et al. [22] proposed to classify lung and pancreatic tumors using both supervised and unsupervised methods. For the supervised model, they introduced a multi-task learning method based on a transfer-learning-based 3D CNN, which combines classification and attribute score estimation; it obtained an accuracy of 81.73% for binary classification. For unsupervised learning, they used a proportion-SVM to obtain the initial labels and label proportions, and then classified the tumors based on them, achieving 78.06% accuracy for lung nodule classification. However, the dataset they used includes only 1144 nodules. Liu et al. [23] proposed a multi-task network for lung nodule binary classification and attribute score regression to enhance the performance of both tasks. They then applied a Siamese network with a margin loss to learn to rank the most malignancy-related features and improve the discriminative ability of the network. They achieved an accuracy of 93.5%, a sensitivity of 93.0%, a specificity of 89.4%, and an AUC of 97.90%. However, they used only 1250 images; although they applied five-fold cross-validation to evaluate the effectiveness and robustness of the network, their dataset is only 5.7% the size of ours, and small datasets may limit generalizability.
However, not all nodules have distinguishing characteristics. Some remain indeterminate even for experienced professionals, who must combine other measures to reach a definite diagnosis. Identifying these nodules is therefore also essential to the treatment and cure of patients. In this work, we label this class of nodules as indeterminate for ternary classification. Shen et al. [17] also considered ternary classification, with an accuracy of only 62.46%. In our previous work [24], we proposed a multi-level convolutional neural network (ML-CNN), which consists of three levels of two-layer CNNs, and achieved an accuracy of 84.81%.
As lung nodules have different sizes and various morphologies, multi-scale features of the input, inspired by Inception [25], are gaining attention. Unlike Inception, we designed a structure of multiple parallel levels based on ResNets [26], in which each level uses a different scale of convolutional kernel. Moreover, to share information among all the levels, we additionally connect the residuals across levels. Overall, in this paper, we propose a multi-level cross ResNet (ML-xResNet) to enhance the performance of lung nodule malignancy classification in thoracic CT images, for both ternary and binary classification. We classify the nodules into benign, indeterminate, and malignant for ternary classification, and then remove the indeterminate nodules to obtain the binary dataset. Our approach was evaluated on the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) database [27].
ML-xResNet is based on the following ideas. Since lung nodules have various sizes and morphologies, extracting their features with a single fixed filter size causes information loss or insufficiency. To solve this problem, we design a novel structure named the multi-level cross residual (ML-xRes) block. The ML-xRes block consists of three parallel residual blocks with the same structure but different convolution kernel sizes, which extract multi-scale features of the input nodules. Moreover, because these scales of features would otherwise never be fused, and to improve the efficiency of multi-scale feature extraction, we connect the residuals not only within each level but also across the other levels. In this way, the information of all levels is shared and combined during training. We then insert the ML-xRes blocks into our multi-level structure; apart from these blocks, the remaining layers are typical convolutions with max-pooling. Finally, we fuse the outputs of the three levels by concatenation and classify the nodules with a softmax classifier.
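As an illustration of the idea, the following sketch performs one ML-xRes step on single-channel feature maps. The naive convolution, the ReLU, the kernel sizes 3/5/7, and the exact form of the cross residual (summing the inputs of all levels into each level's output) are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 'same' convolution with zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def ml_xres_block(level_inputs, kernels):
    """One ML-xRes step: each level convolves its own input with its own
    kernel size, then adds residuals from its own level AND from the other
    levels (the cross-residual connection)."""
    feats = [np.maximum(conv2d_same(x, k), 0.0)   # per-level conv + ReLU
             for x, k in zip(level_inputs, kernels)]
    cross = sum(level_inputs)                     # residual shared across levels
    return [f + cross for f in feats]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
# three parallel levels with hypothetical kernel sizes 3/5/7, fed the same input
kernels = [rng.standard_normal((k, k)) * 0.1 for k in (3, 5, 7)]
outs = ml_xres_block([x, x, x], kernels)
fused = np.concatenate(outs)  # fusion of the three levels by concatenation
```

Because every level receives the residuals of all levels, each scale of features sees what the other scales have learned, which is the information-sharing effect the cross connections are designed for.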
The rest of the paper is organized as follows.
Section 2 describes the details of the proposed ML-xResNet.
Section 3 introduces the materials we used.
Section 4 and
Section 5 present the experimental results and the discussion, respectively.
Section 6 concludes the paper.
5. Discussion
Usually, the objects studied in medical imaging have various sizes and morphologies, especially organs, tissues, and lesions. Researchers have taken measures to tackle this problem, such as multi-view networks and multi-scale networks. However, it is often difficult or costly to obtain multiple views or scales of an object in a medical image. We therefore take the opposite approach: extracting multi-scale features from a single fixed-view image instead of fixed-scale features from multi-view/multi-scale inputs. The proposed ML-xResNet achieves this by extracting multi-scale features for different sizes and morphologies without any change to the inputs.
Based on the experimental results, the proposed ML-xResNet achieves the best performance in both the ternary and binary classification tasks using the same architecture and hyperparameters. Shen et al. [17] also applied their method, MC-CNN, to both tasks; however, their binary classification accuracy was 24.68% higher than their ternary accuracy, while this gap is only 7.1% with our model. This reveals that our method transfers better across similar tasks, and indicates that our multi-level xRes strategy can explore and extract more powerful features than their multi-crop pooling strategy; hence, our method has better representation power for lung nodule images.
According to the results in
Table 1, the multi-level structure is effective for the lung nodule classification problem. However, as the number of levels grows, the parameter count becomes too large to yield better performance: overfitting outweighs the benefit of additional levels and feature scales. The same situation appears in
Table 2. We therefore use the dropout technique in our model. However, as shown, placing dropout layers after all the convolution layers gives a worse result than adding them only after the odd-numbered layers. This indicates that, if the network is too wide or too deep, performance worsens as the parameters grow. If the network is too wide, it produces redundant information, which causes overfitting; if it is too deep, the details of the features shrink to only a few pixels or disappear entirely, so the features can no longer be distinguished, which again leads to overfitting. Avoiding architectures that are too wide or too deep requires experience and experimentation for each task, which is what we did in this study. Moreover, the number of dropout layers and the dropout rate also affect the performance of the network. Therefore, selecting appropriate hyperparameters is very important.
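The dropout placement discussed above can be made concrete with a small sketch. The inverted-dropout implementation and the helper function below are illustrative assumptions of ours, and the 1-based layer indexing (dropout only after odd-numbered convolution layers) is simply one way to express the placement described in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, train=True):
    """Inverted dropout: zero a fraction `rate` of the units and rescale
    the survivors so the expected activation stays unchanged."""
    if not train or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def dropout_after_odd_layers(n_conv):
    """Map each 1-based conv-layer index to whether dropout follows it:
    only the odd-numbered layers (1st, 3rd, ...) are followed by dropout."""
    return {i: i % 2 == 1 for i in range(1, n_conv + 1)}

placement = dropout_after_odd_layers(6)
# layer 1 -> dropout, layer 2 -> no dropout, layer 3 -> dropout, ...
```

Halving the number of dropout layers in this way regularizes the network less aggressively than dropout after every convolution, which is consistent with the better result we observe for the odd-layers-only placement.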
The experimental evaluation indicates that the performance improvement is attributable both to the proposed multi-scale convolution method, which helps the network extract more effective features, and to the cross-residual strategy, which ensures that all scales of features share information during training and thus improves the identification ability. In addition, the selection of the dropout parameters helps the network minimize the influence of overfitting. Furthermore, the proposed method is a useful reference for medical imaging tasks whose classification objects vary in size and morphology, and our experimental procedure also illustrates the exploration process.