Figure 1.
The pipeline of the training and testing procedures. The regressor outputs are fixed and passed to the detector as inputs. The regressor and the detector are trained in consecutive steps: the regressor is trained first, and then the detector is trained.
Although the whole framework is called a two-stage framework, the detection network in the second stage is trained end-to-end and is easy to deploy. The detailed architecture of the Depth Score Regressor presented in Stage I of Figure 1 is further elaborated on the left side of Figure 2, while the right side of Figure 2 provides an in-depth breakdown of the Lesion Detector’s internal structure. Figure 2 thus shows each individual neural network block and layer involved in processing the input slices to generate the depth scores, as well as those involved in processing the depth scores and input slices to generate the final lesion detection output. This detailed depiction aims to offer a clear understanding of the components and operations within the regressor and the detector, complementing the overall process flow illustrated in Figure 1.
In the depth score regressor, the starting slice
j and the slice interval
k of the input slices are determined randomly. The network’s layers include convolution, rectified linear unit (ReLU), and max pooling, with parameters adopted from ImageNet pre-trained CNN models, such as AlexNet [
38] or VGG-16 [
20]. After these layers, a new convolutional layer, Conv6, with 512 1 × 1 filters and a stride of 1, followed by a ReLU layer, is added. Conv1–Conv6 are used to learn discriminative deep image features for depth score regression. Subsequently, a global average pooling layer summarizes each of the 512 activation maps to one value, resulting in a 512D feature vector. Finally, a fully connected (FC) layer projects the feature vector to the slice score. The individual blocks and layers of the depth score regressor can be found in
Table 1.
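To make this architecture concrete, a minimal PyTorch sketch of the regressor is given below. It is an illustration rather than the released implementation; the class name, the use of torchvision's VGG-16 weights, and the exact tensor shapes are assumptions carried over from the description above.

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class DepthScoreRegressor(nn.Module):
    """Sketch of the depth score regressor: pre-trained Conv1-Conv5 blocks,
    a new 1x1 Conv6 + ReLU, global average pooling, and an FC projection."""

    def __init__(self):
        super().__init__()
        # Conv1-Conv5 adopted from an ImageNet pre-trained VGG-16.
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features
        # New Conv6: 512 filters of size 1x1 with stride 1, followed by ReLU.
        self.conv6 = nn.Sequential(
            nn.Conv2d(512, 512, kernel_size=1, stride=1),
            nn.ReLU(inplace=True),
        )
        # Global average pooling summarizes each of the 512 maps to one value.
        self.gap = nn.AdaptiveAvgPool2d(1)
        # FC layer projects the 512-D feature vector to the slice (depth) score.
        self.fc = nn.Linear(512, 1)

    def forward(self, x):             # x: (N, 3, H, W) input slices
        x = self.conv6(self.features(x))
        x = self.gap(x).flatten(1)    # (N, 512) feature vector
        return self.fc(x).squeeze(1)  # (N,) predicted depth scores
```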
Following the 3DCE network, we adopt the novel Dense 3DCE R-FCN model [
3], which extracts feature maps at various scales from the Conv1 to Conv5 layers. Each convolution block in the feature extraction network (Conv1 to Conv5) has 64, 128, 256, 512, and 512 filters, respectively. The pathways of Conv2 and Conv4 are omitted since they are redundant. These feature maps are then processed through a convolutional layer for dimension reduction, depicted as a series of Conv6 layers in Figure 2. A reshaping operation concatenates the feature maps, generating 3D context information. This is followed by a PSROI pooling layer. To normalize the feature maps, an L2 norm layer is applied after the PSROI pooling layer [
39], ensuring all level features are obtained. Finally, another concatenation operation combines these features, and the fully connected layers at the end of the pipeline generate the final prediction results. The individual blocks and layers of the pre-trained model and the lesion detector can be found in
Table 2 and
Table 3.
The gray RPN block is designed to generate proposals. The RPN subnetwork processes feature maps extracted from the central images containing ground-truth information. Ultimately, the detection pathway on the right produces the final classification results and detection bounding boxes. The blue and red dotted boxes represent the original and auxiliary losses, respectively, with their functions detailed in the subsequent sections.
3.1. Novel Dense 3DCE R-FCN
In CT slices, the lesion sizes can vary significantly. For example, the long diameters of lesions in the DeepLesion dataset range from 0.42 mm to 342.5 mm, while the short diameters range from 0.21 mm to 212.4 mm [
22]. The smallest lesion is nearly 1000 times smaller than the largest one. Such large differences in lesion size pose significant challenges for lesion detection and result in a high rate of false positives (FPs). Thus, we utilize multi-level and multi-resolution feature maps and introduce a dense connection mechanism to meet the needs of both the ULD and MOLD tasks in a dataset containing lesions of widely varying sizes.
We adopt the 3DCE R-FCN model as the backbone of our framework, as mentioned in
Section 2.3. The 3DCE R-FCN network is formulated with reference to the original R-FCN model [
21] but includes four additional layers, namely one fully connected (FC) layer, one rectified linear unit (ReLU) activation layer, and two FC layers for the final prediction results. Compared with the Faster R-CNN model, the R-FCN network can utilize the position information of a CT slice image through a PSROI pooling layer. We use the VGG-16 CNN model [
20] as the feature extractor, just as described in [
1], removing the 4th and 5th pooling layers to maintain the resolution of the feature maps and prevent the feature maps from becoming too small. As shown in
Figure 2, in the second stage, every three consecutive images are viewed as one 3-channel image (one sample), which is then used as the input to the feature extraction network. During training, the central image provides the bounding box location information, and the other slices offer the 3D context information [
1]. For each convolution block in the feature extraction network from Conv1 to Conv5, the number of filters is 64, 128, 256, 512, and 512, respectively. The RPN subnetwork only accepts the feature maps extracted from the central images containing the ground-truth information.
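As a small illustration of this input arrangement (a hypothetical helper, not the authors' data loader), consecutive slices can be grouped into 3-channel samples as follows:

```python
import numpy as np

def build_samples(volume, num_context=3):
    """Group consecutive CT slices into 3-channel samples (sketch).

    volume: (S, H, W) array of CT slices.
    Returns a list of (num_context, H, W) samples; the central slice of each
    sample provides the ground-truth boxes, while the neighbouring slices
    only contribute 3D context.
    """
    half = num_context // 2
    samples = []
    for center in range(half, volume.shape[0] - half):
        samples.append(volume[center - half: center + half + 1])
    return samples

# Example: a 10-slice volume yields 8 overlapping 3-channel samples.
vol = np.zeros((10, 512, 512), dtype=np.float32)
print(len(build_samples(vol)))  # -> 8
```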
CNNs convolve and pool images to extract increasingly discriminative features, which are more suitable for detecting large lesions than small ones, because the deeper the feature extractor is, the lower the resolution of its feature maps. These lower-resolution feature maps make it difficult to detect small lesions, which occupy only a few pixels. Thus, it is important to make full use of feature maps at various resolution scales. Based on the 3DCE network, we follow the novel Dense 3DCE R-FCN model [3], which extracts feature maps at various scales from the Conv1 layer to the Conv5 layer, as shown in the second stage in Figure 2. These feature maps integrate image features that are shallow but high in resolution, intermediate but complementary, and deep but rich in semantic information [
40]. All these feature maps are then delivered to one convolutional layer for dimension reduction, which can be seen as a series of Conv6 layers in
Figure 2. After that, a reshaping operation concatenates the feature maps together and generates 3D context information. A PSROI pooling layer then follows. In order to normalize the feature maps, which have different amplitudes, an L2 norm layer is used after the PSROI pooling layer [39]. In this way, features from all levels are obtained. Finally, another concatenation operation is used to combine these all-level features, and the fully connected layers at the end of the pipeline produce the final prediction results.
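The multi-scale aggregation just described can be sketched as follows. This is an illustrative PyTorch fragment, not the released implementation; the channel counts follow the Conv1–Conv5 description above, while the reduced channel dimension, the 7 × 7 bin grid, and the use of torchvision's position-sensitive ROI pooling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import ps_roi_pool

# Channel counts of the Conv1-Conv5 stages as described in the text.
stage_channels = [64, 128, 256, 512, 512]
POOL = 7  # assumed bin grid for position-sensitive ROI pooling

# One 1x1 dimension-reduction layer per stage ("series of Conv6 layers");
# POOL*POOL output channels give one score map per spatial bin (assumption).
reducers = nn.ModuleList(nn.Conv2d(c, POOL * POOL, kernel_size=1)
                         for c in stage_channels)

def dense_pool(features, rois, strides):
    """Reduce, position-sensitively pool, and L2-normalize every stage,
    then concatenate all levels along the channel axis (illustrative only)."""
    pooled = []
    for f, reduce, s in zip(features, reducers, strides):
        f = reduce(f)                              # dimension reduction
        p = ps_roi_pool(f, rois, output_size=POOL,
                        spatial_scale=1.0 / s)     # PSROI pooling
        pooled.append(F.normalize(p, p=2, dim=1))  # L2 norm per ROI
    return torch.cat(pooled, dim=1)                # all-level features

# Toy check: one image, one ROI, strides 1, 2, 4, 8, 8 (pool4/pool5 removed).
feats = [torch.randn(1, c, 512 // s, 512 // s)
         for c, s in zip(stage_channels, [1, 2, 4, 8, 8])]
rois = torch.tensor([[0., 100., 100., 200., 200.]])  # (batch_idx, x1, y1, x2, y2)
out = dense_pool(feats, rois, strides=[1, 2, 4, 8, 8])
print(out.shape)  # -> torch.Size([1, 5, 7, 7])
```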
To provide a clearer comparison between our proposed model and the 3DCE R-FCN model described in [
1], the framework architecture of 3DCE R-FCN is illustrated in
Figure S1 of the Supplementary Materials. As depicted in
Figure S1, the primary distinction from our proposed method is that the 3DCE R-FCN model employs a one-stage, end-to-end training architecture. Additionally, it does not incorporate the dense pathways or the auxiliary losses that are highlighted in the red dotted boxes in
Figure 2.
3.2. Skipped-Layer Hierarchical Training
The employment of the dense connections described in
Section 3.1 can easily cause gradient vanishing and model degradation [
36] because the depth and width of the model are both increased. Thus, we also need to overcome these limitations. An effective method is to strengthen the supervision of the middle layers. As shown in
Figure 2, in the second stage, the method presented by Zhang et al. [
3] employs some auxiliary losses after each pooled 3D context feature map, which can be seen from the red dotted boxes in
Figure 2, as well as the gray part (gray “Auxiliary Loss 2” and “Auxiliary Loss 4”) in
Figure 7. Rather than only having the classification and regression losses at the output layer, these auxiliary losses can further provide integrated optimization via direct supervision of the earlier hidden layers (from Conv1 to Conv5). Furthermore, these “auxiliary losses” can speed up the network’s convergence through their supervision pathways [
36]. From [
3], it is clear that through these additional pathways, the model is also forced to learn adequately discriminative features from the shallow layers, which can boost the detection performance, especially on small lesions.
However, utilizing all these auxiliary losses throughout the whole training process is not an optimal strategy. Training a CNN is an iterative optimization process that searches for the best-fitting model: in the early stages, the network focuses on coarse-grained learning, while in the later stages, fine-grained learning gradually becomes the emphasis. Therefore, we improve the DAL strategy in [3] into the SHT strategy to achieve better performance. As shown in Figure 7, in the first stage of training, we adopt all the auxiliary losses, which are denoted as dense auxiliary losses in [3]. Then, only the skipped-layer losses from the Conv1, Conv3, and Conv5 pathways are retained in the second stage of training. This hierarchical training strategy can also reduce overfitting to some extent.
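The skipped-layer hierarchical schedule can be summarized in a few lines of training-loop code (a sketch only; the epoch counts follow the text, and the names and the loss-combination comment are hypothetical):

```python
# Sketch of the SHT schedule.
ALL_PATHWAYS = [1, 2, 3, 4, 5]   # dense auxiliary losses from the Conv1-Conv5 pathways
SKIPPED_PATHWAYS = [1, 3, 5]     # skipped-layer losses kept in the second training stage

def active_pathways(epoch, switch_epoch=4):
    """All auxiliary losses for the first 4 epochs, skipped-layer losses afterwards."""
    return ALL_PATHWAYS if epoch < switch_epoch else SKIPPED_PATHWAYS

for epoch in range(8):
    pathways = active_pathways(epoch)
    # total_loss = original_losses + sum of the auxiliary losses on `pathways` (hypothetical)
    print(epoch, pathways)
```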
We minimize the objective functions following the multi-task loss in Faster R-CNN [
20]. These losses refer to the classification loss ($L_{cls}$), the regression loss ($L_{reg}$), and the depth-aware loss ($L_{dep}$), which is highlighted in the green dotted box in
Figure 2. The classification loss ($L_{cls}$) and the regression loss ($L_{reg}$) each consist of two parts. The first part is the auxiliary losses, which are highlighted inside the red dotted boxes in Figure 2, while the second part is the original classification and regression losses as in [1], which are located at the end of the whole framework and are highlighted inside the blue dotted boxes in Figure 2. These losses are not combined in the network but are jointly optimized through backpropagation during training. As shown in Equation (1) (note: Equations (1) to (7) are all described for the MOLD task, which includes the background class and eight lesion classes, but not the ULD task), the loss function contains three components: the classification loss ($L_{cls}$), the bounding box regression loss ($L_{reg}$), and the normalized depth score regression loss ($L_{dep}$). In the training process, we set $\lambda_1$ and $\lambda_2$ to control the relative importance among these three components. The details of $L_{dep}$ are described in Section 3.3.2.
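Equation (1) is not reproduced here; under the assumption that the two weights scale the regression and depth terms, a plausible form consistent with the description above is:

$L = L_{cls} + \lambda_1 L_{reg} + \lambda_2 L_{dep}.$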
As shown in Equation (2), the classification loss $L_{cls}$ consists of two components, the RPN classification loss $L_{cls}^{rpn}$ and the detection classification loss $L_{cls}^{det}$. In Equation (3), we set $N_{cls}$ = 256 because, in the training process of the RPN sub-network, we selected 256 anchors in a mini-batch. $i$ is the index of an anchor in the mini-batch, and $I$ is the anchor set in a mini-batch. $p_i$ is the predicted probability of the $i$-th anchor being the foreground object, while $p_i^{*}$ denotes the ground-truth label of the anchor (1 for positive and 0 for negative).
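For reference, the standard Faster R-CNN forms that Equations (2) and (3) follow (the exact notation used in the paper's equations is assumed) are:

$L_{cls} = L_{cls}^{rpn} + L_{cls}^{det}, \qquad L_{cls}^{rpn} = \frac{1}{N_{cls}} \sum_{i \in I} -\left[ p_i^{*} \log p_i + (1 - p_i^{*}) \log (1 - p_i) \right].$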
In Equation (4), the $L_{cls}^{det}$ term is normalized by the mini-batch size (i.e., $N_{cls}$ = 256). Unlike the classification loss in the RPN sub-network, which is a binary cross-entropy loss, the classification loss in the detection network is a multi-class cross-entropy loss. $d$ is the index of the supervision pathway, where $d \in \{1, 2, 3, 4, 5\}$ in the first 4 training epochs, which indicates that a total of 5 pathways take effect in the first training stage. Then, we set $d \in \{1, 3, 5\}$ in the following 4 epochs since the second and fourth pathways have been removed. $c$ represents the class of a lesion in the dataset, and $C$ is the number of lesion classes (8), which means that there is a total of 9 classes, including the background class. Other parameters, which are not mentioned here, are the same as those in Equation (3).
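A plausible form of Equation (4), i.e., a multi-class cross-entropy loss averaged over the mini-batch and summed over the active supervision pathways (the notation is assumed), is:

$L_{cls}^{det} = \sum_{d} \frac{1}{N_{cls}} \sum_{i \in I} \sum_{c=0}^{C} - p_{d,i,c}^{*} \log p_{d,i,c},$

where $p_{d,i,c}$ is the predicted probability that ROI $i$ belongs to class $c$ on pathway $d$, and $p_{d,i,c}^{*}$ is the corresponding one-hot ground-truth label.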
Similar to the classification loss, as shown in Equation (5), the regression loss ($L_{reg}$) also consists of two components, the RPN regression loss $L_{reg}^{rpn}$ and the detection regression loss $L_{reg}^{det}$. Both regression losses can be calculated through Equation (6). In Equation (6), $p_i^{*}$ serves as an indicator function, which aims to ignore the regression loss of the background ROIs by taking the value 1 if the $i$-th anchor is positive and 0 otherwise. $N_{reg}$ represents the number of anchor locations. $t_{d,i} = (t_x, t_y, t_w, t_h)$ denotes the parameterized coordinates of the predicted bounding box of the $d$-th supervision pathway, where $x$, $y$, $w$, and $h$ denote the box's center coordinates and its width and height, respectively. $t_{d,i}^{*}$ represents the parameterized coordinates of the ground-truth box associated with a positive anchor. Other parameters, which are not mentioned here, are the same as those in Equation (4). The smooth $L_1$ loss is defined in Equation (7), and the details can be found in [25]. In the RPN training process, we set $\sigma$ to 3, while in the detection process, it is set to 1, which is the same as in [1,25]. According to the explanation from the authors of [25], setting $\sigma$ to 3 in the training process makes the transition point from quadratic to linear happen at $|x| < 1/\sigma^2$, which is closer to the origin. The reason for this is that the RPN bounding box regression targets are not normalized by their standard deviations, unlike in Fast R-CNN [24], because the statistics of the targets are constantly changing throughout learning.
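For completeness, the standard forms that Equations (5)–(7) follow (Fast/Faster R-CNN notation with the $\sigma$ parameterization; the exact symbols used in the paper are assumed) are:

$L_{reg} = L_{reg}^{rpn} + L_{reg}^{det}, \qquad L_{reg}^{rpn/det} = \frac{1}{N_{reg}} \sum_{i} p_i^{*}\, \mathrm{smooth}_{L_1}\!\left(t_{d,i} - t_{d,i}^{*}\right),$

$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,(\sigma x)^2, & \text{if } |x| < 1/\sigma^2 \\ |x| - 0.5/\sigma^2, & \text{otherwise.} \end{cases}$

With $\sigma = 3$, the quadratic-to-linear transition moves from $|x| < 1$ to $|x| < 1/9$, i.e., closer to the origin.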
3.4. Participants and Dataset
We have evaluated the proposed method on the publicly available DeepLesion dataset [22], which contains 32,735 lesions on 32,120 axial CT slices from 10,594 CT studies of a total of 4427 patients. Different from other datasets, which separate lesions into benign and malignant categories, the DeepLesion dataset does not classify the lesions as benign or malignant. Each CT slice has 1–3 lesions with corresponding bounding boxes and size measurements. All lesions in the dataset have been annotated with RECIST diameters, including the long and short diameters. The resolution of most of the images is 512 × 512, while only 0.12% of the images have a resolution of 768 × 768 or 1024 × 1024. To investigate the lesion types in DeepLesion, the authors in [
22] randomly chose 9816 lesions and manually categorized them into eight types with different proportions: bone (BN), abdomen (AB), mediastinum (ME), liver (LV), lung (LU), kidney (KD), soft tissue (ST), and pelvis (PV) [
18]. The mediastinum lesions are mainly lymph nodes [22]. Abdomen lesions are miscellaneous lesions that are not located in the liver or kidneys. Soft tissue lesions include those in the muscle and skin, among others [22].
In order to better evaluate the performance of the proposed method, we extracted one small lesion dataset and one multi-organ lesion dataset from the original DeepLesion dataset. To build the small lesion dataset, we selected the lesions whose areas are less than 1% of that of the largest lesion. We extracted the multi-organ lesion dataset because only about 30% of the samples in the DeepLesion dataset are given category information; therefore, only 9816 lesions on 9626 axial slices were used for the multi-organ comparison experiments. The statistical details of the official DeepLesion dataset and its subsets can be found in Table 4. For the original DeepLesion dataset and the small lesion dataset, we focus on the ULD task, which only distinguishes lesions from non-lesions. For the multi-organ lesion dataset, we focus on the MOLD task, which produces bounding boxes together with specific lesion types. Our method's effectiveness can be further investigated by testing it on these three datasets.
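For illustration, the small-lesion selection rule can be expressed with a short snippet; the area values below are hypothetical, and the official DeepLesion annotations would first need to be converted to lesion areas.

```python
import numpy as np

def small_lesion_mask(areas, ratio=0.01):
    """Mark lesions whose area is less than 1% of the largest lesion's area."""
    areas = np.asarray(areas, dtype=float)
    return areas < ratio * areas.max()

# Toy example with four lesion areas (mm^2): only the first two qualify.
print(small_lesion_mask([5.0, 80.0, 9000.0, 12000.0]))  # -> [ True  True False False]
```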