With labeled data severely scarce, using limited labeled data to improve model performance has become a significant problem in GIST detection. To fully utilize all data, including unlabeled data, we propose an SSL method based on self-training and design an Improved Faster R-CNN as the detection algorithm according to the characteristics of GISTs. The Improved Faster R-CNN, containing the multiscale module and the FEM, can better integrate multidimensional feature information and combine deep semantic information with shallow location information. We further developed a new pseudo-label selection strategy to improve the robustness of the model. By applying the dynamic threshold constraint and the IOU constraint to the prediction results on unlabeled data, reliable pseudo-labels can be retained for subsequent training. In the subsequent sections, we describe improvements to the Faster R-CNN by introducing two new modules aimed at characterizing GISTs (Section 3.1). Next, we introduce a novel pseudo-label selection strategy (Section 3.2) and outline our self-training approach (Section 3.3).
3.1. Improved Faster R-CNN
We optimized the Faster R-CNN to improve the accuracy of GIST detection and ensure the quality of pseudo-labels; the network structure is shown in Figure 2. In this paper, two optimization modules are proposed: (1) given the large variability of GISTs, a multiscale module was developed to use feature information at various levels; (2) the FEM was introduced to combine channel- and spatial-dimension information to cope with the complex background of GIST images.
One of the challenges of GIST detection is that the object scale varies widely. Using a single-layer feature map for prediction may limit accuracy because of the limited information it carries, so feature maps at different levels should be combined for detection. The traditional feature pyramid network (FPN) [27] can fuse low-level information with high-level information, but some problems remain: (1) the transmission path between low-level features and high-level features is too long, which makes the low-level information difficult to access; (2) although the FPN utilizes information from different layers, each layer only contains the information of the current layer and higher layers, and the missing location information of the lower levels is not conducive to small-target detection. In response to these problems, we improve the FPN by adding a bottom-up connection alongside the original top-down path. When producing the feature map of a given layer, the feature maps of the lower layers are bilinearly interpolated to the scale of that layer, and the fused result is then combined with that layer's feature map to obtain the output feature map. We choose the Inception [28] convolution block for feature-map fusion to avoid the excessive computation caused by a large convolution kernel. The improved FPN structure enables the feature map of each layer to contain both the semantic information of the deeper layers and the rich localization information of the first layer, helping the model perform better detection.
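The bottom-up fusion step can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: element-wise addition stands in for the Inception fusion block, and the helper names are assumptions.

```python
import numpy as np

def resize_bilinear(fmap, out_h, out_w):
    """Bilinearly resize a single-channel feature map to (out_h, out_w)."""
    in_h, in_w = fmap.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]              # vertical interpolation weights
    wx = (xs - x0)[None, :]              # horizontal interpolation weights
    top = fmap[y0][:, x0] * (1 - wx) + fmap[y0][:, x1] * wx
    bot = fmap[y1][:, x0] * (1 - wx) + fmap[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def fuse_bottom_up(lower_maps, target_map):
    """Resize lower-level maps to the target scale and fuse them in
    (simple addition as a stand-in for the Inception fusion block)."""
    h, w = target_map.shape
    fused = target_map.astype(float).copy()
    for m in lower_maps:
        fused += resize_bilinear(m, h, w)
    return fused
```

In the full network this fusion would run per channel on each pyramid level, so every output map carries both deep semantics and shallow localization cues.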
Another challenge of GIST detection is the difficulty of distinguishing the foreground from the background in CT images. The lesion area shares certain similarities with the surrounding background, and it is hard for the basic model to separate the object, so the FEM is introduced. The feature map obtained through convolution only contains the spatial information in the local receptive field and lacks the connection between each channel. If the information of each channel is only processed globally, the information interaction within the space is missed. Our FEM uses both a channel attention mechanism and a spatial attention mechanism to enhance feature representation, highlight relevant features of the GIST lesion area, and suppress background noise, thus enhancing the feature extraction ability of the network.
We use the channel attention mechanism (CAM) [29] to model the correlation between the channels and obtain the weight of each channel. The process can be written as follows:

Mc(F) = σ(C(MaxPool(F)) + C(AvgPool(F)))

where F represents the original feature map, MaxPool and AvgPool denote the max pooling and the average pooling, C denotes a convolution block whose kernel size is 1 × 1 and whose bottleneck reduces the number of channels to 1/r times the original, and σ is the sigmoid function.
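As a concrete illustration, here is a minimal NumPy sketch of channel attention over a (C, H, W) feature map. The two-layer bottleneck with reduction ratio r is an assumption in the spirit of CBAM; the learned weights are replaced by random ones for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, r=2, seed=0):
    """Per-channel attention weights for a (C, H, W) feature map (sketch).

    The shared bottleneck (1x1 convs on pooled vectors, equivalent to dense
    layers) uses random weights here; in practice they are learned."""
    C = F.shape[0]
    max_pool = F.max(axis=(1, 2))        # (C,) global max pooling
    avg_pool = F.mean(axis=(1, 2))       # (C,) global average pooling
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((C // r, C)) * 0.1   # reduce channels to C/r
    W2 = rng.standard_normal((C, C // r)) * 0.1   # restore to C
    weights = sigmoid(W2 @ np.maximum(W1 @ max_pool, 0)
                      + W2 @ np.maximum(W1 @ avg_pool, 0))
    return weights                        # per-channel weights in (0, 1)
```

The reweighted feature map is then `F * channel_attention(F)[:, None, None]`, amplifying channels that respond to the lesion and damping the rest.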
The spatial attention mechanism (SAM) [29] is used to model the correlation between spatial positions on the feature map and to calculate the weight of each position. The feature map is calculated as follows:

Ms(F) = σ(C(MaxPool(F) ∥ AvgPool(F)))

where C denotes a convolution with kernel size k × k, ∥ represents merging in the channel dimension, and the pooling here is performed along the channel dimension. The FEM refers to BAM [30] and establishes a parallel connection between the CAM and the SAM. Finally, the calculating process can be expressed as follows:

F′ = F + F ⊗ σ(Mc(F) + Ms(F))

where ⊗ denotes element-wise multiplication.
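A minimal sketch of the parallel combination, assuming the BAM-style residual form (sum the two attention logit maps, apply a sigmoid, and refine the features residually); the argument shapes are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fem(F, mc, ms):
    """Parallel channel + spatial attention refinement (BAM-style sketch).

    F  : (C, H, W) feature map
    mc : (C,)      channel attention logits from the CAM branch
    ms : (H, W)    spatial attention logits from the SAM branch
    """
    attention = sigmoid(mc[:, None, None] + ms[None, :, :])  # broadcast-sum
    return F + F * attention                                 # residual refinement
```

Broadcasting expands both branches to the full (C, H, W) grid before the sigmoid, so each location is weighted jointly by its channel and spatial relevance.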
3.2. Pseudo-Label Selection Strategy
The correctness of pseudo-labels is crucial for subsequent training iterations: if incorrect pseudo-labels are added to the dataset, they will hinder the optimization of the model parameters. To this end, we designed a pseudo-label selection strategy based on the dynamic threshold and the IOU constraint, which effectively retains the pseudo-labels most likely to be correct and helps the model converge.
Selecting pseudo-labels with a fixed threshold has notable drawbacks. If the threshold is set too high, the model will filter out candidate bounding boxes in the target area and prevent them from being added to the pseudo-label set, leading to a large number of false-negative examples in the subsequent training phase. Conversely, if the threshold is set too low, numerous candidate bounding boxes in nontarget areas will be added to the pseudo-label set, generating many false-positive examples in the next round of training. In fact, as training progresses, the network's detection ability gradually improves, and the validity of the generated pseudo-labels rises. Therefore, the threshold used to choose pseudo-labels should be dynamic. To prevent incorrect pseudo-labels from influencing model training, we set a high selection threshold at the early stage of training; as training proceeds, we gradually lower this threshold so that correct pseudo-labels are not eliminated. The threshold θq used in the q-th round accordingly starts high and decreases monotonically with q.
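The schedule below is a hypothetical sketch only: the paper's exact formula is not reproduced here, and the start/end values and linear decay are illustrative assumptions capturing the stated behavior (start high, decrease each round).

```python
def dynamic_threshold(q, Q, theta_start=0.9, theta_end=0.7):
    """Confidence threshold for round q of Q (hypothetical linear schedule).

    Starts at theta_start in round 1 and decays linearly to theta_end in
    round Q, mirroring the described high-then-lower behavior."""
    if Q <= 1:
        return theta_start
    return theta_start - (theta_start - theta_end) * (q - 1) / (Q - 1)
```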
Based on the dynamic threshold, we created a new IOU constraint. The IOU constraint sets the condition for the retention of pseudo-labels. Only when the IOU between the detection results of various transformed images is higher than 0.9 do we regard the bounding box as the pseudo-label.
Figure 3 shows the results after applying the IOU constraint. Comparing the original results with the true label shows that the detection box on the right is a false-positive example, which would hinder the optimization of the model if kept as a pseudo-label. After the IOU constraint is applied, the false-positive bounding box is successfully excluded, which further ensures the quality of the pseudo-labels.
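The IOU constraint can be sketched as follows. The pairwise check across all transformed views is our interpretation of "the IOU between the detection results of various transformed images"; the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def passes_iou_constraint(boxes, thresh=0.9):
    """Keep a pseudo-label only if the detections from all transformed
    views overlap pairwise with IOU above the threshold."""
    return all(iou(boxes[i], boxes[j]) >= thresh
               for i in range(len(boxes)) for j in range(i + 1, len(boxes)))
```

A box that appears in only some views (or drifts between them) fails the check, which is exactly how the false positive in Figure 3 gets excluded.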
By synthesizing the dynamic threshold and the IOU constraint, the selection criterion for pseudo-labels in the q-th round can be expressed as follows:

sij = 1 if cij ≥ θq and IOUij ≥ 0.9, and sij = 0 otherwise

where i and j index the j-th bounding box of the i-th image, cij is the confidence of the bounding box, IOUij is the minimum IOU of this bounding box across the detection results of the transformed images, and sij = 1 represents that the pseudo-label corresponding to this bounding box is retained, with the opposite being discarded.
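The combined criterion reduces to a two-condition test per candidate box; this small sketch assumes the minimum cross-view IOU has already been computed, and the parameter names are illustrative.

```python
def keep_pseudo_label(confidence, min_view_iou, theta_q, iou_thresh=0.9):
    """Selection indicator s_ij for one candidate box in round q.

    confidence   : detector confidence c_ij of the box
    min_view_iou : smallest IOU of the box across the transformed views
    theta_q      : dynamic confidence threshold for this round"""
    return 1 if (confidence >= theta_q and min_view_iou >= iou_thresh) else 0
```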
3.3. Self-Training Method
In this paper, we propose a semi-supervised GIST detection algorithm based on the self-training method, which aims to improve the effectiveness of GIST detection with a small amount of labeled data and a large amount of unlabeled data. The whole procedure of the GIST detection is shown in Algorithm 1.
Algorithm 1 Procedure of Semi-Supervised GIST Detection.
Input: Labeled data, L; Unlabeled data, U; Data augmentation strategies T1, …, TK
Output: Trainable parameters of network, W
1: Initialize hyperparameters: rounds of iteration Q, times of data augmentation K, threshold θ1
2: Pretrain the model on L to get the initial parameters W
3: for q = 1 to Q do
4:   for k = 1 to K do
5:     Use W to predict on Tk(U)
6:   end for
7:   Fuse the K groups of detection results
8:   Filter the results according to (7) to obtain the set of pseudo-labels P
9:   Reassemble to acquire the new training set: L′ = L ∪ P
10:  Retrain the model on L′ to acquire the new parameters W
11: end for
12: Return W
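The control flow of Algorithm 1 can be sketched as a short skeleton. Here `detect`, `retrain`, and `select` are placeholders standing in for Improved Faster R-CNN inference, training, and the pseudo-label selection strategy of Section 3.2; they are not the paper's actual routines.

```python
def self_train(labeled, unlabeled, transforms, detect, retrain, select, Q=3):
    """Skeleton of the self-training loop in Algorithm 1."""
    W = retrain(None, labeled)                      # line 2: pretrain on L
    for q in range(1, Q + 1):                       # lines 3-11
        pseudo = []
        for image in unlabeled:
            # lines 4-6: predict on each augmented view of the image
            views = [detect(W, t(image)) for t in transforms]
            # lines 7-8: fuse views and filter to reliable pseudo-labels
            pseudo.extend(select(views, q, Q))
        # lines 9-10: retrain on labeled data plus retained pseudo-labels
        W = retrain(W, labeled + pseudo)
    return W                                        # line 12
```

With toy stand-ins (a detector returning fixed-confidence boxes and a retrainer returning the training-set size) the loop runs end to end, which makes the data flow between the steps easy to trace.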
For the labeled data, the labels are the actual bounding boxes, and the confidence is set to 1. We first train with the labeled data to obtain the initial model and then apply different data augmentation strategies to the unlabeled data, following the data distillation method proposed by Radosavovic et al. [31].
The data augmentation strategies used in this paper mainly include flipping, rotation, and affine transformation. When flipping is chosen as the data augmentation method, the corresponding detection result needs to be flipped as well. For the affine transformation, the translation range is set to no more than 10 pixels, so the position of the bounding box does not vary greatly and can remain unchanged. For the rotation operation, the given rotation angle is either an integer multiple of 90° or less than 10°. When the angle does not exceed 10°, the position of the bounding box stays unchanged, as with the affine transformation; when the image is rotated 90° clockwise, the coordinates of the corresponding bounding box need to be rotated 90° counterclockwise, and so on for the other angles.
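The 90°-clockwise case can be made concrete. This sketch maps a box detected on the rotated image back to the original image (i.e. rotates it 90° counterclockwise); continuous pixel coordinates and the (x1, y1, x2, y2) box convention are assumptions.

```python
def unrotate_box_90cw(box, orig_height):
    """Map a box detected on the 90-degree-clockwise-rotated image back
    to the original image, i.e. rotate the box 90 degrees counterclockwise.

    Under the forward rotation, a point (x, y) of an image with height H
    maps to (H - y, x); inverting that mapping for the two box corners
    gives the expression below."""
    x1, y1, x2, y2 = box
    return (y1, orig_height - x2, y2, orig_height - x1)
```

For example, a box at (10, 20, 30, 40) in a 100-pixel-tall image appears at (60, 10, 80, 30) after the clockwise rotation, and the function recovers the original coordinates.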
The initial model detects the images after each of the 1 to K data augmentations, and all the results are fused to obtain the pseudo-labels. The pseudo-labels generated by prediction may contain errors; for this reason, we use the dynamic threshold and the IOU constraint to enhance their quality. The dynamic threshold is a threshold on the confidence of the pseudo-labels that changes over training, using a higher value in the early stages and progressively lowering it as training progresses. The IOU constraint restricts the overlap between the bounding boxes predicted by the initial model on the 1 to K augmented versions of an image: after the transformations are undone, a bounding box is used as a pseudo-label for that image only if it appears on all the augmented images with a similar position and size.
After clean pseudo-labels are obtained through the above-mentioned constraint strategies, Mixup [9] is used to linearly mix the labeled data and the pseudo-labeled data. The new samples acquired after mixing are then used once more for training, which can substantially enhance the network's generalization capacity.
Mixup is a crucial part of the MixMatch [16] framework; it enables the model to obtain better generalization performance by linearly interpolating pairs of training samples. The traditional Mixup is designed for image classification tasks, where each image is associated with one class label. The generated image x̃ and its label ỹ can be defined as follows:

x̃ = λx + (1 − λ)x′
ỹ = λy + (1 − λ)y′

where x and x′ denote two different images, y and y′, respectively, denote their probabilities of the corresponding class, and λ ∈ [0, 1].
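A minimal sketch of classification Mixup; drawing λ from a Beta(α, α) distribution follows the standard Mixup recipe and is an assumption here, as is the α value.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, seed=0):
    """Classical Mixup: blend two images and their probability labels
    with a Beta-distributed mixing weight lambda."""
    lam = float(np.random.default_rng(seed).beta(alpha, alpha))
    x_mixed = lam * x1 + (1 - lam) * x2     # pixel-wise image blend
    y_mixed = lam * y1 + (1 - lam) * y2     # matching label blend
    return x_mixed, y_mixed
```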
Since the data used in this paper are annotated with the bounding box of the lesion, we opted for image-level Mixup rather than classification Mixup. The generated label ỹk and its confidence c̃k in image-level Mixup can be calculated as follows:

ỹk = x with c̃k = λy for bounding boxes from the first image, and ỹk = x′ with c̃k = (1 − λ)y′ for bounding boxes from the second image

where ỹk denotes the k-th label on the generated image x̃, x and x′ denote bounding boxes on the two different images, y and y′, respectively, represent their confidences, and λ ∈ [0, 1].
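The scheme above can be sketched as follows. Keeping every box from both source images and scaling each confidence by its image's mixing weight is our reading of the described image-level Mixup; the fixed λ and the (box, confidence) pair representation are illustrative assumptions.

```python
import numpy as np

def image_level_mixup(img1, boxes1, img2, boxes2, lam=0.6):
    """Image-level Mixup for detection (sketch).

    boxes1/boxes2 are lists of (box, confidence) pairs; the images must
    have the same shape. All boxes are kept, and each confidence is
    scaled by the mixing weight of its source image."""
    mixed_image = lam * img1 + (1 - lam) * img2
    mixed_labels = ([(b, lam * c) for b, c in boxes1]
                    + [(b, (1 - lam) * c) for b, c in boxes2])
    return mixed_image, mixed_labels
```

Because the boxes themselves are unchanged, only their confidences reflect the blend, which matches treating labeled boxes (confidence 1) and pseudo-labeled boxes uniformly during retraining.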