Different objects with various shapes, densities, and thicknesses exhibit different degrees of scattering and absorption in X-ray imaging. These effects can degrade the quality and clarity of the resulting images. Moreover, multiple objects may overlap in the image, making it difficult to distinguish their boundaries. This complexity poses challenges for detection tasks, especially when smaller prohibited items are obscured by larger objects, leading to missed detections and lower recall rates. The original YOLOv5 model struggles to address these issues effectively. To tackle these challenges, we propose ScanGuard-YOLO, whose overall network architecture is depicted in Figure 1; specific input/output details are given in Table 1. The SPPF module within the baseline model’s backbone is a fast spatial pyramid pooling module that samples the input feature map using pooling kernels of different equivalent sizes. It then concatenates the multiple sampled feature maps in the depth dimension to create a feature representation with rich semantic information. However, this module employs max-pooling for sampling, which discards some pixels in the input feature map, resulting in the loss of certain fine-grained details. The max-pooling operation also requires comparing the pixels within each pooling window and selecting the largest as the output feature, which is computationally intensive. Hence, in this study, we replaced the SPPF module with the RFB-s module, which expands the receptive field without reducing resolution, enhancing the network’s ability to perceive global information. In the neck section, an efficient multiscale feature fusion module, termed efficient RepGFPN, was constructed. Its purpose is to enhance interactions between features of varying scales, allowing for better adaptation to changes in object scale across different scenarios. To cope with the data distribution characteristics of different X-ray prohibited item datasets, a detection head containing an attention mechanism was introduced to strengthen the correlations among the fused features and help the model focus on useful ones. Finally, WIOUv3 was adopted as the bounding box regression loss to optimise the model's training strategy for hard and easy samples and thereby improve its accuracy.
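As a concrete illustration of the pooling pyramid described above, the following is a minimal PyTorch sketch of an SPPF-style block (class and argument names are illustrative, not the exact YOLOv5 code); chaining equal-size max-pools emulates pooling with progressively larger kernels before the depth-wise concatenation:

```python
import torch
import torch.nn as nn

class SPPFSketch(nn.Module):
    """Minimal SPPF-style block: three chained 5x5 max-pools whose outputs
    are concatenated with the input along the channel (depth) dimension."""
    def __init__(self, channels: int, pool_size: int = 5):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=pool_size, stride=1,
                                 padding=pool_size // 2)
        # 1x1 conv to fuse the concatenated maps back to `channels`.
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p1 = self.pool(x)    # effective window ~5x5
        p2 = self.pool(p1)   # effective window ~9x9
        p3 = self.pool(p2)   # effective window ~13x13
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```

Chaining equal-size pools enlarges the effective window step by step, which is how the fast variant reproduces the multi-kernel behaviour of the original spatial pyramid pooling at lower cost.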
2.2.2. Efficient Multiscale Feature Fusion
The backbone network is usually a deep convolutional neural network that extracts feature representations of an image from shallow to deep layers, with different receptive fields and semantic information at each level. The traditional feature pyramid adopts a top-down fusion strategy, in which upsampled deep feature maps are sequentially fused with shallow feature maps along a top-down path to obtain a feature pyramid with multiscale information. However, with only a single top-down information flow path, the deeper feature maps may have already lost some detailed information [23]. To compensate for the limitations of this unidirectional flow, PANet [24] adds a bottom-up feature propagation mechanism: it uses feature propagation modules to pass information between levels, enhancing the interaction between features at different levels. To improve the detection of objects at different scales, we constructed an efficient multiscale feature fusion module, efficient RepGFPN, in the neck of the network; its structure is shown in Figure 3.
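The two fusion paths can be sketched as follows, assuming PyTorch and, for brevity, equal channel counts across levels (real FPN/PANet necks insert convolutions between the additions; the function and argument names here are illustrative):

```python
import torch
import torch.nn.functional as F

def top_down_bottom_up(c3, c4, c5):
    """Illustrative PANet-style fusion over three backbone levels
    (c3: high-resolution/shallow, c5: low-resolution/deep); channel
    counts are assumed equal so features can be summed directly."""
    # Top-down path: upsample deep maps and fuse into shallower ones.
    p5 = c5
    p4 = c4 + F.interpolate(p5, scale_factor=2, mode="nearest")
    p3 = c3 + F.interpolate(p4, scale_factor=2, mode="nearest")
    # Bottom-up path: downsample shallow maps and fuse back upward.
    n3 = p3
    n4 = p4 + F.max_pool2d(n3, kernel_size=2)
    n5 = p5 + F.max_pool2d(n4, kernel_size=2)
    return n3, n4, n5
```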
In the efficient RepGFPN structure (Figure 3), we employed a non-shared channel setting when fusing multiscale features, which preserves the original feature information at each scale. Since the number of channels is not adjusted via convolution, downscaling and upscaling of the features are avoided, reducing information loss. At the same time, a richer feature representation is formed by stacking features from different scales along their original channels, which helps the network better understand and use multiscale information. In the Simplify Rep Block structure, a structural re-parameterization mechanism is employed to eliminate branches that are present during training but not required during inference. This reduces the model’s computational requirements and memory footprint, thereby improving inference efficiency.
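The branch-elimination idea can be illustrated with a generic two-branch example (a hedged sketch of structural re-parameterization in general, not the exact Simplify Rep Block): the 1 × 1 branch is zero-padded to a 3 × 3 kernel and folded into the 3 × 3 branch, so inference runs a single convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepBranchSketch(nn.Module):
    """Training-time block with a 3x3 and a parallel 1x1 branch; fuse()
    folds both into one 3x3 conv for inference."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x)

    @torch.no_grad()
    def fuse(self) -> nn.Conv2d:
        fused = nn.Conv2d(self.conv3.in_channels, self.conv3.out_channels,
                          3, padding=1)
        # Pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel.
        fused.weight.copy_(self.conv3.weight +
                           F.pad(self.conv1.weight, [1, 1, 1, 1]))
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        return fused
```

After training, `fuse()` yields a plain convolution whose output matches the two-branch forward pass, which is what removes the extra branch's compute and memory at inference time.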
2.2.3. Dynamic Detection Head
In YOLOv5, 1 × 1 convolutions perform classification and regression on the three scale features output by the neck. In the YOLOX model, the detection head’s classification and regression operations are decoupled, accelerating model convergence. To reduce computational complexity, YOLOv6 [25] employs a hybrid-channel strategy and reduces the two 3 × 3 convolutions in the YOLOX detection head to one, achieving lower inference latency; this is referred to as the efficient decoupled head. Each of the above detection heads operates only on feature maps of a single scale, without considering multiscale contextual features. Zhuang et al. [26] proposed a context-decoupled head that fuses multiscale features, called the TSCODE head, whose structure is shown in Figure 4. It further improves detection performance by fusing small-scale features carrying high-level semantic information for the classification task while using features from all scales for regression. However, these detection heads do not jointly provide unified scale adaptability, spatial adaptability, and multi-task support. Furthermore, after improvements to multiple modules of the network, the performance of a decoupled head is not necessarily superior to that of a coupled head; the choice should be adjusted to the actual application scenario, and the corresponding ablation experiments are described in Section 3.4.2.
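For readers unfamiliar with the coupled/decoupled distinction, the following hedged PyTorch sketch contrasts the two designs (layer widths and names are illustrative; neither block reproduces the exact YOLOv5 or YOLOX heads):

```python
import torch.nn as nn

class CoupledHead(nn.Module):
    """YOLOv5-style head: one 1x1 conv jointly predicts class scores,
    objectness and box offsets for each of `na` anchors per location."""
    def __init__(self, in_ch: int, num_classes: int, na: int = 3):
        super().__init__()
        self.pred = nn.Conv2d(in_ch, na * (num_classes + 5), 1)

    def forward(self, x):
        return self.pred(x)

class DecoupledHead(nn.Module):
    """YOLOX-style head: separate classification and regression branches,
    which speeds up convergence at some extra compute cost."""
    def __init__(self, in_ch: int, num_classes: int, na: int = 1):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 1)
        self.cls = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, na * num_classes, 1))
        self.reg = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, na * 5, 1))  # box offsets + objectness

    def forward(self, x):
        x = self.stem(x)
        return self.cls(x), self.reg(x)
```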
In X-ray prohibited item detection tasks, packages can contain various categories of prohibited items with different sizes and levels of occlusion. Therefore, the detection head must first possess scale-awareness capability. Second, the prohibited items to be detected may present different shapes and appear at any position in the image under different viewpoints, which requires spatial-awareness of the detection head. Finally, because different channel features contribute differently to different tasks (e.g., classification and regression), the detection head's attention allocation to each channel must be dynamically adjusted based on the task type; that is, its task-awareness capability needs to be enhanced. For this reason, a dynamic detection head [18] was constructed in the head section, and its structure is shown in Figure 5. The effective fusion of scale-awareness, spatial-awareness, and task-awareness was achieved by introducing attention mechanisms, and the general attention formula is expressed as:
$$W(F) = \pi(F) \cdot F$$

where, given an input tensor $F \in \mathbb{R}^{L \times S \times C}$, $L$ was the number of different scale feature maps output by the neck section, $S$ was the product of the width and height of the feature maps, $C$ was the number of channels of the feature maps, and $\pi(\cdot)$ was the attention function.
To simplify the formula, we still denote the output after each attention operation as $F$. Thus, the sequential application of the scale-aware $\pi_L$, spatial-aware $\pi_S$, and task-aware $\pi_C$ attention mechanisms can be expressed as:

$$W(F) = \pi_C\left(\pi_S\left(\pi_L(F) \cdot F\right) \cdot F\right) \cdot F$$
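Under the simplifying assumption that each attention can be treated as a multiplicative weighting (in the actual dynamic head, the spatial term is realised with deformable convolution rather than a plain elementwise product), the nested formula above corresponds to the following sketch:

```python
def dyhead_attention(F, pi_L, pi_S, pi_C):
    """Apply the three attentions in sequence, as in W(F) above.
    F: tensor of shape (L, S, C); pi_L/pi_S/pi_C are callables that
    return attention weights to be multiplied onto F (sketch only)."""
    F = pi_L(F) * F   # scale-aware: reweight the L feature levels
    F = pi_S(F) * F   # spatial-aware: reweight spatial locations
    F = pi_C(F) * F   # task-aware: reweight channels per task
    return F
```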
In the prohibited item detection task, objects can be deformed when viewed from different angles, and traditional fixed-shape convolution kernels with fixed-size receptive fields often struggle to adapt to this. We therefore require a convolution operation that can adaptively adjust the receptive field and the kernel's sampling positions to better capture object deformation. The spatial-aware attention mechanism uses deformable ConvNets v2 (DCNv2) [27] to adaptively adjust the receptive field of the convolutional kernel. DCNv2 can not only offset the sampling positions of input features but also modulate the sampling point weights. By learning the offsets, the sampling positions of the convolutional kernel on the input feature map can be adjusted adaptively, enabling the network to better accommodate deformations and local changes of prohibited items at different locations. By learning the weights of the sampling points, the network can adjust the sampling weights of convolutional kernels at different positions according to the local structures and their feature importance. The formula for spatial-aware attention is expressed as:
$$\pi_S(F) \cdot F = \frac{1}{L} \sum_{l=1}^{L} \sum_{k=1}^{K} w_{l,k} \cdot F\left(l;\, p_k + \Delta p_k;\, c\right) \cdot \Delta m_k$$

Here, $F$ was the input tensor, and $c$ denoted the $c$-th channel of the feature map. $K$ represented the number of sampled pixels in the convolution kernel, $p_k$ was the $k$-th sampled pixel in the convolution kernel, $w_{l,k}$ corresponded to its weight, $\Delta p_k$ denoted the learned offset, and $\Delta m_k$ was an importance scalar obtained via learning at $p_k$.
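A hedged sketch of such a modulated deformable convolution, using torchvision's `deform_conv2d` (the offset/mask prediction layer and its initialisation are illustrative, not the exact DCNv2 configuration used in the dynamic head):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DCNv2Sketch(nn.Module):
    """Modulated deformable conv: learned offsets shift the kernel's
    sampling points; a sigmoid mask scales each point's importance."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        # Offsets (2 per sampled point) and modulation masks (1 per point)
        # are predicted from the input itself.
        self.offset_mask = nn.Conv2d(in_ch, 3 * k * k, k, padding=k // 2)
        self.k = k

    def forward(self, x):
        om = self.offset_mask(x)
        offset, mask = om[:, :2 * self.k ** 2], om[:, 2 * self.k ** 2:]
        mask = torch.sigmoid(mask)  # importance scalar (Delta m_k) per point
        return deform_conv2d(x, offset, self.weight,
                             padding=self.k // 2, mask=mask)
```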
Scale-aware attention performs adaptive average pooling, a 1 × 1 convolution, and a hard-sigmoid activation on the $S$ and $C$ dimensions of the features, generating attention weights specific to each scale. These weights dynamically fuse features of different scales based on their semantic importance, as shown in Figure 5, which can effectively improve the performance of the model on the multiscale detection task. The scale-aware attention formula is expressed as:
$$\pi_L(F) \cdot F = \sigma\left( f\left( \frac{1}{SC} \sum_{S,C} F \right) \right) \cdot F$$

In the equation, $f(\cdot)$ represented a linear function approximated by a 1 × 1 convolution, and $\sigma(x) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)$ denoted the hard-sigmoid activation function.
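A minimal sketch of this scale-aware weighting, assuming the input has already been arranged as a (batch, L, S, C) tensor and using PyTorch's built-in Hardsigmoid as a stand-in for $\sigma$ (class and argument names are illustrative):

```python
import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    """pi_L: average over the S and C dimensions, a 1x1 conv across the
    L (level) dimension, then a hard sigmoid, giving one weight per level."""
    def __init__(self, num_levels: int):
        super().__init__()
        self.fc = nn.Conv2d(num_levels, num_levels, kernel_size=1)
        self.act = nn.Hardsigmoid()  # stand-in for the (x+1)/2 clipped form

    def forward(self, F):                      # F: (batch, L, S, C)
        w = F.mean(dim=(2, 3), keepdim=True)   # adaptive average pooling
        w = self.act(self.fc(w))               # (batch, L, 1, 1) weights
        return w * F                           # reweight each scale level
```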
The task-aware attention was inspired by dynamic ReLU [28], which adaptively learns the importance of each feature channel and reallocates the weights of individual channels based on different tasks. The task-aware attention formula is expressed as:
$$\pi_C(F) \cdot F = \max\left( \alpha^1(F) \cdot F_c + \beta^1(F),\; \alpha^2(F) \cdot F_c + \beta^2(F) \right)$$

Here, $F_c$ represented the $c$-th channel of the feature map, and $\theta(\cdot) = \left[\alpha^1, \alpha^2, \beta^1, \beta^2\right]^{T}$ denoted the hyperfunction that learns to control the activation thresholds. First, global average pooling was applied over the $S$ dimension of the feature map to compute the mean value of each channel, resulting in a fixed-size vector. This compressed feature vector was then processed through two fully connected layers, whose outputs indicated the relative importance weights of each channel in different tasks. To ensure that the task-aware attention weights take both positive and negative values, a shifted sigmoid function was used to normalise the output to the range [−1, 1]. Subsequently, these values were passed through the hyperfunction to generate the four learnable parameters $\left[\alpha^1, \alpha^2, \beta^1, \beta^2\right]$, which were used in the subsequent maxout operation to activate different channels of the input feature map. In this way, the obtained attention weights can enhance the features of specific channels and suppress those of others.
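A hedged sketch of the task-aware branch, assuming a (batch, S, C) input and illustrative layer sizes (the reduction ratio and initialisation of the real dynamic head may differ):

```python
import torch
import torch.nn as nn

class TaskAwareAttention(nn.Module):
    """pi_C: global average pooling over S, two FC layers, a shifted
    sigmoid into [-1, 1], then a channel-wise maxout with the four
    learned parameters (alpha1, alpha2, beta1, beta2)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, 4 * channels))

    def forward(self, F):                            # F: (batch, S, C)
        v = F.mean(dim=1)                            # GAP over S -> (batch, C)
        theta = 2 * torch.sigmoid(self.fc(v)) - 1    # shifted to [-1, 1]
        a1, a2, b1, b2 = theta.chunk(4, dim=1)       # four params per channel
        a1, a2, b1, b2 = (t.unsqueeze(1) for t in (a1, a2, b1, b2))
        # Maxout over the two linear activations of each channel.
        return torch.max(a1 * F + b1, a2 * F + b2)
```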
2.2.4. Bounding Box Regression Loss Function
The loss function is a metric that measures the discrepancy between model predictions and ground truth; it is used to evaluate the model’s performance and guide parameter updates. The loss function of YOLOv5 consists of classification loss, confidence loss, and bounding box regression loss. To improve the precision and recall of detection and accelerate bounding box regression, we introduced WIOUv3 [19] as the bounding box regression loss. In contrast to the static focusing mechanism of Focal-EIOU [29], WIOUv3 adopts a dynamic, non-monotonic focusing mechanism, which can more effectively balance the contribution of high-quality and low-quality samples to the loss function. Its formula is represented as follows:
$$L_{WIOUv3} = r \cdot L_{WIOUv1}, \qquad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$$

$$L_{WIOUv1} = R_{WIOU} \cdot L_{IoU}, \qquad R_{WIOU} = \exp\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\left( W_g^2 + H_g^2 \right)^{*}} \right)$$

$$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \in [0, +\infty)$$

where $x_{gt}$ and $y_{gt}$ denoted the coordinates of the centre of the ground truth bounding box, and $x$ and $y$ represented the coordinates of the centre of the predicted bounding box. Furthermore, $W_g$ and $H_g$ stood for the width and height of the minimum enclosing rectangle of the predicted and ground truth boxes. The superscript $*$ indicated that, during the computation of the loss within the current batch, the marked quantity was detached from the computation graph, making it devoid of gradient information. $\overline{L_{IoU}}$ was the exponential running average of $L_{IoU}$ with momentum factor $m$, and $\beta$ was the ratio between $L_{IoU}^{*}$ and $\overline{L_{IoU}}$, which characterised the outlier degree used to describe anchor box quality; a smaller outlier degree implies higher anchor box quality. $r$ was the nonmonotonic focusing coefficient computed from $\beta$, with hyperparameters $\alpha$ and $\delta$ fixed during training. The WIOUv3 loss not only reduced the competitiveness of high-quality anchor boxes but also mitigated the harmful gradients generated by low-quality samples. This allows the network to focus more on anchor boxes of moderate quality, thereby enhancing the overall accuracy of object detection.
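The following is a sketch of how WIOUv3 can be computed for (x1, y1, x2, y2) boxes under these definitions; `iou_mean` plays the role of $\overline{L_{IoU}}$, and the hyperparameter defaults follow the values reported in the Wise-IoU paper ($\alpha$ = 1.9, $\delta$ = 3), though the function and argument names are illustrative:

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=1.9, delta=3.0, momentum=0.01):
    """Sketch of the WIOUv3 loss for (N, 4) boxes in (x1, y1, x2, y2) form.
    `iou_mean` is the running average of L_IoU carried across batches
    (updated in place here); alpha/delta are the focusing hyperparameters."""
    # Intersection and IoU.
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_p = (pred[:, 2:] - pred[:, :2]).prod(dim=1)
    area_t = (target[:, 2:] - target[:, :2]).prod(dim=1)
    iou = inter / (area_p + area_t - inter + 1e-7)
    l_iou = 1.0 - iou

    # Distance penalty R_WIOU; the enclosing-box term is detached (the * above).
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2)
                       / (wg ** 2 + hg ** 2).detach())

    # Outlier degree and nonmonotonic focusing coefficient.
    beta = l_iou.detach() / iou_mean
    r = beta / (delta * alpha ** (beta - delta))

    # Update the exponential running average for the next batch.
    iou_mean.mul_(1 - momentum).add_(momentum * l_iou.mean().detach())
    return (r * r_wiou * l_iou).mean()
```

In use, `iou_mean` would be initialised once (e.g., `iou_mean = torch.ones(1)`) and carried across batches so that the outlier degree reflects the running state of training.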