1. Introduction
Deep learning models for specific target detection have been widely applied, especially in the transportation industry [1,2]. In security inspection, there is also a need for deep learning models to assist or even replace manual labor. In fact, many new security machines already integrate contraband detection. Since the targets to be detected are relatively fixed, deep learning models perform well in this scenario.
However, deep learning models are not yet widely used to assist security checks. The main reason is that a deep learning object detection model requires a single image as input. Although this function is not difficult to implement, the older security detectors at customs and express delivery stations were not designed to transmit a single package image to a deep learning algorithm server. As a result, using a deep learning model for contraband detection currently requires replacing the detector with a new one, which wastes resources.
At present, there are two main solutions to this problem:
The video image of the security detector is input to the algorithm server in a frame-by-frame or frame-skipping manner;
Through the combination of target extraction, target tracking and keyframe detection, the package image in the keyframe is input to the algorithm server.
In the first method, the same package appears in multiple video frames and is therefore detected repeatedly, which leads to a large GPU overhead and higher detection costs. In addition, packages are incomplete in some frames, which is prone to false detection.
The second method ensures, as far as possible, that each package is detected only once, but the commonly used grayscale-based target extraction method is not accurate and tends to produce incomplete package images.
In this scenario, there are generally two ways to extract packages: a supervised object detection algorithm based on deep learning, and an unsupervised detection algorithm based on binarization. The deep learning approach gives accurate detection results, but it requires the targets to have a highly uniform shape. Because the model is supervised, a convergent model cannot be trained when the shapes of the targets vary and are not fixed, so the targets cannot be detected. Deep learning models also place higher demands on hardware computing power. In the binarization approach, the package is separated from the background by binarizing the image, and the location of the package area is obtained by contour detection. Compared with deep learning object detection, this approach requires less computation and places no constraint on the shape of the object to be extracted. It is a lightweight unsupervised detection algorithm that is widely used for foreign object detection on pipelines and roads. Its limitation is that the background features must be fixed, so it cannot be used when the background changes. In addition, common binarization methods may extract the target incompletely. This paper improves the binarization method to raise extraction precision.
A common approach binarizes the X-ray image, uses a set threshold to remove the background information, and finally extracts the package image using contour detection [3,4]. This contour feature extraction method depends on the relationship between the background gray level and the package gray level, which imposes certain limitations. When the background gray level lies between the maximum and minimum gray levels of the package, the method cannot work effectively because no threshold separates the background from the package. Mei [5] used edge features to extract the contours of moving objects. Wu et al. [6] used an edge detection operator to extract the golden region of the image. Their work indicated that image feature extraction by an edge detection operator is more sensitive to the textured regions of the image. However, the features extracted from weakly textured regions cannot form closed contours, which leads to errors in the subsequent contour detection and degrades target extraction. Tian and Liu [7] used a binarization method based on the LoG edge detection operator, which works better when the color of cardboard boxes, backpacks and other items is similar to the background color. This is because binarization based on an edge detection operator is more sensitive to gradient changes in the image and can accurately detect the boundary between the object edge and the background. For large image areas with uniform color and little texture, however, the edge detection operator does not detect well.
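The limitation of a single gray threshold described above can be illustrated with a minimal sketch. The function name `separating_threshold` and the gray levels are hypothetical and chosen only for illustration; a global threshold exists only when the background gray level lies outside the range of gray levels found in the package.

```python
from typing import Optional, Sequence


def separating_threshold(package_grays: Sequence[int],
                         background_gray: int) -> Optional[float]:
    """Return a gray threshold separating background from package, or None.

    A single global threshold exists only if the background gray level lies
    strictly outside the range of gray levels found in the package.
    """
    lo, hi = min(package_grays), max(package_grays)
    if background_gray < lo:
        return (background_gray + lo) / 2  # package is brighter than background
    if background_gray > hi:
        return (background_gray + hi) / 2  # package is darker than background
    return None  # background falls inside the package range: no threshold works


# Hypothetical gray levels (0-255), chosen only for illustration.
print(separating_threshold([40, 90, 150], 230))  # 190.0
print(separating_threshold([40, 90, 250], 230))  # None
```

In the second call the package contains a region brighter than the background, so any threshold that removes the background also removes part of the package.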
The threshold method is a classical target extraction algorithm that has been widely used in many fields [8,9,10,11]. The method determines a threshold with an optimization algorithm and uses the final value as the dividing line: pixels above the threshold are assigned to the changed class and pixels below it to the unchanged class [12]. The core of the threshold method is therefore the determination of the threshold value. The maximum expected value algorithm referred to below is the expectation-maximization (EM) algorithm. Building on segmentation with automatic selection of multiple thresholds, Li [13] used a marker-based watershed algorithm on the image histogram to obtain multiple thresholds. On the basis of this threshold segmentation, the watershed algorithm is used to segment the image and extract the region of interest in the X-ray image. Bruzzone [14] applied the EM algorithm with a Gaussian model to the analysis of the difference image. The difference image is modeled so that the changed pixels and the unchanged pixels each obey a Gaussian distribution, and the model parameters and the threshold value are obtained through multiple EM iterations. Bazi et al. [15] applied the KI (Kittler–Illingworth) threshold method to the difference image analysis algorithm. The KI threshold method is based on the Bayes theory of minimum error. Later, researchers improved the model to produce a generalized KI threshold algorithm [16]. The threshold method is simple and fast, but it cannot use spatial information effectively.
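As an illustration of how a minimum-error threshold can be selected, the following is a minimal sketch of the KI criterion on a gray-level histogram. It uses the commonly cited form of the criterion, J(t) = 1 + 2(P1 ln σ1 + P2 ln σ2) − 2(P1 ln P1 + P2 ln P2); the cited works may use a variant, and the function name `ki_threshold` and the toy histogram are assumptions made for this example.

```python
import math
from typing import List


def ki_threshold(hist: List[int]) -> int:
    """Minimum-error (Kittler-Illingworth) threshold for a gray-level histogram.

    Gray levels <= t form class 1 and the rest form class 2. Each class is
    modeled as a Gaussian, and the t that minimizes J(t) is returned.
    Returns -1 if no threshold yields two classes with nonzero variance.
    """
    levels = range(len(hist))
    total = sum(hist)
    best_t, best_j = -1, math.inf
    for t in range(len(hist) - 1):
        n1 = sum(hist[: t + 1])
        n2 = total - n1
        if n1 == 0 or n2 == 0:
            continue
        m1 = sum(g * hist[g] for g in levels[: t + 1]) / n1
        m2 = sum(g * hist[g] for g in levels[t + 1:]) / n2
        v1 = sum(hist[g] * (g - m1) ** 2 for g in levels[: t + 1]) / n1
        v2 = sum(hist[g] * (g - m2) ** 2 for g in levels[t + 1:]) / n2
        if v1 <= 0 or v2 <= 0:
            continue  # a class with zero variance has no usable Gaussian fit
        p1, p2 = n1 / total, n2 / total
        # 2 * ln(sigma) equals ln(variance), so the variances are used directly.
        j = (1 + p1 * math.log(v1) + p2 * math.log(v2)
             - 2 * (p1 * math.log(p1) + p2 * math.log(p2)))
        if j < best_j:
            best_t, best_j = t, j
    return best_t


# Toy bimodal histogram over 16 gray levels (illustrative values only).
hist = [0, 2, 8, 2, 0, 0, 0, 0, 0, 0, 0, 3, 10, 3, 0, 0]
t = ki_threshold(hist)
print(t)  # falls in the empty valley between the two modes (levels 3 to 10)
```

Like the other threshold methods discussed here, the sketch works on the histogram alone, which is why it cannot exploit spatial information.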
This paper focuses on the accurate extraction of luggage package targets. An edge-sensitive multi-channel background difference (ES-MBD) method is proposed to achieve more accurate image binarization. The Suzuki contour detection algorithm is then applied to the binary image obtained by the ES-MBD method, the detected contours are matched against the package box areas, and a judgment is made. The detection result is output as the key target of the X-ray image.
The main contributions of this paper are as follows:
- (1)
In this paper, the ES-MBD method is proposed. It solves the problems of low detection efficiency and high GPU overhead due to the unfixed shape of luggage packages, stacking, and occlusion in the video. The ES-MBD method is able to improve the detection efficiency and extract a more complete image of luggage packages.
- (2)
The ES-MBD method combines the background difference binarization and edge detection operator binarization, which solves the problem of the binarization method of background difference being insensitive to texture features, while the binarization method based on edge detection operator is insensitive to smooth regions. Through experimental comparison, the precision rate of the ES-MBD binarization method reaches 97.3% and the recall rate reaches 96.5%, and the ES-MBD method has obvious advantages in the luggage package target extraction of X-ray images.
- (3)
The Sobel operator optimized by local gradient enhancement performs edge detection on the grayscale image. The loss of local information can be reduced, and a better detection effect can be obtained. Using the Suzuki algorithm, the binary image contour obtained by the ES-MBD method can be detected. The proposed ES-MBD method can solve the problem of information loss in the traditional binarization method and preserves the information of the insensitive area while reducing the noise.
The rest of the paper is organized as follows:
Section 2 summarizes the object extraction method of the X-ray image.
Section 3 analyzes and compares different binarization methods.
Section 4 proposes an optimized Sobel operator with local gradient enhancement and presents the ES-MBD method.
Section 5 introduces the key target tracking and extraction process.
Section 6 provides experimental results and analysis.
Section 7 concludes the paper.
5. X-ray Image Key Target Tracking and Extraction
5.1. Suzuki Contour Detection
Contour detection is a method of acquiring the connected areas of a binarized image, and the contour detection algorithm proposed by Satoshi Suzuki is commonly used [34]. The four boundaries of the image are called the frame of the image: an image with width w and height h is regarded as an h×w matrix of pixels, and the first and last rows (rows 1 and h) together with the first and last columns (columns 1 and w) of the matrix constitute the frame. A pixel with a gray value of zero is called a 0-pixel, and a pixel with a gray value of one is called a 1-pixel. The algorithm assumes that the frame of the binarized image consists of 0-pixels, so any 1-pixels on the frame of the input image are changed to 0-pixels.
Figure 19 shows an example of Suzuki’s algorithm, in which pixels with the same absolute value belong to the same border. The relationship between the borders is recorded on the right side of the figure, where ob is an outer border, hb is a hole border, and the parent border of a border is the border of the layer that directly encloses it.
For the pre-extraction of X-ray images, the foreground to be extracted consists of the outer borders whose parent border is the frame in the contour detection results.
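The frame convention and the extraction of outer regions can be sketched as follows. This is not Suzuki's border-following algorithm: as a simpler stand-in, it labels 8-connected regions of 1-pixels by flood fill and returns their bounding boxes. Unlike a contour hierarchy, it does not distinguish regions nested inside the hole of another region. All names are illustrative assumptions.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def clear_frame(img: List[List[int]]) -> List[List[int]]:
    """Return a copy whose first/last rows and columns are set to 0-pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for x in range(w):
        out[0][x] = out[h - 1][x] = 0
    for y in range(h):
        out[y][0] = out[y][w - 1] = 0
    return out


def component_boxes(img: List[List[int]]) -> List[Box]:
    """Bounding boxes of 8-connected regions of 1-pixels, found by flood fill."""
    img = clear_frame(img)
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if img[y0][x0] != 1 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes


binary = [
    [1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(component_boxes(binary))  # [(1, 1, 3, 3), (5, 2, 6, 3)]
```

The 1-pixel in the top-left corner lies on the frame and is cleared first, so it does not enlarge the first box. With OpenCV available, `cv2.findContours` with the `cv2.RETR_EXTERNAL` retrieval mode returns only the outermost contours.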
5.2. X-ray Image Key Target Tracking
- (1)
Target tracking algorithm
The binarization method can extract the targets in each frame. To find the correspondence between the targets in adjacent frames, a target tracking algorithm is needed that takes the extracted package boxes as input. The steps are as follows:
Step 1: The frame in which packages appear for the first time is taken as the initial frame. Each package detected in this frame is assigned a unique number and stored in the cache;
Step 2: When a new frame arrives, each package detected in it is compared with all the packages in the cache. The best-matching cached package is selected, the two are considered to be the same package, the detected package takes the number of the cached package, and the cache entry of the package is updated;
Step 3: If a package detected in the new frame of Step 2 does not match any package in the cache, it is considered a new package, assigned a new unique number, and added to the cache;
Step 4: If a cached package does not match any package detected in the new frame of Step 2, it is deleted from the cache.
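Steps 1 to 4 can be sketched as a small cache-based tracker. This is a minimal illustration under stated assumptions, not the authors' implementation: matching is greedy, the default score is plain IoU, and the class name `PackageTracker` and the acceptance threshold `min_score` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class PackageTracker:
    def __init__(self, score: Callable[[Box, Box], float] = iou,
                 min_score: float = 0.3):
        self.score = score
        self.min_score = min_score  # hypothetical acceptance threshold
        self.cache: Dict[int, Box] = {}
        self.next_id = 0

    def update(self, detections: List[Box]) -> Dict[int, Box]:
        """Process one frame and return {package number: box} for that frame."""
        new_cache: Dict[int, Box] = {}
        unmatched = dict(self.cache)
        for box in detections:
            best_id, best = None, self.min_score
            for pid, cached in unmatched.items():
                s = self.score(box, cached)
                if s >= best:
                    best_id, best = pid, s
            if best_id is None:   # Steps 1 and 3: new package, new number
                best_id = self.next_id
                self.next_id += 1
            else:                 # Step 2: same package, refresh its box
                del unmatched[best_id]
            new_cache[best_id] = box
        self.cache = new_cache    # Step 4: unmatched cached packages are dropped
        return dict(new_cache)


tracker = PackageTracker()
print(tracker.update([(100, 10, 160, 60)]))                      # package 0 appears
print(tracker.update([(90, 10, 150, 60), (180, 20, 230, 70)]))   # 0 moves, 1 appears
print(tracker.update([(170, 20, 220, 70)]))                      # 0 has left, 1 remains
```

The `score` parameter accepts any box similarity function with the same signature, so one of the IoU variants discussed in this section could be passed instead.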
- (2)
The selection of loss function
The function used to evaluate the degree of package matching is called the loss function of the tracking algorithm. The degree of overlap between package boxes is used as the loss function, and the intersection over union (IoU) and its variants are adopted.
IoU is the most commonly used index in target detection, and its definition is shown in Formula (10):
$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|} \tag{10}$$
where $A$ and $B$ are the two boxes being compared.
IoU reflects the degree of overlap of two targets and is scale invariant. However, IoU has some disadvantages.
If two objects do not intersect, then IoU = 0 by definition, regardless of how far apart they are, so IoU cannot reflect the distance between them. In addition, IoU cannot accurately reflect how two boxes overlap. As shown in Figure 20, the IoUs are equal in all three cases, but the degree of coincidence is not the same: the graph on the right has the best regression and the graph on the left has the worst.
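The first disadvantage can be checked numerically with a minimal sketch. The box format `(x1, y1, x2, y2)` and the sample coordinates are assumptions made for this example.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


box = (0, 0, 10, 10)
print(iou(box, (5, 0, 15, 10)))     # overlapping boxes: 50 / 150
print(iou(box, (11, 0, 21, 10)))    # disjoint, 1 pixel apart: 0.0
print(iou(box, (100, 0, 110, 10)))  # disjoint, far apart: also 0.0
```

The last two calls return the same value although the second box is much closer in one case than in the other.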
GIoU was proposed by Rezatofighi et al. at CVPR 2019 [35]. Its definition is shown in Formula (11):
$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|} \tag{11}$$
where $A$ and $B$ are the two boxes and $C$ is the smallest box enclosing both of them.
First, the smallest enclosing region of the two boxes and the IoU are calculated; then, the proportion of the enclosing region that belongs to neither box is computed; finally, this proportion is subtracted from IoU to obtain GIoU.
Like IoU, GIoU is a distance measure and is insensitive to scale. GIoU is a lower bound of IoU, and the two become equal as the two boxes approach complete coincidence. IoU takes values in [0, 1], whereas GIoU has a symmetric range of [−1, 1]: the maximum value 1 is taken when the two boxes coincide, and the value approaches the minimum −1 when they have no intersection and are infinitely far apart, so GIoU is a very good distance measure. Unlike IoU, which only considers the overlap area, GIoU also considers the non-overlapping areas, so it better reflects the degree of overlap of the two boxes.
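A minimal sketch of GIoU for axis-aligned boxes follows, assuming the `(x1, y1, x2, y2)` box format and illustrative coordinates. It shows that, unlike IoU, GIoU distinguishes near and far disjoint boxes.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the share of the enclosing box not covered."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both a and b.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


box = (0, 0, 10, 10)
print(giou(box, box))                 # identical boxes: 1.0
print(giou(box, (11, 0, 21, 10)))     # disjoint and near: slightly below 0
print(giou(box, (100, 0, 110, 10)))   # disjoint and far: much closer to -1
```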
To make target box regression more stable, DIoU [36] takes into account the distance, overlap rate and scale between the target and the anchor. Unlike IoU and GIoU, it does not suffer from divergence during training. Its definition is shown in Formula (12):
$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} \tag{12}$$
where
$b$: the center point of the prediction box;
$b^{gt}$: the center point of the real box;
$\rho(b, b^{gt})$: the Euclidean distance between the two center points;
$c$: the diagonal length of the smallest enclosing region that contains both the predicted box and the real box.
Similar to GIoU, DIoU provides a direction of movement for the bounding box when it does not overlap with the target box. DIoU directly minimizes the distance between the two boxes, so it converges much faster than GIoU. When one box contains the other, or when the two boxes are aligned horizontally or vertically, DIoU still makes the regression very fast, while GIoU almost degenerates into IoU. DIoU can also replace the normal IoU criterion in non-maximum suppression (NMS) to make the NMS results more reasonable and effective.
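The inclusion case can be illustrated with a minimal sketch of DIoU, written as IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box. The box format and coordinates are assumptions for this example. When a small box lies inside a large one, IoU and GIoU are the same wherever the small box sits, while DIoU still rewards centering.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def diou(a: Box, b: Box) -> float:
    """Distance-IoU: IoU minus normalized squared distance between centers."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Squared Euclidean distance between the two box centers.
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
        + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both a and b.
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    return inter / union - rho2 / c2


outer = (0, 0, 10, 10)
print(diou(outer, (4, 4, 6, 6)))  # inner box centered: equals IoU (about 0.04)
print(diou(outer, (0, 0, 2, 2)))  # inner box in a corner: about -0.12
```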
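DIoU does not consider the aspect ratio, which is one of the three elements of bounding box (bbox) regression, so CIoU is proposed on the basis of DIoU. Its definition is shown in Formula (13):
$$\mathrm{CIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \tag{13}$$
where $\alpha = \dfrac{v}{(1 - \mathrm{IoU}) + v}$ is a weighting coefficient and $v$ is used to measure the similarity of the aspect ratio, as shown in Formula (14):
$$v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2 \tag{14}$$
Here $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the predicted box and the real box, respectively.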
The above four loss functions are used to judge the degree of coincidence of two targets in target detection and related fields. Since IoU has many shortcomings, the DIoU and CIoU loss functions were introduced to improve target box regression. Target tracking is a relatively simple use case, for which GIoU is more suitable.
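For completeness, CIoU can be sketched in the same style. The sketch assumes the standard CIoU form, in which an aspect-ratio term weighted by a coefficient is subtracted from DIoU; the box format and coordinates are illustrative, and both boxes must have positive width and height.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) with x1 < x2, y1 < y2


def ciou(a: Box, b: Box) -> float:
    """Complete IoU: DIoU minus a weighted aspect-ratio consistency term."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    iou = inter / union
    rho2 = (((a[0] + a[2]) - (b[0] + b[2])) / 2) ** 2 \
        + (((a[1] + a[3]) - (b[1] + b[3])) / 2) ** 2
    c2 = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    # v measures how different the two aspect ratios are (0 when equal).
    v = (4 / math.pi ** 2) * (
        math.atan((b[2] - b[0]) / (b[3] - b[1]))
        - math.atan((a[2] - a[0]) / (a[3] - a[1]))
    ) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # avoids 0/0 for identical boxes
    return iou - rho2 / c2 - alpha * v


square = (0, 0, 10, 10)
print(ciou(square, (20, 0, 30, 10)))  # same aspect ratio: equals DIoU (about -0.4)
print(ciou(square, (0, 0, 20, 5)))    # different aspect ratio: below DIoU
```

In the second call DIoU alone would be 1/3 − 0.0625; the aspect-ratio term lowers the score further.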
5.3. Extraction of Key Targets in X-ray Images
The detection and tracking algorithm can accurately locate each package in each frame. To crop each package completely and without repetition, it is necessary to extract the key target of the video [37,38,39]. That is, during the time between a package entering the picture and leaving it, a frame in which the package is completely displayed is selected as the key target of the package, and the package area of that frame is cut out as the image of the package. Since the X-ray machine usually moves the package in a fixed direction, assuming that the package moves from right to left, the extraction method is as described below.
A trigger line is set at a fixed distance from the right side of the image. For any package $p$ and any frame $f$ containing $p$, $s(p, f)$ represents the position relationship between the right edge of the package box and the trigger line: $s(p, f) = 1$ when the right edge of the package box is on the right side of the trigger line; otherwise, $s(p, f) = 0$.
We let $f_1, f_2, \dots, f_n$ be all the video frames in which a package $p$ appears in the picture, sorted in chronological order, that is, the next frame of $f_i$ is $f_{i+1}$.
Since the package enters from the right, before the package passes through the trigger line, as shown in Figure 21a, the right boundary of the package box must be on the right side of the trigger line, and obviously there is $f_i$ that makes
$$s(p, f_i) = 1$$
As the package moves to the left, as shown in Figure 21b, after the package passes through the trigger line, there must be $f_k$ that makes
$$s(p, f_{k-1}) = 1, \quad s(p, f_k) = 0$$
where $f_k$ is the first frame in which the right edge of the package box has passed the trigger line, and $f_k$ is taken as the key frame of the package, used to obtain the complete image of the package.
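The key frame rule can be sketched as a scan over the right edge of a tracked package box in successive frames. The function name `key_frame_index`, the pixel coordinates and the trigger position are hypothetical; x grows to the right and the package moves from right to left.

```python
from typing import Optional, Sequence


def key_frame_index(right_edges: Sequence[float],
                    trigger_x: float) -> Optional[int]:
    """Index of the first frame whose right edge has crossed the trigger line.

    right_edges[i] is the x coordinate of the right edge of the package box in
    the i-th frame of the package, in chronological order. Returns None while
    the package has not yet crossed the line.
    """
    was_right_of_line = False
    for i, x in enumerate(right_edges):
        if x > trigger_x:
            was_right_of_line = True   # package has not fully entered yet
        elif was_right_of_line:
            return i                   # first frame after crossing the line
    return None


# Hypothetical track in a 640-pixel-wide image with the trigger line at x = 590.
edges = [640, 640, 630, 600, 570, 540]
print(key_frame_index(edges, 590))  # 4
```

Because the trigger line sits near the right side of the image, the whole package is inside the picture once its right edge has crossed the line.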
The X-ray package detection, tracking and extraction process is shown in Figure 22.
Step 1: The ES-MBD method is used to process the input video frame: its multi-channel background difference is calculated, and the Sobel operator is applied to the gray image. The two results are binarized and denoised separately and then merged. The merged result is dilated morphologically to obtain the binarized image.
Step 2: The Suzuki algorithm is used to detect the contour of the binary image. The outer boundary whose parent boundary is the frame in the result is selected as the detected package box area.
Step 3: The package box region obtained in Step 2 is matched with the package in the cache, and the cache is updated. At the same time, the key frame judgment is carried out for the successfully matched package.
Step 4: If a package passes the key frame judgment in Step 3, its package box area is cropped and output as the image of the package.
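Step 1 can be sketched roughly as follows. This is a simplified illustration and not the ES-MBD method itself: it uses a plain Sobel operator without local gradient enhancement, omits the denoising stage, and the thresholds `diff_thresh` and `edge_thresh` as well as all function names are assumptions.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Mask = List[List[int]]


def background_difference_mask(frame, background, thresh: int) -> Mask:
    """1 where any color channel differs from the background by more than thresh."""
    return [[1 if max(abs(c - b) for c, b in zip(p, q)) > thresh else 0
             for p, q in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def sobel_mask(gray: List[List[int]], thresh: int) -> Mask:
    """1 where the Sobel gradient magnitude |gx| + |gy| exceeds thresh."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if abs(gx) + abs(gy) > thresh:
                out[y][x] = 1
    return out


def dilate(mask: Mask) -> Mask:
    """Morphological dilation with a 3x3 square structuring element."""
    h, w = len(mask), len(mask[0])
    return [[1 if any(mask[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))) else 0
             for x in range(w)] for y in range(h)]


def binarize_frame(frame, background, diff_thresh: int = 30,
                   edge_thresh: int = 200) -> Mask:
    """Merge the background-difference and edge masks, then dilate."""
    gray = [[sum(p) // 3 for p in row] for row in frame]
    diff = background_difference_mask(frame, background, diff_thresh)
    edge = sobel_mask(gray, edge_thresh)
    merged = [[d | e for d, e in zip(drow, erow)] for drow, erow in zip(diff, edge)]
    return dilate(merged)


# 7x7 white background with a 3x3 object that differs mainly in one channel.
background = [[(255, 255, 255)] * 7 for _ in range(7)]
frame = [row[:] for row in background]
for y in range(2, 5):
    for x in range(2, 5):
        frame[y][x] = (250, 250, 200)
mask = binarize_frame(frame, background)
print(sum(map(sum, mask)))  # 25: the 3x3 object grown to 5x5 by dilation
```

In this toy frame the object differs from the background mainly in one channel, so the per-channel difference detects it even though its gray-level contrast is too weak for the edge mask at the chosen threshold.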
6. Experimental Results and Analysis
Several X-ray machine security videos are selected, and different binarization methods are used to compare the extraction of key targets. There are 113 packages in the video.
Figure 23 shows some package binarization images obtained by each method.
As can be seen in Figure 23, many packages are incomplete with the method based on gray binarization. Some packages are missing or truncated with the methods based on background difference and on Sobel operator binarization, while the method based on ES-MBD binarization still maintains the integrity of packages after removing noise.
The cropped package image in the key target is compared with the actual package image: if the cropped package image is complete, the package is considered to be successfully detected. In addition, if adjacent packages are detected as a single package by the algorithm, the detection fails. The detection results of the different methods are shown in Table 2.
The package detection problem in this paper is not a typical binary classification problem. In typical binary classification, the numbers of both P (positive) and N (negative) samples are determined. Here, packages are treated as the positive class and the background as the negative class, but package detection does not set a fixed value for N, so it is an atypical binary classification problem. The algorithm is evaluated with the following metrics: precision, recall, F1-score and accuracy.
In this paper, the algorithm detection results are evaluated using the confusion matrix which includes the following four values:
True Positive (TP): Positive samples are detected as positive samples, i.e., packages are detected as packages, indicating the number of correctly detected packages.
False Positive (FP): Negative samples are detected as positive samples, i.e., the background is detected as packages, indicating the number of misdetected packages.
False Negative (FN): Positive samples are detected as negative samples, i.e., packages are detected as background, indicating the number of missed packages.
True Negative (TN): Negative samples are detected as negative samples, i.e., the background is detected as background. Since the algorithm only detects packages, this term is always zero.
The confusion matrix allows us to calculate the following evaluation metrics:
Precision represents the proportion of detected positive samples that are correct:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Recall represents the proportion of all input positive samples that are detected:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
F1-score is the harmonic mean of precision and recall, used to avoid a situation where one is high and the other is low. The higher the F1-score, the better the algorithm works:
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
Accuracy represents the proportion of positive and negative samples that the algorithm detects correctly overall. As this algorithm only focuses on the detection effect of positive samples, this metric is only for reference:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
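The four metrics can be computed from the confusion matrix counts with a minimal sketch. The counts below are illustrative and not taken from the tables of this paper; they were chosen so that TP + FN equals the 113 packages in the videos.

```python
from typing import Dict


def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0) -> Dict[str, float]:
    """Precision, recall, F1-score and accuracy from confusion matrix counts.

    tn defaults to 0 because the algorithm only detects packages.
    Assumes at least one detection and at least one true package.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Illustrative counts only: 109 correct detections, 3 false detections, 4 misses.
for name, value in detection_metrics(tp=109, fp=3, fn=4).items():
    print(f"{name}: {value:.3f}")
```

Because TN is always zero here, accuracy can never exceed precision or recall, which is one reason it is treated as a reference value only.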
The evaluation of the detection effect of each method is shown in Table 7.
As can be seen in Table 7, the precision rate of the gray binarization method is 47.8%, the recall rate is 48.7%, and the F1-score is 0.482; its overall effect is the worst among the four binarization methods. Because of its simple structure, the limited image information it uses and its sensitivity to background noise, the traditional gray binarization method cannot completely extract the outline of the luggage package in the X-ray image. Accuracy is a global metric that is related to both positive and negative samples; because luggage package detection in this paper does not focus on negative samples, accuracy is not discussed in the result analysis. The precision rate of the background difference binarization method is 74.8%, the recall rate is 78.8%, and the F1-score is 0.767. The main problem of the background difference binarization method is that it cannot detect some edge areas of the package whose color is not distinct, resulting in incomplete packages. The precision rate of the Sobel operator binarization method is 51.9%, the recall rate is 61.9%, and the F1-score is 0.565. The Sobel operator binarization method is insensitive to large homochromatic package regions with little texture, so many packages are split into multiple parts, producing a very high number of false detections. The ES-MBD method combines background difference binarization with Sobel operator binarization, which avoids the shortcomings of both. Its precision rate reaches 97.3% and its recall rate reaches 96.5%. The results show that the ES-MBD method has a clear advantage in detection.
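The F1-scores quoted above follow directly from the reported precision and recall rates, which can be checked with a short sketch. The helper name `f1_score` is an assumption; the input rates are the ones stated in the text.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall rates as reported in the text for each method.
reported = {
    "gray binarization": (0.478, 0.487),
    "background difference": (0.748, 0.788),
    "Sobel operator": (0.519, 0.619),
    "ES-MBD": (0.973, 0.965),
}
for method, (p, r) in reported.items():
    print(f"{method}: F1 = {f1_score(p, r):.3f}")
```

The first three values reproduce the F1-scores of 0.482, 0.767 and 0.565 given in the text; the ES-MBD value computed from the rounded rates is about 0.969.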