DETR achieves detection performance comparable to the SOTA detection pipelines on the COCO dataset [52], in spite of a much simpler pipeline. However, its detection results for small objects are not as good as those for large ones, because the attention mechanism tends to model the overall information of the whole image rather than local details.
Figure 3 shows that a large number of defects occupy less than a tenth of the entire image, which makes veneer defect detection a challenging task for the original DETR. Furthermore, although DETR requires far fewer manually designed components, such as anchors and non-maximum suppression, a new problem emerges: the format of the positional encoding must be predefined. However, this setting cannot be guaranteed to be optimal, since the selected format is subjective and depends on experience.
According to the analyses above, this paper proposes a new detection pipeline based on DETR, namely the multiscale position encoding net detector (MPEND). The overall architecture is presented in Figure 4 (where ResX denotes the Xth residual unit, Con denotes a convolutional layer, ReLU denotes the rectified linear activation, BN denotes batch normalization and Pos denotes positional encoding). MPEND includes three parts: the feature abstraction backbone, the multiscale position encoding net (MPEN) and a revised DETR detector. In the first step, following the original DETR, the feature abstraction module is constructed with a residual network. The input image is downsampled 32 times, and three feature maps of different sizes are obtained. In the second step, the obtained feature maps are integrated with positional encoding. Instead of designing the positional encoding manually, MPEND utilizes a multiscale position encoding net to learn position information from the input image itself. In the last step, the feature maps coupled with the positional encoding are used as the input of the Transformer encoder. The detection procedure is performed by DETR with a modified loss function.
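To make the data flow concrete, the sketch below shows the three-stage pipeline at the level of tensor shapes only. The module interfaces and the dummy stand-ins are illustrative assumptions; the actual backbone, MPEN and DETR detector are detailed in the following subsections.

```python
import torch
import torch.nn as nn

class MPEND(nn.Module):
    """Shape-level sketch of the three-stage pipeline described above.
    The three components are passed in as callables; their internals are
    covered in Sections 4.1-4.3."""
    def __init__(self, backbone, mpen, detr):
        super().__init__()
        self.backbone, self.mpen, self.detr = backbone, mpen, detr

    def forward(self, img):
        feats = self.backbone(img)    # step 1: three feature maps (1/8, 1/16, 1/32 resolution)
        pos = self.mpen(feats)        # step 2: learned positional encodings, one per scale
        return self.detr(feats, pos)  # step 3: Transformer encoder/decoder + prediction heads

# Dry run with dummy stand-ins, just to show the data flow for a 512 x 512 input.
model = MPEND(
    backbone=lambda x: [torch.randn(1, 256, s, s) for s in (64, 32, 16)],
    mpen=lambda feats: [torch.randn_like(f) for f in feats],
    detr=lambda feats, pos: {"logits": torch.randn(1, 100, 5), "boxes": torch.randn(1, 100, 4)},
)
out = model(torch.randn(1, 3, 512, 512))
```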
4.1. Backbone
The backbone of MPEND is constructed with the classical residual structure [53] and contains 15 convolutional layers; it is named RES15. A total of five residual structures are used, each of which has three convolutional layers followed by a max pooling layer. The main purpose of the max pooling layer is to downsample the input feature. In each convolutional layer, every convolutional unit is followed by batch normalization, and all activation functions are ReLU. Only 1 × 1 and 3 × 3 convolutional kernels are adopted, following the experience of VGG. The details of the backbone are presented in Table 2.
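The following PyTorch sketch illustrates one way such a backbone could be assembled. The channel widths, the 1 × 1/3 × 3 ordering inside each unit and the form of the skip connection are assumptions made for illustration; Table 2 gives the authoritative configuration.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """One residual structure: three conv layers (assumed 1x1, 3x3, 1x1), each
    followed by BN and ReLU, a skip connection, then max pooling that halves
    the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
        self.pool = nn.MaxPool2d(2)  # downsample by 2

    def forward(self, x):
        return self.pool(self.body(x) + self.skip(x))

class RES15(nn.Module):
    """Five stacked residual units (5 x 3 = 15 conv layers, 32x total downsampling).
    The outputs of the last three units serve as the multiscale feature maps."""
    def __init__(self, channels=(64, 128, 256, 256, 256)):  # channel widths are assumed
        super().__init__()
        chs = (3,) + tuple(channels)
        self.stages = nn.ModuleList([ResidualUnit(chs[i], chs[i + 1]) for i in range(5)])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats[-3], feats[-2], feats[-1]  # 1/8, 1/16, 1/32 resolution

# Example: a 512 x 512 image yields 64 x 64, 32 x 32 and 16 x 16 feature maps.
f8, f16, f32 = RES15()(torch.randn(1, 3, 512, 512))
```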
Starting from the initial image with three color channels, the backbone generates a lower-resolution activation map $f \in \mathbb{R}^{C \times H \times W}$. Typically, for the basic model of MPEND with RES15, we have $H = H_0/32$ and $W = W_0/32$, where $H_0 \times W_0$ is the size of the input image.
Many existing detection pipelines stack multiple residual units both widely and deeply to improve the feature extraction capability of the backbone network. MPEND does not stack multiple residual units on the same layer to widen the network. This is because the feature extraction process of CNN-based pipelines relies almost entirely on the backbone network, whereas MPEND mainly uses the backbone stage to obtain feature maps of different scales; feature learning can also be realized in the subsequent DETR model.
4.2. Multiscale Position Encoding Net
The original DETR used a feature map that is 32 times smaller than the input image. Although DETR can learn the overall relations of the whole map, the information of small objects is dismissed. As a result, DETR's detection results for small objects are not as good as those for big ones. Furthermore, both the original Transformer and the subsequent DETR architecture adopted trigonometric functions to generate the positional encoding, which is subjective and empirical. This section presents a multiscale position encoding net, including a multiscale feature maps module and an automatic position encoding net.
Multiscale feature maps. Multiscale feature maps have been verified to be an effective method for object detection. Similar to the tricks used in spatial pyramid pooling (SPP) and YOLOv3, three sizes of feature maps from different layers of the backbone are adopted as the input of the following detection module. On the smaller map, the detector can model the overall information to grasp the features of big objects, while on the bigger map, the detector focuses on the fine-grained features of local parts, which is effective for small object detection. Typically, for an input image with 512 × 512 pixels, the three feature maps have sizes of 16 × 16, 32 × 32 and 64 × 64, respectively.
One problem is that DETR only accepts input with a fixed length; hence, we collapse the spatial dimensions of the smallest feature map into one dimension, resulting in a $d \times HW$ feature map. In order to keep the representation style uniform and, furthermore, to facilitate the position encoding in the next step, the other two bigger maps are split into tiles of size 16 × 16.
Figure 5 presents the flattening process for the smallest feature map, as well as the splitting result for the second feature map. The splitting process of the biggest feature map is similar to Figure 5. Typically, the 64 × 64 feature map is first split into four 32 × 32 maps, and then each of them is split in the same way as in Figure 5. The whole procedure is like a recursion.
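A minimal sketch of this flatten-and-split step is shown below. The recursive quartering order and the tensor layout are assumptions, chosen only to illustrate how every map ends up as 16 × 16 tiles that are then flattened into token sequences.

```python
import torch

def split_to_tiles(fmap: torch.Tensor, tile: int = 16) -> torch.Tensor:
    """Recursively split a (B, C, H, W) feature map into (B*N, C, tile, tile)
    tiles by quartering it until the target size is reached."""
    b, c, h, w = fmap.shape
    if h == tile and w == tile:
        return fmap
    # quarter the map into four equally sized sub-maps, then recurse
    top, bottom = fmap.chunk(2, dim=2)
    quarters = [q for half in (top, bottom) for q in half.chunk(2, dim=3)]
    return torch.cat([split_to_tiles(q, tile) for q in quarters], dim=0)

def flatten_spatial(fmap: torch.Tensor) -> torch.Tensor:
    """Collapse the spatial dimensions into one: (B, C, H, W) -> (B, H*W, C),
    the sequence layout expected by the Transformer encoder."""
    return fmap.flatten(2).transpose(1, 2)

f64 = torch.randn(1, 256, 64, 64)
tiles = split_to_tiles(f64)       # (16, 256, 16, 16): sixteen 16 x 16 tiles
tokens = flatten_spatial(tiles)   # (16, 256, 256) token sequences
```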
It should be noted that the encoder and decoder used in the DETR detector are permutation-invariant. Therefore, from the feature point of view, splitting the larger feature maps does not change the amount of feature information they carry. However, this process is essential for the next position encoding step.
Position Encoding Net. In the original Transformer, sine and cosine functions are adopted for positional encoding. DETR adopts a generalization of the original Transformer encoding to the 2D case, independently using sine and cosine functions to yield a fixed absolute encoding that represents the spatial positions of the image. Both encoding methods need manually designed formulas and introduce extra hyperparameters.
Given a picture of a bird, after a glimpse, we remember the bird and where it is. We have this memory not because we memorize the position information explicitly, but because of the information contained in that area. That is, the information (or features) in different areas of an image itself carries a position “encoding”. An intuitive assumption is that this information can be used to encode the position embedding.
Based on the analysis above, instead of directly encoding the location with extra formulas, we encode the position using the information from different locations. Typically, for the feature map of the smallest size (16 × 16), a positional encoding branch is designed, as presented in Figure 6. The input of the positional encoding branch is the output feature map of the last residual block. After padding, the feature map passes through a convolutional layer with 3 × 3 kernels. The resulting feature map is combined with the original input. The last convolutional layer has 256 kernels of size 1 × 1, in order to compress the dimension. The output size of the PEN is the same as that of the input feature map, which is essential for positional encoding.
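A possible PyTorch realization of this branch is sketched below. The way the intermediate map is combined with the original input is not specified above, so an element-wise addition is assumed here.

```python
import torch
import torch.nn as nn

class PEN(nn.Module):
    """Position Encoding Net sketch: pad -> 3x3 conv -> combine with the input
    -> 1x1 conv with 256 kernels. The 'combine' step is assumed to be an
    element-wise addition; concatenation would also work with a wider 1x1 layer."""
    def __init__(self, in_ch: int, d_model: int = 256):
        super().__init__()
        # padding=1 keeps the spatial size unchanged after the 3x3 convolution
        self.conv3x3 = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1)
        self.conv1x1 = nn.Conv2d(in_ch, d_model, kernel_size=1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        x = self.conv3x3(f) + f    # combine with the original input
        return self.conv1x1(x)     # compress channels to d_model = 256

pos = PEN(in_ch=256)(torch.randn(1, 256, 16, 16))  # same spatial size: (1, 256, 16, 16)
```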
For the feature maps with larger sizes, two encoding strategies are designed, as presented in Figure 7. The first encoding strategy (named PEN-1) is detailed on the left side of Figure 7. For the smallest size (16 × 16), the feature map is directly input into the PEN for positional encoding, and the result is marked as $P_1$. For the larger size of 32 × 32, the 16 × 16 feature map is first upsampled to the same size as the 32 × 32 map; then, the upsampled result is input into the PEN for positional encoding and the result is marked as $P_2$. Finally, the 16 × 16 feature map is also upsampled to the size of 64 × 64, followed by the PEN, and the result is marked as $P_3$. It should be noted that the three feature maps “approximately” share the same positional encoding map, since all of the inputs of the PENs are based on the smallest feature map. However, the three PENs used in the encoding processes are independent of each other.
The above description indicates that only the smallest feature map, obtained from the last residual block, is used for positional encoding. This is based on the intuitive assumption that the highly compressed features are sufficient for the simple positional encoding task. To verify this assumption, an extra encoding strategy (named PEN-2) is designed, as shown on the right side of Figure 7. For each feature map, the input of the PEN is the feature map obtained from the output of the corresponding residual block, and the upsampling step is removed. The resulting positional encoding maps are independent of each other. The effectiveness of these two strategies is compared in the next section.
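The two strategies can be summarized with the following sketch, which reuses the PEN module from the previous listing; the bilinear upsampling mode and the 256-channel inputs are assumptions.

```python
import torch
import torch.nn.functional as F

# Three independent PEN instances, one per scale (PEN as defined in the sketch above).
pen_a, pen_b, pen_c = PEN(256), PEN(256), PEN(256)

f16 = torch.randn(1, 256, 16, 16)   # output of the last residual block
f32 = torch.randn(1, 256, 32, 32)   # outputs of the earlier residual blocks
f64 = torch.randn(1, 256, 64, 64)

def pen_1(f16):
    """PEN-1: every scale is encoded from the smallest feature map,
    upsampled to the target resolution before its own PEN."""
    p1 = pen_a(f16)
    p2 = pen_b(F.interpolate(f16, size=(32, 32), mode="bilinear", align_corners=False))
    p3 = pen_c(F.interpolate(f16, size=(64, 64), mode="bilinear", align_corners=False))
    return p1, p2, p3

def pen_2(f16, f32, f64):
    """PEN-2: each scale is encoded from its own backbone feature map;
    no upsampling, so the encodings are independent of each other."""
    return pen_a(f16), pen_b(f32), pen_c(f64)
```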
4.3. Loss Functions
The original DETR includes four parts: the backbone based on the residual network, the Transformer encoder, the Transformer decoder and the prediction feed-forward network. The definition of the loss function is one of the most important steps for object detection. DETR infers a fixed-size set of $N$ predictions; then, an optimal bipartite matching is conducted between the predicted and ground truth objects and, finally, the object-specific (bounding box) losses are optimized. The match cost between ground truth $y_i$ and the prediction with index $\sigma(i)$ is defined as:

$$\mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big) = -\mathbb{1}_{\{c_i \neq \varnothing\}}\,\hat{p}_{\sigma(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) \tag{1}$$

where $y = \{y_i\}$ is the ground truth set of objects and $\hat{y} = \{\hat{y}_i\}_{i=1}^{N}$ is the set of $N$ predictions. Each ground truth element can be written as $y_i = (c_i, b_i)$, with $c_i$ the target class label (which may be $\varnothing$) and $b_i$ the ground truth box. $\hat{p}_{\sigma(i)}(c_i)$ is the probability of the prediction with index $\sigma(i)$ for class $c_i$, and $\hat{b}_{\sigma(i)}$ is the predicted box. Then, DETR finds a bipartite matching between the ground truth and the predictions by minimizing the following objective function:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}\big(y_i, \hat{y}_{\sigma(i)}\big) \tag{2}$$
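For a single image, the matching step can be sketched as follows using the Hungarian solver from SciPy; the helper box_cost_fn is a hypothetical placeholder for the box term of Equation (1) (for example, Equation (4) or the CIoU-based Equation (5)).

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, box_cost_fn):
    """Bipartite matching sketch for one image.
    pred_logits: (N, num_classes + 1), pred_boxes: (N, 4),
    gt_labels: (M,), gt_boxes: (M, 4); box_cost_fn must return an (N, M)
    matrix of pairwise box losses."""
    prob = pred_logits.softmax(-1)                  # class probabilities
    cost_class = -prob[:, gt_labels]                # -p_sigma(i)(c_i), shape (N, M)
    cost_box = box_cost_fn(pred_boxes, gt_boxes)    # L_box, shape (N, M)
    cost = (cost_class + cost_box).detach().cpu().numpy()
    # the Hungarian algorithm finds the assignment with minimal total cost
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```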
After finding the optimal matching $\hat{\sigma}$, the total loss function is defined as:

$$\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\,\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big] \tag{3}$$
For the bounding box loss, a linear combination of the $\ell_1$ loss and the generalized IoU loss [54] is adopted:

$$\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) = \lambda_{\mathrm{iou}}\,\mathcal{L}_{\mathrm{iou}}\big(b_i, \hat{b}_{\sigma(i)}\big) + \lambda_{L1}\,\big\| b_i - \hat{b}_{\sigma(i)} \big\|_1 \tag{4}$$

where $\lambda_{\mathrm{iou}}$ and $\lambda_{L1}$ are two hyperparameters.
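For reference, a compact sketch of Equation (4) for already-matched pairs is given below; the boxes are assumed to be in (x1, y1, x2, y2) format, and the λ values shown simply follow those commonly used with DETR.

```python
import torch
from torchvision.ops import generalized_box_iou

def box_loss_l1_giou(pred_boxes, gt_boxes, lambda_iou=2.0, lambda_l1=5.0):
    """DETR-style box loss for matched prediction/ground-truth pairs,
    boxes given as (x1, y1, x2, y2)."""
    l1 = (pred_boxes - gt_boxes).abs().sum(-1)                   # L1 term per pair
    giou = generalized_box_iou(pred_boxes, gt_boxes).diagonal()  # matched pairs on the diagonal
    return (lambda_l1 * l1 + lambda_iou * (1.0 - giou)).mean()
```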
The effectiveness of the $\ell_1$ loss has been proven in many machine learning problems. However, at the later stage of training, the loss function fluctuates around a stable value, making it difficult to converge to a higher precision [53].
In object detection tasks, the intersection over union (IoU) is one of the most commonly used metrics for performance evaluation. A higher value of IoU means a more accurate prediction by the model. In the training stage, IoU can be used as the basis for dividing positive and negative samples in anchor-based methods, and it can also be used in the loss function. However, IoU has a serious defect: if there is no overlap between two targets, IoU is 0 and does not reflect the distance between them. In the case of such non-overlapping targets, if IoU is used as a loss function, the gradient will be 0 and the loss cannot be optimized.
To overcome these drawbacks, Complete-IoU (CIoU) [54] was proposed. In CIoU, a term based on the minimum enclosing rectangle of the two boxes is added to IoU, which measures the distance between the two boxes and thus solves the problem of the zero gradient when the two objects do not intersect. Furthermore, the normalized distance between the center points of the two bounding boxes is minimized to accelerate the convergence process. At the same time, the aspect ratio of the boxes is also introduced to further measure their shape. CIoU has been verified to achieve better convergence speed and accuracy for bounding box prediction problems. Here, we redefine the bounding box loss of DETR as:
$$\mathcal{L}_{\mathrm{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) = 1 - \mathrm{IoU} + \frac{\rho^2\big(b_i, \hat{b}_{\sigma(i)}\big)}{c^2} + \alpha v \tag{5}$$

where $\rho\big(b_i, \hat{b}_{\sigma(i)}\big)$ is the Euclidean distance between the centers of the ground truth and the prediction, $c$ represents the diagonal distance of the smallest enclosing rectangle of the two boxes and $\alpha v$ is a penalty term considering the ratio of the width and the height, i.e.,

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_i}{h_i} - \arctan\frac{\hat{w}_{\sigma(i)}}{\hat{h}_{\sigma(i)}}\right)^2, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v} \tag{6}$$
From Equation (5), we can find that the two extra hyperparameters are removed and the loss function comprehensively considers the shape of the ground truth and the predictions.
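A minimal sketch of Equation (5) for matched pairs of boxes is given below; boxes are assumed to be in (x1, y1, x2, y2) format, so a conversion step would be needed if the detector outputs normalized center coordinates.

```python
import math
import torch

def ciou_loss(pred, gt):
    """CIoU-based box loss of Equation (5), one value per matched pair of
    (x1, y1, x2, y2) boxes."""
    # intersection and union for the plain IoU term
    lt = torch.max(pred[:, :2], gt[:, :2])
    rb = torch.min(pred[:, 2:], gt[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    iou = inter / (area_p + area_g - inter + 1e-7)

    # squared distance between the box centers (rho^2)
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_g = (gt[:, :2] + gt[:, 2:]) / 2
    rho2 = ((center_p - center_g) ** 2).sum(-1)

    # squared diagonal of the smallest enclosing rectangle (c^2)
    enc_lt = torch.min(pred[:, :2], gt[:, :2])
    enc_rb = torch.max(pred[:, 2:], gt[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(-1) + 1e-7

    # aspect-ratio penalty term alpha * v (alpha is treated as a constant
    # during backpropagation, following the CIoU formulation)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_g, h_g = gt[:, 2] - gt[:, 0], gt[:, 3] - gt[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_g / h_g) - torch.atan(w_p / h_p)) ** 2
    with torch.no_grad():
        alpha = v / ((1 - iou) + v + 1e-7)

    return 1 - iou + rho2 / c2 + alpha * v
```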