Article

EYOLOX: An Efficient One-Stage Object Detection Network Based on YOLOX

Rui Tang, Hui Sun, Di Liu, Hui Xu, Miao Qi and Jun Kong
1 College of Information Science and Technology, Northeast Normal University, Changchun 130117, China
2 Institute for Intelligent Elderly Care, Changchun Humanities and Sciences College, Changchun 130117, China
3 Key Laboratory of Applied Statistics of MOE, Northeast Normal University, Changchun 130024, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2023, 13(3), 1506; https://doi.org/10.3390/app13031506
Submission received: 7 December 2022 / Revised: 19 January 2023 / Accepted: 21 January 2023 / Published: 23 January 2023
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Object detection has drawn the attention of many researchers due to its wide use in computer vision applications. In this paper, a novel model is proposed for object detection. Firstly, a new neck is designed for the proposed detection model, including an efficient SPPNet (Spatial Pyramid Pooling Network), a modified NLNet (Non-Local Network) and a lightweight adaptive feature fusion module. Secondly, a detection head with a double residual branch structure is presented to reduce the latency of the decoupled head and improve detection ability. Finally, these improvements are embedded in YOLOX as plug-and-play modules to form a high-performance detector, EYOLOX (EfficientYOLOX). Extensive experiments demonstrate that EYOLOX achieves significant improvements, raising YOLOX-s from 40.5% to 42.2% AP on the MS COCO dataset with a single GPU. Moreover, the detection performance of EYOLOX also surpasses YOLOv6 and some SOTA methods with comparable numbers of parameters and GFLOPs. Notably, EYOLOX is trained only on the COCO-2017 dataset without using any other datasets, and only the pre-trained weights of the backbone are loaded.

1. Introduction

Object detection is an active research direction in computer vision and digital image processing and has been widely used in fields such as robot navigation [1], intelligent video surveillance [2], autonomous driving [3,4,5], industrial monitoring [6,7,8] and aerospace. It reduces the consumption of human labor and has important practical significance. In addition, object detection is an important prerequisite for many high-level visual processing and analysis tasks, such as behavior analysis [9,10,11] and scene semantic understanding [12,13,14].
In recent years, with the development of object detection, one-stage detectors have gradually overtaken two-stage detectors. Because a one-stage detector does not need to first generate regions of interest through a region proposal network, it has gained popularity for its balance of speed and accuracy. For example, RetinaNet [15] was a milestone as the first one-stage detector to surpass two-stage detectors, and the YOLO series [16,17,18,19,20,21] is the most widely used family of detectors in industry.
The ability of a detector to detect objects of various sizes is an important indicator of its performance, and the detection of small objects in particular has always been a challenging task. To address the poor small-object detection of SSD [22], Lin et al. [23] proposed FPN to cope with the weak semantic information of shallow features. Many subsequent one-stage object detectors then combined multi-scale feature fusion with multi-scale prediction to improve performance [15,17,24]. Moreover, several variants of FPN have been proposed to optimize the original structure, such as AugFPN [25], NAS-FPN [26] and PANet [27]. However, researchers have pointed out two main limitations of FPN: its computational complexity, and the fact that many of the features it fuses carry contradictory information [28]. Whether good detection can be achieved without FPN therefore becomes our motivation.
Detectors with FPN FPN enriches the semantic information of low-level features by fusing them with high-level features, and its lateral connections maintain multi-scale prediction. Applying FPN to both a two-stage detector (Faster R-CNN [29]) and a one-stage detector (RetinaNet) significantly improved detection performance. Hence, FPN and its variants have made a splash in object detection: YOLOv3 [17] used FPN to achieve a balance of accuracy and speed that is still popular in industry today, and PANet was applied in YOLOv4 [18] and YOLOv5 [19], adding a bottom-up path that fuses low-level features into high-level features on top of FPN. EfficientDet [30] proposed BiFPN to iteratively perform bidirectional feature fusion. Unlike the above methods, FPN is not used as the encoder of the detection model in this paper.
Detectors without FPN In the early days, Fast R-CNN [31] used features generated by RoI pooling layers to make predictions, and YOLO [20] and YOLOv2 [21] only employed the last output feature of the backbone. They are fast, but their performance deficit compared with multi-scale prediction is obvious. CenterNet [32] and CornerNet [33] obtained better results by predicting on features downsampled only by a factor of four, but such high-resolution features bring huge memory consumption. DETR [34] introduced the transformer into object detection and achieved state-of-the-art results using only the last layer of features. Nevertheless, DETR's long training time and difficult convergence are inherent disadvantages, and it also struggles with small objects. YOLOF [35] argued that the success of FPN comes not from the fusion of multi-scale features but from its divide-and-conquer design; it expands the receptive field of single-level features using dilated convolution and outperforms DETR in both speed and accuracy. The above detectors have rich semantic information but weak positional information: although their performance on large objects is satisfactory, it remains unsatisfactory on small objects. Different from these methods, multi-scale prediction is applied as the main structure of our proposed new neck, and the NLNet [36] structure from the segmentation domain is introduced to enhance the representation ability of shallow features.
Non-local Neural Network CNNs use convolution kernels of different sizes to extract local features from the feature map. Adjacent regions of the feature map can only interact by stacking consecutive convolutional blocks, which is too inefficient to capture long-range dependencies. NLNet computes the dependency of each feature point on the full map with the self-attention mechanism. Nevertheless, self-attention requires dot-product operations among large matrices, which consumes a lot of memory. Therefore, some works [37,38,39,40] directly or indirectly addressed the excessive computation of self-attention: GCNet [39] performs only one dot-product operation, which directly reduces the amount of calculation, and ANNNet [40] reduces the size of the Key and Value by sampling to decrease the computation of the dot product. Unlike these existing methods, this paper proposes a cheaper operation to lessen the computational burden of self-attention.
Decoupled Head Ge et al. [16] pointed out in YOLOX that, in the traditional coupled head, classification and regression predictions are all obtained from a single convolution, which causes a conflict between the two tasks. Consequently, the decoupled head handles the classification and regression tasks separately through a parallel structure, which not only improves accuracy but also speeds up convergence. YOLOX is not the first detector to use a parallel-structure head: in earlier detection work, both FCOS [24] and RetinaNet utilized detection heads similar to the decoupled head. Although the decoupled head performs better than the traditional coupled head, the stacked convolutional blocks on the parallel branches cause a delay. In our tests, we also found that further increasing the number of prediction branches in the detection head speeds up convergence during training, but this results in significant memory consumption. Unlike these parallel-structure heads, this paper proposes a more efficient decoupled head for object detection.
Structural reparameterization from RepVGG [41] is an effective technique for detection because its layer fusion makes better use of the GPU. Currently, YOLOv6 [42], YOLOv7 [43] and PPYOLOE [44] all utilize structural reparameterization to multiply their FPS when deployed with TensorRT [45]. However, our experiments found that reparameterization imposes a heavy training burden, which is unfriendly to low-end GPU devices. Fortunately, YOLOX achieves good real-time performance without the RepVGG structure. Meanwhile, YOLOX is an anchor-free detector, which avoids carefully designing prior boxes for a specific detection task. Therefore, YOLOX is chosen as the baseline network to evaluate all proposed modules in this paper.
To sum up, a novel FPN-free object detection network, named EYOLOX, is built with an encoder-decoder architecture. Firstly, a novel neck is developed by combining detection and segmentation methods. Secondly, an efficient detection head is designed using a simple channel assignment strategy and a double residual branch structure. In particular, a low-cost operation for self-attention computation is put forward to reduce the computational burden. Finally, the effectiveness and superiority of EYOLOX are evaluated by a large number of experiments, and the proposed EYOLOX achieves higher detection accuracy with fewer parameters.
The contributions of this paper can be summarized as follows:
  • In order to make better use of the features extracted by the backbone, a new neck is designed for object detection as an alternative to the mainstream methods, which may break the bottleneck of some existing FPN-based works;
  • An efficient detection head is put forward, which exploits simple methods to achieve higher performance.
  • A low-cost operation for non-local neural networks in self-attention computation is introduced to improve the calculation speed.
  • The proposed method exhibits more powerful object detection, increasing YOLOX-s from 40.5% to 42.2% AP on the COCO-2017 dataset.

2. Materials and Methods

In this section, the backbone, neck and head of EYOLOX are described in turn. The overall network structure of EYOLOX is shown in Figure 1.

2.1. Backbone

EYOLOX still uses CSPDarkNet53 [18] as the backbone network. On the one hand, it splits features along the channel dimension with 1 × 1 convolutions, and the two splits are used for feature extraction and the final concatenation, respectively. This design gives DarkNet53 [17] a richer mix of gradients and fewer parameters than ResNet-style backbones [46,47], while maintaining accuracy despite being lightweight. On the other hand, keeping the same backbone facilitates the comparison with YOLOX, and its effectiveness as the backbone of our method has also been verified.

2.2. Neck

The proposed new neck includes our designed CSPD Block (CSPNet Dilated), CSPFSA Block (CSPNet Focus Self-Attention), CSPSPPF (CSPNet SPPF) and SimASFF (Simplified ASFF).
CSPD Block The features output by the backbone network need to be further processed before being fed to the detection head. Among them, the last layer of features, with its rich semantic information, is the most important for object detection. In YOLOF, its receptive field is expanded by stacking dilated convolutional blocks with different dilation rates; different rates are used to prevent gridding effects. In EYOLOX, the last layer of features is processed by the dilated encoder proposed in YOLOF, combined with the CSPNet [48] structure, which reduces the computational effort and enriches the gradient information. In particular, CSPNet stacks half of the original features as a shortcut with the other half of the processed features. Different from CSPNet, the CSPD Block adds the CBAM [49] attention mechanism on the shortcut so that the model pays more attention to important information, as shown in Figure 2.
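To make the structure more concrete, the following is a minimal PyTorch-style sketch of the CSPD Block idea described above: half of the channels pass through residual dilated blocks (as in YOLOF's dilated encoder) while the other half passes through CBAM-style attention as the shortcut. The dilation rates, the channel split ratio and the simplified attention module are illustrative assumptions, not the exact implementation.

```python
# Sketch of the CSPD Block idea (Figure 2); dilation rates, channel split and
# the reduced CBAM stand-in are assumptions for illustration.
import torch
import torch.nn as nn

class DilatedResBlock(nn.Module):
    def __init__(self, c, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c, c // 2, 1), nn.SiLU(),
            nn.Conv2d(c // 2, c // 2, 3, padding=dilation, dilation=dilation), nn.SiLU(),
            nn.Conv2d(c // 2, c, 1), nn.SiLU(),
        )

    def forward(self, x):
        return x + self.block(x)            # residual dilated block

class SimpleCBAM(nn.Module):
    """Very reduced channel + spatial attention standing in for CBAM [49]."""
    def __init__(self, c):
        super().__init__()
        self.channel = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                     nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        x = x * self.channel(x)                               # channel attention
        return x * self.spatial(x.mean(dim=1, keepdim=True))  # spatial attention

class CSPDBlock(nn.Module):
    def __init__(self, channels, dilations=(2, 4, 6, 8)):
        super().__init__()
        half = channels // 2
        self.proj_main = nn.Conv2d(channels, half, 1)    # c1 path (processed)
        self.proj_short = nn.Conv2d(channels, half, 1)   # c2 path (shortcut)
        self.dilated = nn.Sequential(*[DilatedResBlock(half, d) for d in dilations])
        self.cbam = SimpleCBAM(half)
        self.merge = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        main = self.dilated(self.proj_main(x))   # enlarged receptive field
        short = self.cbam(self.proj_short(x))    # attention-weighted shortcut
        return self.merge(torch.cat([main, short], dim=1))
```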
CSPFSA Block The computation of the self-attention mechanism is expensive, and researchers have tried to reduce its computational cost in recent years. The Focus operation in YOLOv5, which trades the spatial dimension for the channel dimension, inspired our way of reducing the computation of self-attention. The patch merging operation in the Swin Transformer [50] uses the same idea, and its authors also pointed out that the commonly used max-pooling layer or ordinary convolution with a stride of 2 for downsampling causes information loss. CCNet [37] and ISANet [38] divide the calculation over the whole feature map into two parts to complete self-attention in an indirect way. However, they need to process the features on the original feature map, so the number of divided patches multiplies the batch size passed into the encoder. If the training batch is too large and the number of patches too high, video memory will overflow because the GPU has to process too much data at once. In contrast, the Focus Self-Attention Block effectively avoids computing attention over the whole feature map directly by following the interval-sampling idea of the Focus module, as shown in Figure 3.
To be more specific, the Q (query), K (key) and V (value) matrices used for the self-attention computation are each divided into four matrices, each a quarter of the size of the original, by sampling values at intervals. This method takes up less memory at the cost of a larger number of attention computations.
However, the Focus Self-Attention Block only computes attention within each quarter of the feature map separately. In order to perform global attention computation, it is necessary to exchange information between the four regions through an inexpensive operation. To this end, after the feature map is restored to its original layout, each 2 × 2 region contains features from the four different quarters. The cheap operation then uses a 2 × 2 convolution for information interaction across regions, instead of performing expensive self-attention on those regions again. Although, as noted in MobileViT [51], there is not much related information between neighboring regions of a feature map, it is still reasonable to achieve global interaction with a single convolution operation.
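The following is a minimal sketch of the Focus Self-Attention idea together with the cheap operation: the Q/K/V maps are sampled at intervals into four quarter-size maps, self-attention is computed on each quarter, the original layout is restored, and a single 2 × 2 convolution mixes the four interleaved positions. The shared attention module, the padding used to keep the spatial size and the even input resolution are assumptions made for illustration.

```python
# Sketch of Focus Self-Attention (Figure 3) plus the cheap 2x2-conv interaction;
# padding choice, shared attention weights and even H/W are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySelfAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.q = nn.Conv2d(c, c, 1)
        self.k = nn.Conv2d(c, c, 1)
        self.v = nn.Conv2d(c, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)       # (b, hw, c)
        k = self.k(x).flatten(2)                        # (b, c, hw)
        v = self.v(x).flatten(2).transpose(1, 2)        # (b, hw, c)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)  # (b, hw, hw) on a quarter map
        return x + (attn @ v).transpose(1, 2).reshape(b, c, h, w)

class FocusSelfAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.attn = TinySelfAttention(c)                # shared across the four quarters
        self.cheap = nn.Conv2d(c, c, kernel_size=2)     # 2x2 conv for cross-region mixing

    def forward(self, x):
        # Focus-style interval sampling: four quarter-size views of the same map.
        quarters = [x[..., 0::2, 0::2], x[..., 1::2, 0::2],
                    x[..., 0::2, 1::2], x[..., 1::2, 1::2]]
        quarters = [self.attn(q) for q in quarters]     # attention on 1/4-size maps
        # Restore the original spatial layout (inverse of the interval sampling).
        out = torch.zeros_like(x)
        out[..., 0::2, 0::2], out[..., 1::2, 0::2] = quarters[0], quarters[1]
        out[..., 0::2, 1::2], out[..., 1::2, 1::2] = quarters[2], quarters[3]
        # Cheap cross-region interaction: pad by one pixel so the 2x2 conv keeps H x W.
        return self.cheap(F.pad(out, (0, 1, 0, 1)))

# usage: y = FocusSelfAttention(256)(torch.randn(1, 256, 40, 40))
```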
The self-attention mechanism makes the model focus on positional information but ignores the channel information of the features. DANet [52] used a parallel structure to attend to both positional and channel information, and CBAM used a serial structure to merge channel attention with spatial attention. Since the non-local structure has already learned sufficient positional information, an ESE [53] block is appended to increase the model's attention to channel information. ESE acts as an efficient version of SENet [54], avoiding the information loss caused by SENet's dimensionality reduction and expansion. Thus, the CSPNet Focus Self-Attention Block is presented in Figure 4.
CSPSPPF SPPNet [55] was originally used to avoid problems such as image distortion caused by cropping and scaling operations. It was later used in downstream computer vision tasks to provide multiple receptive fields for deep features using max-pooling layers of different sizes. The authors of YOLOv5 proposed SPPF (SPP-Fast) in an updated version to reduce the latency caused by large pooling kernels. In the recent YOLOv6 and YOLOv7, SPP has also been improved as SimSPPF and SPPCSPC, respectively. After combining the methods of these state-of-the-art detectors and analyzing their advantages and disadvantages, CSPSPPF is introduced in our work. This new SPP module uses group convolution to solve the problem of the large number of parameters in SPPCSPC and adopts the ReLU activation function in place of SiLU, which makes the whole module faster, as shown in Figure 5.
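A rough sketch of the CSPSPPF idea is given below: a CSP-style split whose main path applies three cascaded 5 × 5 max-pooling layers (equivalent in receptive field to parallel 5/9/13 pools) and uses group convolution with ReLU to cut parameters. The kernel sizes, group count and channel widths are assumptions rather than the exact configuration.

```python
# Sketch of the CSPSPPF idea (Figure 5); kernel sizes, groups and widths are assumptions.
import torch
import torch.nn as nn

def conv(cin, cout, k=1, groups=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),          # ReLU instead of SiLU for speed
    )

class CSPSPPF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.main_in = conv(channels, half, 1)
        self.shortcut = conv(channels, half, 1)
        self.pre = conv(half, half, 3, groups=4)      # group conv to cut parameters
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.post = conv(half * 4, half, 1)           # fuse the pooled pyramid
        self.merge = conv(channels, channels, 1)

    def forward(self, x):
        m = self.pre(self.main_in(x))
        p1 = self.pool(m)                              # cascaded 5x5 pools give
        p2 = self.pool(p1)                             # effective 9x9 and
        p3 = self.pool(p2)                             # 13x13 receptive fields
        m = self.post(torch.cat([m, p1, p2, p3], dim=1))
        return self.merge(torch.cat([m, self.shortcut(x)], dim=1))
```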
SimASFF In ASFF [28], it is argued that the feature fusion in FPN mixes in a lot of contradictory information. As a result, ASFF filters out useless information by assigning weights. The weighted operation can be formulated as:
Y_{l_1} = \alpha X_{l_1} + \beta X_{l_2} + \gamma X_{l_3}
where X_{l_1}, X_{l_2} and X_{l_3} are the features from the different layers, respectively. The learnable parameters \alpha, \beta and \gamma are calculated by applying Softmax to the weights obtained at each layer, which can be formulated as:
W = \mathrm{Softmax}([weight_1, weight_2, weight_3])
(\alpha, \beta, \gamma) = (W[:, 0:1], W[:, 1:2], W[:, 2:])
Nevertheless, we argue that, when doing multi-scale prediction, the features from the corresponding layer contain the most important information. Thus, for a given layer, only the features from the other layers are weighted to filter contradictory information. In this way, the information of the original layer is completely preserved while the contradictory information is filtered. This can be formulated as:
Y_{l_1} = X_{l_1} + \alpha X_{l_2} + \beta X_{l_3}
W = \mathrm{Softmax}([weight_2, weight_3])
(\alpha, \beta) = (W[:, 0:1], W[:, 1:])
In EYOLOX, only the low-level and intermediate features are filtered by the weighting operation. The high-level features processed by CSPSPPF and the CSPD Block already possess rich semantic information and sufficient receptive fields. Consequently, high-level features are not processed by SimASFF, which avoids the impact of conflicting information and reduces the amount of computation.
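The weighting in the equations above can be illustrated with the following sketch of SimASFF for one output level: the level's own feature is kept intact, while the other two levels are resized, weighted by softmax-normalized per-pixel weights and added. The 1 × 1 weight convolutions and the resizing by interpolation are assumptions for illustration.

```python
# Sketch of SimASFF for one output level, following Y_l1 = X_l1 + alpha*X_l2 + beta*X_l3;
# the 1x1 weight convs and nearest-neighbour resizing are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimASFF(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # one 1x1 conv per auxiliary level predicts a single-channel weight map
        self.w2 = nn.Conv2d(channels, 1, 1)
        self.w3 = nn.Conv2d(channels, 1, 1)

    def forward(self, x_l1, x_l2, x_l3):
        size = x_l1.shape[-2:]
        x_l2 = F.interpolate(x_l2, size=size, mode="nearest")
        x_l3 = F.interpolate(x_l3, size=size, mode="nearest")
        w = torch.softmax(torch.cat([self.w2(x_l2), self.w3(x_l3)], dim=1), dim=1)
        alpha, beta = w[:, 0:1], w[:, 1:]
        # the level's own feature passes through unweighted, the others are filtered
        return x_l1 + alpha * x_l2 + beta * x_l3
```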

2.3. Head

ERDHead YOLOv6 used a hybrid-channel strategy to redesign the decoupled head and alleviate its latency problem. PPYOLOE first introduced the original features from the encoder as a single residual connection in the detection head, pointing out that separating the classification and regression tasks leads to a lack of task-specific learning in the overall model; however, its ET-Head only introduces a residual branch on the classification branch to enhance this learning. In contrast, the proposed ERDHead (Efficient Residual Decoupled Head) uses the hybrid-channel strategy and additionally introduces a residual branch on the location regression branch. With these two improvements, ERDHead has double residual branches to compensate for the performance loss caused by executing classification and regression independently, while further reducing latency. Our experiments also show that connecting the features from the neck as a residual edge to the regression branch improves the accuracy of position prediction without adding extra computational burden, as shown in Figure 6.
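A minimal sketch of the ERDHead idea follows: a decoupled head with a single 3 × 3 convolution per branch (the hybrid-channel strategy) and the stem feature added back as a residual edge to both the classification and regression branches, as in Figure 6. The 1 × 1 stem, the channel widths and the prediction layout are assumptions; the exact layer configuration may differ.

```python
# Sketch of the ERDHead idea (Figure 6); stem, widths and prediction layout are assumptions.
import torch
import torch.nn as nn

class ERDHead(nn.Module):
    def __init__(self, in_channels, head_channels, num_classes):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_channels, head_channels, 1), nn.SiLU())
        self.cls_conv = nn.Sequential(nn.Conv2d(head_channels, head_channels, 3, padding=1), nn.SiLU())
        self.reg_conv = nn.Sequential(nn.Conv2d(head_channels, head_channels, 3, padding=1), nn.SiLU())
        self.cls_pred = nn.Conv2d(head_channels, num_classes, 1)
        self.reg_pred = nn.Conv2d(head_channels, 4, 1)
        self.obj_pred = nn.Conv2d(head_channels, 1, 1)

    def forward(self, x):
        s = self.stem(x)
        cls_feat = self.cls_conv(s) + s      # residual edge on the classification branch
        reg_feat = self.reg_conv(s) + s      # residual edge on the regression branch
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)
```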

3. Results and Analysis

All models are trained on the COCO-2017 training set. For the ablation experiments and the comparison with other existing methods, the COCO-2017 validation set is used for evaluation.
Implementation Details Extensive experiments are conducted in the same environment. The model is trained for a total of 300 epochs with 5 warmup epochs, using stochastic gradient descent (SGD). The learning rate is set as lr × batch size / 64, with an initial lr = 0.01 and a cosine lr schedule. The weight decay is 0.0005 and the SGD momentum is 0.9. The input size is evenly drawn from 448 to 832 in steps of 32. Furthermore, the model is trained with a batch size of 64 on a single NVIDIA A40, and the training procedure is almost identical to that of YOLOX, except for the device used and the batch size setting.
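For reference, the optimizer and learning-rate schedule described above can be sketched as follows; the model argument and the per-epoch update loop are placeholders, and only the hyperparameters stated in this section are taken from the paper.

```python
# Sketch of the training schedule above: lr scaled by batch size, SGD with
# momentum 0.9 and weight decay 5e-4, 5 warmup epochs, cosine decay over 300 epochs.
import math
import torch

def build_optimizer_and_schedule(model, batch_size, total_epochs=300, warmup_epochs=5):
    base_lr = 0.01 * batch_size / 64
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, weight_decay=5e-4)

    def lr_at(epoch):
        if epoch < warmup_epochs:                        # linear warmup
            return base_lr * (epoch + 1) / warmup_epochs
        progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

    return optimizer, lr_at

# usage (per epoch): for g in optimizer.param_groups: g["lr"] = lr_at(epoch)
```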
Ablation Experiment YOLOX with PANet removed is used as the baseline, and the final result of EYOLOX is compared with YOLOX-s. As shown in Table 1, the performance drops from 40.5% to 33.1% (−7.4% AP) after PANet is removed. When the CSPD Block is added, the performance improves from 33.1% to 37.0% (+3.9% AP); in particular, the ability to detect large objects is 3.4% higher than YOLOX-s. When the CSPFSA Block is employed, the performance increases from 37.0% to 38.6% (+1.6% AP). However, the self-attention computation requires matrix dot products, which results in a 15% drop in FPS.
From Table 1, it can be seen that the number of model parameters becomes larger after the CSPFSA Block is added, because of the parameters introduced by the block itself and the adjustment of the channel dimension of the detection head.
When CSPSPPF is appended, the detection performance improves from 38.6% to 39.3% (+0.7% AP) without increasing the number of parameters and GFLOPs. At this point, although the performance on large objects is already high, the performance on small objects is lower than YOLOX-s by 4.3% AP. Fortunately, this problem is addressed when the SimASFF module is introduced, which increases the overall performance from 39.3% to 40.8% (+1.5% AP) and the APS from 18.9% to 22.5% (+3.6% AP). However, APL decreases by 2.4%, which may be caused by incomplete filtering of contradictory information, leading to an uneven distribution of samples and affecting the loss calculation. Compared with the contribution of SimASFF to the overall performance, this slight impact on large-object detection is acceptable. Finally, we replace the decoupled head with the ERDHead and improve the performance from 40.8% to 41.4% (+0.6% AP). By this point, our proposed method exceeds YOLOX-s by 0.9% AP.
Cheap Operation versus Expensive Operation The cheap operation refers to interacting across the feature map with a 2 × 2 convolution in our proposed CSPFSA Block, while the expensive operation computes self-attention with the non-local network on the small regions that contain information from the four parts of the feature map. From Table 2, although the detection results of the two approaches are almost identical, the cheap operation is about 13% faster in inference, and its training time is also shorter under the same conditions.
Data Augmentation Strategy The authors of YOLOX mentioned that matching the data augmentation strategy to the size of the model helps improve performance. In some earlier experiments, MixUp [56] was removed and the mosaic augmentation was weakened (the scale range was reduced from [0.1, 2.0] to [0.5, 1.5]). However, when MixUp and the larger mosaic scale range are used in EYOLOX, the detection performance improves substantially, as shown in Table 3. The two strategies also behave differently during training, as shown in Figure 7. If EYOLOX uses the augmentation strategy of YOLOX-s, the AP rises faster during training, and when data augmentation is turned off for the last 15 epochs, the AP shows a moderate increase, as shown in Figure 7a; after reaching its peak, overfitting occurs. With the stronger augmentation strategy, the AP rises gently over the course of training, and when augmentation is turned off for the last 15 epochs, the AP increases dramatically within a short period, after which overfitting also occurs, as shown in Figure 7b. In addition, Figure 7 shows that, regardless of which augmentation strategy is used, training for 300 epochs is the optimal choice, which is consistent with YOLOX. The model gradually becomes stable, or even shows a slight downward trend, after 250 epochs; at this point it has begun to overfit while data augmentation is still on, and subsequently the accuracy gains higher returns once augmentation is turned off.
Comparison with Other SOTA Detectors Table 4 shows the comparison of our method with other state-of-the-art detectors on the COCO-2017 validation set. The performance of our detector is evidently improved over YOLOX: except for small objects, whose AP is improved by only 0.3%, all other evaluation metrics improve by more than 1.0% AP. Specifically, EYOLOX shows a significant improvement in detecting large objects, with an APL improvement of 4.3% over YOLOX-s. The results of EYOLOX clearly outperform PAI-YOLOX [57], proposed by the Alibaba team, and YOLOv6, proposed by Meituan. To be specific, EYOLOX is 0.8% AP higher than PAI-YOLOX, a clear advantage. Furthermore, EYOLOX is 1.1% AP higher than YOLOv6-T (400 epochs) and 1.9% AP higher than YOLOv6-T trained for the same number of epochs (300). Compared with these SOTA detectors, EYOLOX uses fewer parameters and GFLOPs. In addition, it outperforms some much larger models: although the EYOLOX model is about three times smaller, it is still higher than DETR and YOLOF by 0.2% AP and 0.6% AP, respectively. However, DETR, with its transformer architecture, is 2.7% higher on APL than EYOLOX due to the strength of the transformer in extracting features. It is evident from Table 4 that detectors using only single-layer features for prediction are weaker on APS, which also proves the importance of multi-scale prediction: since single-layer features lack the detailed information needed to detect small objects, multi-scale predictions can make up for the loss of detail. However, the weak semantic information of shallow features still needs to be compensated. The FPN used in YOLOv5-SPD-s can mitigate this problem, and although YOLOv5-SPD-s increases the APS of YOLOv5-s by 2.3% by avoiding the information loss caused by feature-size reduction, this method is relatively expensive. In comparison, EYOLOX achieves the same APS as YOLOv5-SPD-s [58], mainly benefiting from the self-attention mechanism, which obtains richer semantic information, while the SimASFF module also filters out contradictory information through its weighting operation.
Efficiency Evaluation To verify the effectiveness of the proposed method on other datasets, we also conducted comparative experiments on the PASCAL VOC 07+12 dataset, as shown in Table 5. Since some recent models only released experimental results on the COCO-2017 dataset, the evaluation results in Table 5 are obtained in our experimental environment. From Table 5, we can see that our method still produces better results. In addition, after running 300 epochs for the three models, we chose 165 epochs as the training length reported in the table, as shown in Figure 8. It can be seen that 300 epochs of training are inefficient for this dataset, and the peak before data augmentation is turned off is already reached at around 150 epochs.
Visual Comparison We demonstrate the differences in results when the same images are processed by EYOLOX and two other detectors, as shown in Figure 9. Figure 9a,c show the detection results of EYOLOX, Figure 9b shows the detection result of YOLOX-s and Figure 9d shows the detection result of YOLOv5-s. From these results, it can be seen that EYOLOX produces higher-confidence predictions and is able to detect small objects.

4. Conclusions

Object detection has gradually become an increasingly important technology in computer vision. In this paper, we proposed a novel detector, EYOLOX, for object detection without FPN. For this detector, a new neck was designed to extract rich features and achieve higher performance; it not only improves and reuses existing modules but also transfers techniques from the segmentation field to the detection field. Moreover, an efficient decoupled head was presented to effectively compensate for the performance loss caused by executing classification and regression independently. Additionally, a low-cost operation for non-local neural networks in self-attention computation was introduced to improve the calculation speed. The effectiveness of the proposed detector was verified by extensive experiments on public datasets, and the comparative results clearly demonstrate that EYOLOX achieves higher detection accuracy. Furthermore, ablation experiments were also carried out to verify the effectiveness of each module of EYOLOX. However, due to the new modules added to the network, the number of model parameters increases to a certain extent, and real-time detection performance is not yet satisfactory. Hence, in order to improve the practicality of the proposed method, optimizing and balancing detection accuracy and speed is our future work. Meanwhile, we will also investigate lightweight methods to reduce the parameters and improve detection speed, and apply the method proposed in this paper as a plug-and-play module to other detectors and fields.

Author Contributions

Data curation, R.T.; formal analysis, H.S.; funding acquisition, J.K.; investigation, D.L.; methodology, R.T.; resources, J.K.; writing—original draft, R.T. and M.Q.; writing—review and editing, M.Q.; conceptualization, H.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (62272096 and 62006174) and the Fund of Jilin Provincial Science and Technology Department (20210201077GX).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, L.; Huang, G.; Mao, Y.; Wang, S.; Kaess, M. EDPLVO: Efficient Direct Point-Line Visual Odometry. In Proceedings of the International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 7559–7565. [Google Scholar]
  2. Wang, R.; Chen, D.; Wu, Z.; Chen, Y.; Dai, X.; Liu, M.; Jiang, Y.-G.; Zhou, L.; Yuan, L. BEVT: BERT Pretraining of Video Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14713–14723. [Google Scholar]
  3. Tan, S.; Wong, K.; Wang, S.; Manivasagam, S.; Ren, M.; Urtasun, R. SceneGen: Learning To Generate Realistic Traffic Scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 892–901. [Google Scholar]
  4. Chen, Y.; Rong, F.; Duggal, S.; Wang, S.; Yan, X.; Manivasagam, S.; Xue, S.; Yumer, E.; Urtasun, R. GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self-Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7230–7240. [Google Scholar]
  5. Prakash, A.; Chitta, K.; Geiger, A. Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7077–7087. [Google Scholar]
  6. Ding, C.; Pang, G.; Shen, C. Catching Both Gray and Black Swans: Open-set Supervised Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 7378–7388. [Google Scholar]
  7. Zaigham Zaheer, M.; Mahmood, A.; Haris Khan, M.; Segu, M.; Yu, F.; Lee, S.I. Generative Cooperative Learning for Unsupervised Video Anomaly Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14724–14734. [Google Scholar]
  8. Dong, Q.; Cao, C.; Fu, Y. Incremental Transformer Structure Enhanced Image Inpainting with Masking Positional Encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11348–11358. [Google Scholar]
  9. Thatipelli, A.; Narayan, S.; Khan, S.; Anwer, R.M.; Khan, F.S.; Ghanem, B. Spatio-temporal Relation Modeling for Few-shot Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 19926–19935. [Google Scholar]
  10. Zadorozhnyi, Z.M.; Muravskyi, V.; Shevchuk, O.; Rusin, V.; Akimjaková, B.; Gažiová, M. Intelligent Behavioural Analysis of Social Network Data for the Purposes of Accounting and Control. In Proceedings of the 2022 12th International Conference on Advanced Computer Information Technologies (ACIT), Ruzomberok, Slovakia, 26–28 September 2022; pp. 276–280. [Google Scholar]
  11. Munro, J.; Damen, D. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 119–129. [Google Scholar]
  12. Ha, H.; Song, S. Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models. In Proceedings of the Conference on Robot Learning, Auckland, New Zealand, 14–18 December 2022. [Google Scholar]
  13. Gopinathan, M.; Truong, G.; Abu-Khalaf, J. Indoor Semantic Scene Understanding Using 2D-3D Fusion. In Proceedings of the 2021 Digital Image Computing: Techniques and Applications (DICTA), Gold Coast, Australia, 29 November–1 December 2021; pp. 1–8. [Google Scholar]
  14. Tosi, F.; Aleotti, F.; Ramirez, P.Z.; Poggi, M.; Salti, S.; Stefano, L.D.; Mattoccia, S. Distilled Semantics for Comprehensive Scene Understanding from Videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4653–4664. [Google Scholar]
  15. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  16. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  18. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  19. YOLOv5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 5 December 2022).
  20. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  21. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 June 2017; pp. 6517–6525. [Google Scholar]
  22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  23. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  24. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9626–9635. [Google Scholar]
  25. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving Multi-Scale Feature Learning for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12592–12601. [Google Scholar]
  26. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  28. Liu, S.; Huang, D.; Wang, Y. Learning spatial fusion for single-shot object detection. arXiv 2019, arXiv:1911.09516. [Google Scholar]
  29. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  30. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  31. Girshick, R. Fast R-CNN. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  32. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6568–6577. [Google Scholar]
  33. Law, H.; Deng, J. CornerNet: Detecting Objects as Paired Keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 765–781. [Google Scholar]
  34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
  35. Chen, Q.; Wang, Y.; Yang, T.; Zhang, X.; Cheng, J.; Sun, J. You Only Look One-Level Feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13039–13048. [Google Scholar]
  36. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar]
  37. Huang, Z.; Wang, X.; Huang, L.; Huang, C.; Wei, Y.; Liu, W. CCNet: Criss-Cross Attention for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 603–612. [Google Scholar]
  38. Huang, L.; Yuan, Y.; Guo, J.; Zhang, C.; Chen, X.; Wang, J. Interlaced Sparse Self-Attention for Segmentation. arXiv 2019, arXiv:1907.12273. [Google Scholar]
  39. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. Gcnet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  40. Zhu, Z.; Xu, M.; Bai, S.; Huang, T.; Bai, X. Asymmetric Non-Local Neural Networks for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 593–602. [Google Scholar]
  41. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  42. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  43. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  44. Xu, S.; Wang, X.; Lv, W.; Chang, Q.; Cui, C.; Deng, K.; Wang, G.; Dang, Q.; Wei, S.; Du, Y.; et al. PP-YOLOE: An evolved version of YOLO. arXiv 2022, arXiv:2203.16250. [Google Scholar]
  45. NVIDIA. TensorRT. 2018. Available online: https://developer.nvidia.com/tensorrt (accessed on 5 December 2022).
  46. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, US, 27–30 June 2016; pp. 770–778. [Google Scholar]
  47. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  48. Wang, C.Y.; Liao, H.Y.M.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of CNN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, US, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  49. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  50. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Li, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar]
  51. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25 April 2022. [Google Scholar]
  52. Fu, J.; Liu, J.; Tian, H.; Li, Y.; Bao, Y.; Fang, Z.; Lu, H. Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3146–3154. [Google Scholar]
  53. Lee, Y.; Park, J. CenterMask: Real-Time Anchor-Free Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13903–13912. [Google Scholar]
  54. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  55. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  56. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond Empirical Risk Minimization. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  57. Zou, X.; Wu, Z.; Zhou, W.; Huang, J. YOLOX-PAI: An Improved YOLOX Version by PAI. arXiv 2022, arXiv:2208.13040. [Google Scholar]
  58. Sunkara, R.; Luo, T. No More Strided Convolutions or Pooling: A New CNN Building Block for Low-Resolution Images and Small Objects. arXiv 2022, arXiv:2208.03641. [Google Scholar]
Figure 1. Diagram of the network structure of the EYOLOX detector.
Figure 2. An illustration of the structure of CSPD Block. The features will be split into c1 and c2 in the channel dimension. Firstly, the feature represented by c1 will be passed into the Residual Dilated Block through the project layer to obtain a larger receptive field. Secondly, the feature represented by c2 will be used for calculating attention in the channel dimension and the spatial dimension, respectively, to focus on important information. Finally, the output features of the two parts are concatenated.
Figure 3. The structure diagram of the Focus Self-Attention Block. In the figure, the four small rectangles with different colors are the feature values of the Query feature interval. The features of Key and Value are treated in the same way as the Query features.
Figure 4. An illustration of the structure of the CSPFSA Block. The CSPNet structure is still used to enrich the gradient information and reduce the amount of computation: the upper part is the original structure of a common residual block, while the features of the lower part undergo the self-attention and channel attention calculations.
Figure 5. An illustration of the structure of CSPSPPF in comparison with the SPPCSPC structure. It can be seen that the parallel pooling layers of different sizes in SPPCSPC are replaced by serial pooling layers of the same size in CSPSPPF. The pink convolutional block represents a 3 × 3 convolution kernel with a stride of 1, and the orange convolutional block represents a 1 × 1 convolution kernel with a stride of 1.
Figure 6. An illustration of the structure of ERDHead. In the figure, we add the features in the encoder in the form of residual edges to the features used for classification and regression after the 3 × 3 convolution operation.
Figure 7. Comparison of training AP curves for the two data augmentation strategies applied to EYOLOX. (a) Without MixUp and with a reduced mosaic scale jittering range. (b) With MixUp and an enlarged mosaic scale jittering range.
Figure 8. AP curves of EYOLOX trained on PASCAL VOC 07+12 and evaluated on the validation set. The blue curve is the result of training for 165 epochs, and the red curve is the result of training for 300 epochs.
Figure 9. Comparison of the prediction results between EYOLOX (a,c), YOLOX-s (b) and YOLOv5-s (d).
Table 1. Ablation experiments of EYOLOX in terms of AP (%) on COCO-2017 val. All models are tested at 640 × 640 resolution with FP16 precision and batch = 1 on an NVIDIA A40. The FPS values in this table include post-processing.
Methods | AP (%) | AP50 | AP75 | APS | APM | APL | Parameters | GFLOPs | FPS
YOLOX-s [16] | 40.5 | 59.3 | 43.7 | 23.2 | 44.8 | 54.1 | 9M | 26.8G | 80
-PANet | 33.1 (−7.4) | 50.2 | 36.1 | 13.5 | 35.9 | 44.1 | 6.1M | 21.3G | 95
+CSPD | 37.0 (+3.9) | 55.0 | 39.3 | 16.2 | 39.7 | 57.5 | 8.3M | 23.5G | 88
+CSPFSA | 38.6 (+1.6) | 55.9 | 41.3 | 17.3 | 41.6 | 58.3 | 18.4M | 37.5G | 68
+CSPSPPF | 39.3 (+0.7) | 56.6 | 42.2 | 18.9 | 42.5 | 58.3 | 18.3M | 37.4G | 67
+SimASFF | 40.8 (+1.5) | 60.3 | 43.8 | 22.5 | 44.2 | 55.9 | 15.1M | 35.2G | 55
+ERDHead | 41.4 (+0.6) | 60.9 | 44.2 | 23.0 | 45.2 | 56.5 | 13.5M | 33.7G | 60
Table 2. Table of comparison experiments between the two operations. “A” represents the cheap operation, “B” represents the expensive operation and ESE is the channel attention mechanism module.
Methods | AP (%) | AP50 | AP75 | APS | APM | APL | Parameters | GFLOPs | FPS
A | 41.3 | 60.8 | 44.1 | 22.8 | 45.0 | 56.5 | 13.52M | 33.73G | 61
B | 41.3 | 60.9 | 44.3 | 22.7 | 44.8 | 56.6 | 13.46M | 33.76G | 53
A+ESE | 41.4 | 60.9 | 44.2 | 23.0 | 45.3 | 56.5 | 13.54M | 33.74G | 60
B+ESE | 41.5 | 60.9 | 44.5 | 24.2 | 45.3 | 56.8 | 13.48M | 33.77G | 53
Table 3. Comparison of different augmentation strategies in EYOLOX. “Scale Jit.” stands for the range of scale jittering for mosaic image. The “+” stands for the expensive operation mentioned above.
Methods | Scale Jit. | Extra Aug | AP (%)
EYOLOX | [0.5–1.5] | - | 41.4
EYOLOX | [0.1–2.0] | MixUp | 42.2 (+0.8)
EYOLOX + | [0.5–1.5] | - | 41.5
EYOLOX + | [0.1–2.0] | MixUp | 42.2 (+0.7)
Table 4. Comparison of the accuracy of different object detectors on COCO-2017 val. The methods marked with "*" are results reproduced in other papers. Unless otherwise annotated, all our results are obtained with 300 training epochs.
Methods | Size | AP (%) val | AP50 | AP75 | APS | APM | APL | Parameters
RetinaNet-R50 [15] * | 640 | 39.2 | -- | -- | -- | -- | -- | 34M
RetinaNet-R101 [15] * | 640 | 39.9 | -- | -- | -- | -- | -- | 53M
DETR-R50 500e [34] | 640 | 42.0 | 62.4 | 44.2 | 20.5 | 45.8 | 61.1 | 44M
YOLOF-R50 72e [35] | 640 | 41.6 | 60.5 | 45.0 | 22.4 | 46.2 | 57.6 | 44M
YOLOv5-s [19] | 640 | 37.4 | 57.1 | 40.3 | 21.2 | 42.3 | 49.0 | 7.2M
YOLOv7-tiny-SiLU [43] | 640 | 38.7 | 56.7 | 41.7 | 18.8 | 42.4 | 51.9 | 6.2M
EfficientDet-D1 [30] | 640 | 39.6 | 58.6 | 42.3 | 17.9 | 44.3 | 56.0 | 6.6M
YOLOv5-SPD-s [58] | 640 | 40.0 | 59.5 | 43.5 | 23.5 | 44.9 | 50.4 | 8.7M
YOLOv6-T 300e [42] | 640 | 40.3 | -- | -- | -- | -- | -- | 15.0M
YOLOv6-T 400e [42] | 640 | 41.1 | 57.5 | 44.3 | 21.0 | 45.9 | 58.1 | 15.0M
PAI-YOLOX-s [57] | 640 | 41.4 | 60.0 | 45.0 | 18.3 | 40.0 | 55.0 | 15.9M
YOLOX-s [16] | 640 | 40.5 | 59.3 | 43.7 | 23.2 | 44.8 | 54.1 | 9.0M
EYOLOX | 640 | 42.2 | 61.6 | 45.4 | 23.5 | 46.0 | 58.4 | 13.54M
Table 5. Comparison of the accuracy of different object detectors on the PASCAL VOC 07+12 test set. Unless otherwise annotated, all our results are obtained with 165 training epochs.
Methods | Size | AP (%) test | AP50 | AP75 | Parameters
YOLOv5-s [19] | 640 | 55.3 | 77.8 | 63.3 | 7.2M
YOLOX-s [16] | 640 | 59.3 | 79.5 | 65.4 | 9.0M
EYOLOX | 640 | 62.5 | 81.5 | 69.3 | 13.54M
