Article

SV-FPN: Small Object Feature Enhancement and Variance-Guided RoI Fusion for Feature Pyramid Networks

Qianhui Yang, Changlun Zhang, Hengyou Wang, Qiang He and Lianzhi Huo
1 School of Science, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
2 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Electronics 2022, 11(13), 2028; https://doi.org/10.3390/electronics11132028
Submission received: 18 May 2022 / Revised: 19 June 2022 / Accepted: 23 June 2022 / Published: 28 June 2022
(This article belongs to the Special Issue Computer Vision Techniques: Theory and Applications)

Abstract

Small object detection is one of the difficulties in object detection, and Feature Pyramid Networks (FPN) are a common feature extractor in deep learning; improving small object detection results based on FPN is therefore of great significance in this field. In this paper, SV-FPN is proposed for small object detection, consisting of Small Object Feature Enhancement (SOFE) and Variance-guided Region of Interest Fusion (VRoIF). When FPN is used as the feature extractor, an SOFE module is designed to enhance the finer-resolution feature maps from which small object features are extracted. VRoIF takes the variance of RoI features as the data driver to learn the completeness of RoI features from different feature levels, which avoids wasting information and introducing noise. Ablation experiments on three public datasets (KITTI, PASCAL VOC 07+12 and MS COCO 2017) demonstrate the effectiveness of SV-FPN, which achieves a mean Average Precision (mAP) of 41.5%, 53.9% and 38.3% on the three datasets, respectively.

1. Introduction

In recent years, with the advent of the era of big data, object detection based on deep learning has developed rapidly. Object detection is one of the basic tasks in computer vision and is widely used in many areas, such as medical imaging, social security, smart homes and transportation. However, small object detection remains challenging, mainly because: (1) small objects occupy fewer pixels, which results in indistinguishable features after passing through convolutional neural networks (CNNs); (2) the unbalanced distribution of small objects in a dataset skews the network training process; (3) typical network structure settings are not friendly to small objects. Therefore, many meaningful works have been proposed.
Initially, some works were proposed for multi-scale object detection tasks. Lin et al. proposed Feature Pyramid Networks (FPN) for multi-scale detection [1]. Singh et al. optimized the candidate boxes on each layer of the image pyramid to improve multi-scale detection accuracy [2]. Cao et al. used high-level features to enhance the semantic information of low-level features to improve multi-scale object detection [3]. Ren et al. introduced anchor boxes, which detect objects of different sizes effectively by designing anchors with different scales [4]. As technology has developed, more and more works have focused on improving small object detection. Hu et al. proposed Context-Aware Region of Interest (RoI) Pooling to prevent small objects from being distorted after RoI Pooling [5]. Kisantal et al. proposed a small object data augmentation method that oversamples, copies and pastes small objects to expand the number and distribution of small objects in the training set [6]. Li et al. used a Perceptual Generative Adversarial Network (PGAN) to map small low-resolution objects into super-resolved large objects [7]. Chen et al. proposed a loss function feedback mechanism in which, during training, the input is dynamically adjusted according to the proportion of small objects in the loss function so that the network can balance optimization across object sizes [8]. Singh et al. proposed Scale Normalization for Image Pyramids (SNIP), which selectively back-propagates the gradients of object instances of different sizes as a function of the image scale [2], and also proposed SNIPER to save training overhead and speed up network convergence [9]. These works were designed for small objects, but they may sometimes affect medium and large objects.
Our contributions are as follows:
First, we propose SV-FPN to improve small object detection without degrading the detection of medium or large objects.
Second, we design an SOFE module between C2 and P2 in FPN to enhance small object features.
Third, we propose VRoIF, which dynamically fuses RoI features guided by their variance to enhance object features.
Fourth, extensive empirical studies are performed to evaluate the proposed method. The results show the effectiveness of SV-FPN for small object detection.
This paper is organized as follows: related concepts are introduced in Section 2. Section 3 explains the proposed works in detail. Subsequently, some experiments are implemented to prove the superior implementation performances of our works in Section 4. Finally, Section 5 concludes this work.

2. Related Works

2.1. Small Object Detection

Small objects have smaller areas and fewer pixels than medium and large objects. There are two common definitions of a small object: one is that the object area is less than 32 × 32 pixels [10], and the other is that the ratio of the object area to the image area is between 0.08% and 0.58% [11].
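As a concrete illustration of the two definitions, the short sketch below checks a bounding box against both criteria; the function name and the example box are illustrative only and do not come from the paper's code.

```python
def is_small_object(box_w, box_h, img_w, img_h):
    """Check a box against the two common 'small object' definitions above."""
    area = box_w * box_h
    ratio = area / float(img_w * img_h)
    small_by_absolute_area = area < 32 * 32              # definition from [10]
    small_by_relative_area = 0.0008 <= ratio <= 0.0058   # 0.08%-0.58%, definition from [11]
    return small_by_absolute_area, small_by_relative_area

# Example: a 28 x 20 pixel box in a 1242 x 375 image (typical KITTI resolution)
# occupies 560 pixels (~0.12% of the image) and is small under both definitions.
print(is_small_object(28, 20, 1242, 375))  # (True, True)
```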
Object detection based on deep learning usually performs convolution-activation-pooling in the feature extraction stage, and pooling down-samples the features, resulting in the loss of object information. The feature information of small objects is therefore less than that of medium and large objects, and less available information leads to poor detection results. Consequently, many works have been proposed to improve small object detection from different aspects.

2.1.1. Data Augmentation

Data augmentation is widely used in object detection. CutOut [12], MixUp [13], CutMix [14] and Mosaic [15] are often used to transform input images and improve the robustness of the network. Kisantal et al. proposed a small object data augmentation method that expands the number and distribution of small objects in the training set to enrich the diversity of the dataset, thereby improving the detection performance of small objects [6]. Yu et al. proposed a scale matching strategy, which matches the scale of objects to narrow the gap between objects of different sizes and avoids the situation where small object information is easily lost in conventional scaling operations [16].

2.1.2. Multi-Scale Learning

In the shallow layers of CNNs, the receptive field is small, the semantic information is weak and the context information is limited, but the shallow layers have more spatial and detailed feature information. Liu et al. proposed SSD (Single Shot MultiBox Detector), which uses shallower features to detect smaller objects and deeper features to detect larger objects [17]. In order to save computing resources and obtain better feature fusion, Lin et al. proposed FPN (Feature Pyramid Network), which is currently the most popular multi-scale network [1]. The construction of FPN involves a bottom-up pathway, a top-down pathway and lateral connections, and FPN achieves feature enhancement by fusing the features of adjacent layers.

2.1.3. Contextual Learning

In real-world scenarios, there are usually relationships between objects and scenes and between objects themselves, and using these relationships helps improve the detection performance of small objects. Treating context utilization as an optimization problem, Barnea et al. discussed how much context or other types of additional information could improve detection scores and showed that simple co-occurrence relationships are the most effective context information [18]. Chen et al. proposed a hierarchical context embedding framework to enhance the feature representation of candidate regions by mining contextual cues [19]. Fu et al. modeled and inferred the inherent semantics and spatial layout relationships between objects and retained spatial information as much as possible when extracting the semantic features of small objects [20]. Pato et al. proposed a contextual rescoring algorithm, which uses RNNs (Recurrent Neural Networks) and self-attention to transfer information between candidate regions and generate contextual representations, and then uses the obtained context information to re-evaluate the detection results [21].

2.1.4. Generative Adversarial Learning

Generative adversarial learning aims to achieve the same detection performance as larger objects by mapping the features of small low-resolution objects into features equivalent to those of high-resolution objects. Li et al. proposed the Perceptual Generative Adversarial Network (PGAN) specifically for small object detection, which learns high-resolution feature representations of small objects by pitting a generator and a discriminator against each other [7]. Bai et al. proposed a multi-task generative adversarial network (MTGAN) for small objects; in MTGAN, the generator is a super-resolution network that can upsample small blurred images into fine images and recover detailed information for more accurate detection [22]. Noh et al. proposed a feature-level super-resolution method, which keeps the receptive fields of the generated high-resolution object features and the low-resolution features produced by the feature extractor consistent through dilated convolution, avoiding spurious features caused by mismatched receptive fields [23].

2.1.5. Other Strategies

Although these methods mentioned above can effectively improve the performance of small object detection, the performance gains brought by these methods are often limited by the computational cost. There are many excellent works that have improved small object detection from other aspects. Wang et al. proposed a guided anchoring strategy based on semantic features, which improved the performance of small object detection by simultaneously predicting the possible location of the object center and the scale and aspect ratio of the object [24]. Sun et al. proposed a multi-receptive field and small-object-focused weakly supervised segmentation network by using multiple receptive field blocks to focus on the object and its adjacent background and set weights according to different spatial positions to achieve enhanced feature availability [25].

2.2. Feature Pyramid Networks

The design goal of FPN is to use the pyramidal features generated by CNNs to construct a feature pyramid with multiple resolutions and to use this feature pyramid for multi-scale object detection. FPN is currently the most popular multi-scale network [1]. Figure 1 shows the structure of FPN, where {C2, C3, C4, C5} is the bottom-up pathway and {P2, P3, P4, P5} is the top-down pathway.
To improve detection performance, many meaningful works have been developed on FPN [1]. In order to facilitate the flow of information in the proposal-based instance segmentation framework, Liu et al. proposed PANet [26]. By analyzing the defects of FPN, AugFPN proposed targeted improvements, namely Consistent Supervision, Residual Feature Augmentation and Soft RoI Selection [27]. NAS-FPN proposed merging cells to re-combine the features extracted by FPN for better detection accuracy [28]. Auto-FPN used network search to automatically find the most suitable connections between different layers for better features [29]. A²-FPN used attention-guided feature aggregation to improve multi-scale feature learning [30].
Although there are many works based on FPN, they generally focus on the defects of the network and do not focus on improving small object detection based on FPN.

2.3. Preliminaries

In FPN, the image is down-sampled by CNNs to obtain C5, and the feature maps {P4, P3, P2} are generated by successively up-sampling P5. The bottom-up feature maps have accurately localized information because they are subsampled fewer times, so the feature maps generated from the top-down pathway are fused with the bottom-up pathway to enhance their position information. However, the dimensions (numbers of channels) of {C2, C3, C4, C5} are {256, 512, 1024, 2048}, while the dimensions of {P2, P3, P4, P5} are {256, 256, 256, 256}; therefore, 1 × 1 convolutions are used to reduce the dimensions of {C2, C3, C4, C5} before the two pathways are fused.
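As an illustration of this construction, the following is a minimal PyTorch sketch of the lateral connections and top-down pathway described above. The class name, nearest-neighbour upsampling and the 3 × 3 smoothing convolutions follow common FPN implementations and are assumptions here, not the authors' code.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN neck: 1x1 lateral convs reduce {C2..C5} to 256 channels,
    the top-down pathway upsamples and adds, and 3x3 convs smooth each level."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                      # feats = [C2, C3, C4, C5]
        laterals = [conv(f) for conv, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # top-down: coarse to fine
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode='nearest')
        return [conv(l) for conv, l in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]
```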
RoI feature extraction is a research hotspot. Anchor-based object detection uses RoI Pooling or RoI Align to obtain RoI features from the final feature map. When FPN is used as the feature extractor, there are four feature maps, so various RoI feature extraction strategies have been proposed. They follow two research directions: one is area assignment, which allocates each RoI to a certain FPN level according to its area and extracts features from that level; the other is feature fusion, which extracts RoI features from {P2, P3, P4, P5} and then fuses them in some way.
FPN proposed that RoIs of different scales be assigned to different feature pyramid levels according to Equation (1) [1]:
k = \left\lfloor k_0 + \log_2\left(\sqrt{wh}/224\right) \right\rfloor    (1)
Here, k is the assigned feature level in FPN, so the range of k is [2, 5]; k_0 is the level assigned to an RoI whose area is 224 × 224; w and h are the width and height of the RoI; and 224 is the canonical input size of ImageNet. It can be seen that small objects acquire their features from P2 in FPN. However, 224 is hard-coded and cannot adapt to variation in the input image scale, so Lee et al. proposed using the relative area of the RoI in the image to replace the absolute area in Equation (1), and a new RoI assignment function is defined as Equation (2) [31]:
k = k_{max} - \log_2\left(A_{input}/A_{RoI}\right)    (2)
Here, k is the assigned feature level in FPN, k_max is the highest level of FPN, A_RoI is the area of the RoI in the input image, and A_input is the area of the image. However, obtaining RoI features from a single feature level may not be optimal, since it may ignore useful information from other feature levels in FPN.
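A small sketch of how the two assignment rules can be evaluated is given below. The clamping to levels [2, 5] follows the text, k0 = 4 follows the FPN paper [1], and the rounding convention applied to Equation (2) is an assumption here.

```python
import math

def assign_level_eq1(w, h, k0=4, k_min=2, k_max=5):
    """Equation (1): absolute-scale RoI-to-level assignment from FPN [1]."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return min(max(k, k_min), k_max)

def assign_level_eq2(a_roi, a_input, k_min=2, k_max=5):
    """Equation (2): scale-adaptive assignment using the relative RoI area [31].
    The floor used here is only one possible rounding choice."""
    k = k_max - math.floor(math.log2(a_input / a_roi))
    return min(max(k, k_min), k_max)

# A 112 x 112 RoI is assigned to P3 by Equation (1): floor(4 + log2(112/224)) = 3.
print(assign_level_eq1(112, 112))  # 3
```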
Considering that the importance of features may not be strongly correlated with the pyramid levels they belong to, PANet proposed adaptive feature pooling, which pools features from {P2, P3, P4, P5} and fuses them with a max operation [26]. Because the max operation may ignore useful information and the fully connected layer in adaptive feature pooling introduces additional parameters, AugFPN proposed adaptive spatial fusion, which uses the result of global average pooling as the initial channel attention weight, applies the learned weights to the corresponding features and finally adds the four features pixel by pixel [27]. PANet and AugFPN fuse the features to make full use of them but do not consider whether the fusion method is suitable for the task. In addition, an excessive difference between the input and output sizes of RoI Align results in perturbed features.
Therefore, this paper proposes VRoIF. Firstly, the variance of RoI features replaces the max operation or average pooling of previous methods to obtain the initial weights of the features on {P2, P3, P4, P5}. At the same time, in order to ensure the integrity of the RoI feature, the result of the area assignment is weight-added to the fused features.

3. Materials and Methods

3.1. The Overall Structure of SV-FPN

Figure 2 shows the structure of SV-FPN. A Small Object Feature Enhancement (SOFE) module and Variance-guided RoI Fusion (VRoIF) are embedded into FPN to improve the detection of small objects. The SOFE module is added to the lateral connection between C2 and P2 and is designed to enhance P2, from which small object features are extracted. VRoIF allows each RoI to use variance as the weight initialization to fuse its features from different feature levels in FPN, and the result of the area assignment is weight-added to ensure feature integrity.

3.2. Small Object Feature Enhancement

Convolution extracts certain specified features of an image, such as textures and gradients. Convolution kernels come in different sizes; a larger kernel has a larger receptive field and can obtain more comprehensive features. The roles of 1 × 1 convolution are to: (1) enable cross-channel interaction and information integration; (2) increase nonlinearity; (3) reduce model parameters and computation. It is worth noting that stacking multiple small convolution kernels can achieve the receptive field of a large convolution kernel with more nonlinearity and fewer parameters.
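The saving from stacking kernels can be checked directly. In the sketch below, the channel width of 256 is only an example; the comparison shows that two 3 × 3 layers cover the same 5 × 5 receptive field with roughly 28% fewer weights and one extra ReLU.

```python
import torch.nn as nn

C = 256  # example channel width
conv5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
two_conv3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),  # extra nonlinearity between the stacked kernels
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

print(sum(p.numel() for p in conv5x5.parameters()))      # 25 * C * C = 1,638,400
print(sum(p.numel() for p in two_conv3x3.parameters()))  # 18 * C * C = 1,179,648
```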
Small object features are extracted from P2 in FPN, which is obtained by adding the lateral connection of C2 and the up-sampled P3. C2 has accurate localization but simple semantic information, and the lack of deep convolution brings noise into P2. Therefore, it is necessary to design a deep convolution block for C2 to improve its abstract semantic information and reduce redundant detail.
We propose a Small Object Feature Enhancement (SOFE) module, which enhances small object features based on FPN; the flowchart of SOFE is shown in Figure 3. As illustrated in Figure 4, the SOFE module consists of 9 CBR modules divided into 3 branches. A CBR module consists of three parts, namely a convolution (CONV), batch normalization (BN) and ReLU, and CBRn denotes a CBR module with an n × n convolution kernel. The convolution kernels in the second CBR module of each branch are set to different sizes: the 1 × 1 branch obtains inter-channel correlation information and retains small object features, the 3 × 3 branch extracts features, and the 5 × 5 branch captures the context information around the object, which is helpful for detection. The pseudocode of SOFE is shown in Algorithm 1. It is worth noting that, in order to reduce model complexity and training time, the convolution kernels in the first and third CBR modules of each branch are set to 1 × 1, and the 5 × 5 convolution kernel is replaced by two stacked 3 × 3 kernels.
By adding the SOFE module to FPN, with deep convolution to extract abstract features and positional information, P 2 is much more friendly to small object detection.
Algorithm 1 Small Object Feature Enhancement.
Input: feature map C2
Output: enhanced C2
1: Enhance C2:
2:    F1 ← CBR1(CBR1(CBR1(C2)))
3:    F2 ← CBR1(CBR3(CBR1(C2)))
4:    F3 ← CBR1(CBR5(CBR1(C2)))
5: Feature fusion:
6:    F ← F1 + F2 + F3
7: Convolution:
8:    F ← Conv(F)
9: Residual connection:
10:   Enhanced C2 ← F + C2
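A PyTorch sketch of the SOFE module following Algorithm 1 and Figure 4 is given below. The channel width, the kernel size of the final fusion convolution and other layer details are assumptions, since only the branch structure is specified above; the 5 × 5 kernel is realised as two stacked 3 × 3 kernels as described in the text.

```python
import torch.nn as nn

def cbr(channels, k):
    """CBR block: Conv -> BN -> ReLU; CBRn uses an n x n kernel."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
    )

class SOFE(nn.Module):
    """Three branches of CBR1-CBRn-CBR1 (n = 1, 3, 5), summed, convolved,
    and added back to C2 through a residual connection."""
    def __init__(self, channels=256):
        super().__init__()
        self.branch1 = nn.Sequential(cbr(channels, 1), cbr(channels, 1), cbr(channels, 1))
        self.branch3 = nn.Sequential(cbr(channels, 1), cbr(channels, 3), cbr(channels, 1))
        self.branch5 = nn.Sequential(cbr(channels, 1), cbr(channels, 3),  # two 3x3 kernels
                                     cbr(channels, 3), cbr(channels, 1))  # replace one 5x5
        self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, c2):
        f = self.branch1(c2) + self.branch3(c2) + self.branch5(c2)
        return self.fuse(f) + c2  # enhanced C2
```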

3.3. Variance-Guided RoI Fusion

For the first RoI assignment method, the correspondence between RoIs and FPN levels depends on Equation (1). Table 1 presents the range of RoI areas assigned to {P2, P3, P4, P5}, where A_RoI represents the area of an RoI in the input image and A'_RoI represents the area of the RoI in the feature maps.
The output size of RoI Align is usually 7 × 7 or 14 × 14. It can be seen from the last column of Table 1 that the relative size of the RoIs assigned to each FPN level of {P2, P3, P4, P5} is close to the output size of RoI Align, which provides complete features.
For the feature-fusion method, small objects absorb features from surrounding objects in the low-resolution feature maps, which have strong semantics, and using max pooling or average pooling as the weight initialization gives those levels larger fusion weights. Meanwhile, details of large objects in the high-resolution feature maps are lost, resulting in incomplete features.
Parameter initialization has a great impact on the model: a good initialization can speed up the convergence of the network and improve its generalization ability. Using the result of max pooling or average pooling as the initial parameters may not lead to a better result. Therefore, we want to find a statistic that measures the completeness and richness of the features and use it as the initial parameter.
In statistics, variance measures the difference between each observation and the population mean. Abstract object features should show obvious variation in order to provide more information for detection tasks, so it is worth considering variance as a statistic to measure the completeness of an RoI feature.
In order to verify the relationship between the variance of object features and the detection results, we applied variance statistics to the object features of networks with different mean Average Precision (mAP) values. Taking Faster R-CNN, Mask R-CNN and Cascade R-CNN as experimental networks and using ResNet50 and ResNet101 as backbone networks, the statistical results are shown in Table 2.
In Table 2, 1× is the result of 12 training epochs, and 2× is the result of 24 training epochs. The first column under ResNet50 and ResNet101 is the mAP, and the second column is the mean variance of the regression box features in the validation set of MS COCO 2017. It can be seen from Table 2 that in the same network structure, the longer the training, the higher the accuracy, and the larger the feature variance of the regression box. Therefore, the feature map with large variance is more friendly to object detection.
Based on the above analysis, variance can effectively describe the completeness and richness of a feature. Therefore, we propose a Variance-guided RoI Fusion (VRoIF) module, which is shown in Figure 5; the flowchart of VRoIF is shown in Figure 6. VRoIF uses the variance of RoI features as a data driver to obtain the weights of the RoI features from {P2, P3, P4, P5}, and the weighted RoI features are fused to obtain F_var. The calculation of the variance of a feature map is shown in Equation (3), where x_i is a pixel sample value of the RoI feature, x̄ is the average value of the pixel samples in the RoI feature, and n is the number of pixel samples.
S^2 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n-1}    (3)
Considering the size difference between the input and output of RoI Align, F_var is easily disturbed by adjacent objects, so the result of the area assignment, F_area, is added to F_var. For each RoI, VRoIF first extracts RoI features from {P2, P3, P4, P5}, and the area assignment result F_area is obtained by Equation (1). Secondly, the variance of the RoI features is calculated and used as the original data to drive the learning of the RoI feature weights. Thirdly, the learned weights are multiplied with the corresponding RoI features, and a convolution is then applied to fuse the RoI features. Finally, F_var and F_area are weight-added to obtain the RoI feature F_RoI by Equation (4):
F_{RoI} = \alpha F_{area} + (1 - \alpha) F_{var}    (4)
Here, α is a hyper-parameter in the range (0, 1), and the detection result is optimal when α is 0.5. The pseudocode of VRoIF is shown in Algorithm 2.
Algorithm 2 Variance-guided RoI Fusion.
Input: an RoI of size w × h, feature maps {P2, P3, P4, P5}
Output: RoI feature F_RoI
1: k ← ⌊k0 + log2(√(wh)/224)⌋
2:    F_area ← RoI feature extracted from P_k
3: F ← concatenation of RoI features extracted from {P2, P3, P4, P5}
4: Weight ← Conv(Conv(Var(F)))
5: F_var ← Conv(Weight · F)
6: F_RoI ← α·F_area + (1 − α)·F_var
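A minimal PyTorch sketch of VRoIF following Algorithm 2 and Figure 5 is shown below. The 7 × 7 RoI feature size, the 1 × 1 convolutions in the weight branch and the sigmoid activation are assumptions, since the exact layer configuration is not spelled out here.

```python
import torch
import torch.nn as nn

class VRoIF(nn.Module):
    """Variance-guided RoI Fusion: per-channel variance of the concatenated
    RoI features drives learned fusion weights; the fused feature F_var is
    then weight-added with the area-assignment feature F_area (Equation (4))."""
    def __init__(self, channels=256, levels=4, alpha=0.5):
        super().__init__()
        self.alpha = alpha
        c = channels * levels
        self.weight_net = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                                        nn.Conv2d(c, c, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(c, channels, 1)

    def forward(self, roi_feats, f_area):
        # roi_feats: list of four (N, C, 7, 7) RoI features, one per FPN level.
        f = torch.cat(roi_feats, dim=1)                    # (N, 4C, 7, 7)
        var = f.flatten(2).var(dim=-1, unbiased=True)      # per-channel variance, (N, 4C)
        weight = self.weight_net(var[:, :, None, None])    # learned fusion weights
        f_var = self.fuse(weight * f)                      # (N, C, 7, 7)
        return self.alpha * f_area + (1 - self.alpha) * f_var
```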

4. Experiments

4.1. Datasets and Evaluation Metrics

In the experiments, we used three public datasets: KITTI [32], PASCAL VOC 07+12 [33] and MS COCO 2017. KITTI consists of more than 7K images, of which 3712 are in the training set and 3769 are in the validation set, covering 7 categories. PASCAL VOC 07+12 includes around 16K training images and 4952 validation images, covering 20 categories. MS COCO 2017 contains 118K training images and 5K validation images, covering 80 categories.
On the KITTI dataset, we use the standard evaluation metric proposed in the PASCAL Visual Object Classes challenge, mAP, with the IoU threshold set to 0.5. On PASCAL VOC 07+12 and MS COCO 2017, we follow the standard evaluation metrics proposed in MS COCO: mAP, AP50, AP75, APS, APM and APL. mAP is calculated over 10 IoU thresholds (IoU = 0.50, 0.55, …, 0.95), AP50 and AP75 are the AP values at IoU thresholds of 0.50 and 0.75, and APS, APM and APL describe the detection performance for different object scales.

4.2. Implementations Details

All experiments were implemented with MMDetection [34] and run on Ubuntu 18.04, PyTorch v1.6.0, CUDA 10.1 and cuDNN 7.1.4. We trained models on 1 GPU (2 images per GPU) for 12 epochs. The selected baseline network is Faster R-CNN, the initial learning rate was set to 0.0025, and it was decayed by a factor of 0.1 after the 8th and 11th epochs, respectively. All other hyper-parameters remain the same as in MMDetection.
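For reference, a hypothetical MMDetection-style configuration fragment matching this schedule might look as follows. The field names follow the MMDetection 2.x convention, and the momentum, weight decay and warmup values are assumed framework defaults rather than settings reported by the authors.

```python
# Hypothetical schedule fragment (MMDetection 2.x style), not the authors' config.
optimizer = dict(type='SGD', lr=0.0025, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(policy='step', warmup='linear', warmup_iters=500,
                 warmup_ratio=0.001, step=[8, 11])   # decay by 0.1 after epochs 8 and 11
runner = dict(type='EpochBasedRunner', max_epochs=12)
data = dict(samples_per_gpu=2, workers_per_gpu=2)    # 1 GPU, 2 images per GPU
```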

4.3. Experiment Results and Analysis

4.3.1. Weight Fusion Operation Selection

Max pooling and average pooling are widely used fusion operations in deep learning. PANet proposed adaptive feature pooling, which uses max pooling as the RoI feature fusion method, and AugFPN proposed Soft RoI Selection, which uses average pooling as the initial weight for RoI feature fusion. However, max pooling and average pooling are not suitable for all feature fusion situations. Therefore, in order to verify the superiority of the variance-guided method, we take max pooling, average pooling and variance as the weight initialization and conduct experiments on PASCAL VOC 07+12. The experimental results are shown in Table 3.
As can be seen from Table 3, except for ARL, all evaluation metrics of the model initialized with variance-guided fusion outperform those of max pooling and average pooling. In small object detection, the performance of variance-guided fusion is particularly notable: compared with average pooling, APS is increased by 4.1% and ARS by 2.6%; compared with max pooling, APS is increased from 24.7% to 25.0% and ARS from 33.4% to 37.3%. AP measures the regression quality of the predicted boxes, and AR measures the number of detected objects. Therefore, the experimental data indicate that using the variance-guided option in RoI feature fusion not only improves the regression quality of the predicted boxes but also detects many more small objects. It also shows that weights based on variance-guided fusion are better than max pooling and average pooling.

4.3.2. Proportional Statistics

In order to analyze the proportions of RoI features of different sizes drawn from different feature levels in FPN, we count the sources of F_var. Figure 7 shows that RoIs of all sizes exploit features from several different levels under variance-guided fusion. The blue polyline represents small objects that were originally assigned to P2 in FPN; nearly 48% of their features come from P2 under variance-guided fusion, which indicates that the integrity of features is useful for detection. The green and red polylines represent large objects that were originally assigned to P4 and P5 in FPN, and their peaks are slightly higher than those of the other two polylines, which indicates that P4 and P5 have stronger semantic information. The peaks of all four polylines appear at the level given by Equation (1), which is consistent with the above analysis.

4.3.3. Hyper-Parameter

In order to explore the influence of α in Equation (4) on detection accuracy, we set α to {0, 0.1, 0.2, …, 0.8, 0.9} and conduct experiments on PASCAL VOC 07+12 to find the optimal value. The results are shown in Table 4. When α is 0.5, mAP reaches its optimal value, and when α is 0.6, APS reaches its optimal value. It is worth noting that the results with non-zero α are better than with α = 0, which verifies that the result of the area assignment is beneficial to detection.

4.3.4. Component Ablation Studies

In order to verify the effectiveness of SV-FPN, ablation experiments are performed in this section. We first take Faster R-CNN + FPN as the baseline because the two modules in SV-FPN are built on FPN. Then, we replace the lateral connection between C2 and P2 with the SOFE module to enhance all small object features; at the same time, the features of all medium objects and a few large objects are also enhanced. After that, VRoIF is added to fuse RoI features from different FPN levels, which captures richer RoI features for multi-scale objects. Finally, SV-FPN, which combines the SOFE module and VRoIF, is used as the feature extractor of the detection network. Table 5, Table 6 and Table 7 present the ablation results on the public datasets KITTI, PASCAL VOC 07+12 and MS COCO 2017, respectively.
Objects in the KITTI dataset are mostly medium and small. It can be seen from Table 5 that after adding the SOFE module, the AP of most categories increases by 0.3–7.4%, except for the van, and mAR increases by 3.2%. After adding the VRoIF module, apart from a slight decrease in the AP of the car and a few categories remaining the same as the baseline, the remaining categories improve to varying degrees; in particular, the AP of the tram increases by 8.3%, and mAR increases from 56% to 58.7%. Finally, when both modules are added, that is, when SV-FPN is used as the feature extractor, mAP and mAR improve greatly, from 38.8% to 41.5% and from 56% to 60.6%, respectively. The improvement in mAP means that the predicted boxes are more accurate and classification accuracy is improved; the improvement in mAR means that more objects are detected. Therefore, SV-FPN improves detection accuracy and reduces the missed detection rate.
It can be seen from Table 6 that the SOFE module significantly improves the accuracy of small objects on PASCAL VOC 07+12. AP S is increased from 22.4% to 24.4%, and mAP is increased by 0.6%, which reveals that the SOFE module has a positive impact on small object detection. In addition, the VRoIF module improves AP S by 4.1%, AR S by 3.0%, and mAR by 0.7%, which verifies that the VRoIF module can result in significant performance improvements for small objects. When both the SOFE module and VRoIF module are added to the network, mAP reaches the optimal value of 53.9%, and mAR also reaches 63.8%, which is 1.7% higher than that of the baseline network. However, the small object detection result of SV-FPN is worse than that of adding only one module.
It can be seen from Table 7 that the performance of the SOFE module and the VRoIF module on MS COCO 2017 is consistent with their performance on PASCAL VOC 07+12: the two modules improve APS by 0.8% and 1.0%, respectively, and ARS also improves. Combining the two modules further improves the overall detection ability of the network, increasing mAP to 38.3%, although APS is again worse than when adding only one module.
From the above experimental results, it can be seen that VRoIF has improved multi-scale object detection results on three public datasets, which proves that VRoIF is better than the baseline method and effectively improved the multi-scale object prediction accuracy. SV-FPN, which embeds the SOFE module and VRoIF into FPN, achieved the best detection accuracy in all ablation experiments on the three datasets: the baseline network mAP is improved by 2.7%, 1.6% and 0.8%, respectively, proving that SV-FPN not only improves the small objects but also improves the detection effect of the medium and the large objects to a certain extent.
In addition, Figure 8 shows the mAP trend of FPN and SV-FPN during training. The mAP of SV-FPN outperforms FPN from the beginning of training, with an obvious lead. Only when training becomes relatively stable are the mAPs of the two models close. However, after the learning rate decays, the mAP of SV-FPN is again clearly better than that of FPN, and the lead stabilizes as the network converges.
The ablation results show that the improvement from the SOFE module is mainly attributed to small object detection, while VRoIF is not limited to small objects and also improves multi-scale detection results. When the two components are used simultaneously, SV-FPN achieves the best overall performance, although its APS is worse than when using only the SOFE module or VRoIF. This is mainly because the two proposed components are close to each other in the detection network, which causes disturbance to small object features.

4.3.5. Example Display

For a fair observation and to verify the improvement of SV-FPN in small object detection, we visualized the detection results on the validation sets of three public datasets: KITTI, PASCAL VOC 07+12 and MS COCO 2017, and selected some images to show the differences between the baseline and SV-FPN in different scenarios.
The visualization results for the three datasets are shown in Figure 9, Figure 10 and Figure 11. In each figure, the first row shows the results of the baseline and the last row shows the results of Faster R-CNN + SV-FPN, so the two can be compared directly.
As shown in Figure 9, in the traffic scene image, it can be clearly seen that compared with the baseline detection results, SV-FPN can accurately detect small objects that are difficult to identify not only in the distance but also close by. The SOFE module and VRoIF in this scene are beneficial to improve the detection effects of small objects.
As shown in Figure 10 and Figure 11, in the multi-scene images of PASCAL VOC 07+12 and MS COCO 2017, SV-FPN significantly improves the detection results for all kinds of objects, such as persons and cars. Compared with FPN, the visual results of SV-FPN are better. Although the SOFE module and VRoIF improve small object detection from different angles, the visual results show that they effectively detect small objects that are missed by the baseline. These visualization results verify the effectiveness of SV-FPN.

5. Conclusions

In this paper, we propose SV-FPN to improve the detection accuracy of small objects. Firstly, we design the SOFE module, placed in the lateral connection between C2 and P2, to enhance small object features. Then, in order to enhance the RoI features of small objects, VRoIF uses variance as the data driver to adaptively weight the RoI features from different FPN levels and obtain the weighted RoI feature F_var, and the result of the area assignment method, F_area, is weight-added to ensure the integrity of the object features. Experiments on KITTI, PASCAL VOC 07+12 and MS COCO 2017 show that SV-FPN improves the detection accuracy and visual results for small objects.

Author Contributions

Conceptualization, Q.Y. and C.Z.; methodology, Q.Y.; software, Q.Y.; validation, Q.Y., C.Z. and H.W.; formal analysis, Q.Y. and C.Z.; writing—original draft preparation, Q.Y., C.Z. and H.W.; writing—review and editing, C.Z., Q.H. and L.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (nos. 62072024, 41971396), R&D Program of Beijing Municipal Education Commission (KM202210016002), the Projects of Beijing Advanced Innovation Center for Future Urban Design (no. UDC2019033324, UDC2017033322), and the Fundamental Research Funds for Municipal Universities of Beijing University of Civil Engineering and Architecture (nos. X20084, ZF17061).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
2. Singh, B.; Davis, L.S. An analysis of scale invariance in object detection - SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3578–3587.
3. Cao, G.; Xie, X.; Yang, W.; Liao, Q.; Shi, G.; Wu, J. Feature-Fused SSD: Fast detection for small objects. In Proceedings of the Ninth International Conference on Graphic and Image Processing (ICGIP 2017), International Society for Optics and Photonics, Qingdao, China, 14–16 October 2018; Volume 10615, p. 106151E.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28.
5. Park, H.; Sjosund, L.; Yoo, Y.; Monet, N.; Bang, J.; Kwak, N. SINet: Extreme lightweight portrait segmentation networks with spatial squeeze module and information blocking decoder. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–5 March 2020; pp. 2066–2074.
6. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296.
7. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1222–1230.
8. Chen, Y.; Zhang, P.; Li, Z.; Li, Y.; Zhang, X.; Qi, L.; Sun, J.; Jia, J. Dynamic Scale Training for Object Detection. arXiv 2020, arXiv:2004.12432.
9. Singh, B.; Najibi, M.; Davis, L.S. SNIPER: Efficient multi-scale training. Adv. Neural Inf. Process. Syst. 2018, 31.
10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
11. Chen, C.; Liu, M.Y.; Tuzel, O.; Xiao, J. R-CNN for small object detection. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; pp. 214–230.
12. DeVries, T.; Taylor, G.W. Improved regularization of convolutional neural networks with cutout. arXiv 2017, arXiv:1708.04552.
13. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412.
14. Yun, S.; Han, D.; Oh, S.J.; Chun, S.; Choe, J.; Yoo, Y. CutMix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6023–6032.
15. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
16. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 4–8 January 2020; pp. 1257–1265.
17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37.
18. Barnea, E.; Ben-Shahar, O. Exploring the bounds of the utility of context for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7412–7420.
19. Chen, Z.M.; Jin, X.; Zhao, B.; Wei, X.S.; Guo, Y. Hierarchical context embedding for region-based object detection. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 633–648.
20. Fu, K.; Li, J.; Ma, L.; Mu, K.; Tian, Y. Intrinsic relationship reasoning for small object detection. arXiv 2020, arXiv:2009.00833.
21. Pato, L.V.; Negrinho, R.; Aguiar, P.M. Seeing without looking: Contextual rescoring of object detections for AP maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 14610–14618.
22. Bai, Y.; Zhang, Y.; Ding, M.; Ghanem, B. SOD-MTGAN: Small object detection via multi-task generative adversarial network. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 206–221.
23. Noh, J.; Bae, W.; Lee, W.; Seo, J.; Kim, G. Better to follow, follow to be better: Towards precise supervision of feature super-resolution for small object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 9725–9734.
24. Wang, J.; Chen, K.; Yang, S.; Loy, C.C.; Lin, D. Region proposal by guided anchoring. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2965–2974.
25. Sun, S.; Yin, Y.; Wang, X.; Xu, D.; Zhao, Y.; Shen, H. Multiple receptive fields and small-object-focusing weakly-supervised segmentation network for fast object detection. arXiv 2019, arXiv:1904.12619.
26. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
27. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. AugFPN: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604.
28. Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045.
29. Xu, H.; Yao, L.; Zhang, W.; Liang, X.; Li, Z. Auto-FPN: Automatic network architecture adaptation for object detection beyond classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6649–6658.
30. Hu, M.; Li, Y.; Fang, L.; Wang, S. A2-FPN: Attention aggregation based feature pyramid network for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15343–15352.
31. Lee, Y.; Park, J. CenterMask: Real-time anchor-free instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13906–13915.
32. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237.
33. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
34. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155.
Figure 1. Illustration of FPN, with predictions made independently at all levels. {C2, C3, C4, C5} is the bottom-up pathway, {P2, P3, P4, P5} is the top-down pathway, and the 1 × 1 convolutions are the lateral connections.
Figure 2. Illustration of SV-FPN. (a) Small object feature enhancement module. (b) Variance-guided RoI Fusion.
Figure 3. The flowchart of SOFE.
Figure 4. (a) The structure of the SOFE module. (b) The details of the SOFE module.
Figure 5. Illustration of VRoIF. F_area is the result of the area assignment. Var is the variance of each channel of the concatenated RoI features. Weight represents the completeness of each channel of the concatenated RoI features. F_var is the weighted RoI feature. F_RoI is the result of VRoIF.
Figure 6. The flowchart of VRoIF.
Figure 7. Ratio of RoI features from different feature levels. Each line represents a set of proposals that should be assigned to the same feature level in FPN, i.e., proposals with similar scales. The horizontal axis denotes the ratio of features. It shows that RoIs with different sizes all exploit features from several different levels.
Figure 8. Accuracy analysis of the proposed SV-FPN and FPN on COCO. The blue polyline is SV-FPN, and the orange polyline is FPN.
Figure 9. Comparison of visual results between our model and FPN on the KITTI dataset.
Figure 10. Comparison of visual results between our model and FPN on the PASCAL VOC 07+12 dataset.
Figure 11. Comparison of visual results between our model and FPN on the MS COCO 2017 dataset.
Table 1. The area ranges of the RoI assignment. 'Related to Input Image' is the RoI area in the image, and 'Related to Feature Maps' is the RoI area in the feature maps.
FPN Level | Related to Input Image | Related to Feature Maps
P5 | A_RoI ≥ 448² | A'_RoI ≥ 14²
P4 | 224² ≤ A_RoI < 448² | 14² ≤ A'_RoI < 28²
P3 | 112² ≤ A_RoI < 224² | 14² ≤ A'_RoI < 28²
P2 | A_RoI < 112² | A'_RoI < 28²
Table 2. Statistical results of object variance, exploring the relationship between the variance of object features and mAP.
Network | Faster R-CNN | Faster R-CNN | Mask R-CNN | Mask R-CNN | Cascade R-CNN | Cascade R-CNN
Backbone | ResNet50 | ResNet101 | ResNet50 | ResNet101 | ResNet50 | ResNet101
Schedule | mAP / Variance | mAP / Variance | mAP / Variance | mAP / Variance | mAP / Variance | mAP / Variance
1× | 0.374 / 1.016 | 0.394 / 1.109 | 0.382 / 1.444 | 0.400 / 1.571 | 0.403 / 2.079 | 0.420 / 2.311
2× | 0.384 / 1.739 | 0.398 / 1.855 | 0.392 / 2.685 | 0.408 / 2.931 | 0.410 / 2.877 | 0.425 / 3.039
Table 3. Effect of weight initialization of RoI feature fusion. Detection performance is relatively robust to weight initialization of RoI feature fusion.
Method | mAP | AP50 | AP75 | APS | APM | APL | mAR | ARS | ARM | ARL
Soft RoI Selection (average pooling) | 0.526 | 0.837 | 0.574 | 0.209 | 0.389 | 0.574 | 0.621 | 0.347 | 0.490 | 0.667
Adaptive Feature Pooling (max pooling) | 0.529 | 0.836 | 0.576 | 0.247 | 0.387 | 0.578 | 0.626 | 0.334 | 0.491 | 0.672
Variance-guided | 0.532 | 0.842 | 0.585 | 0.250 | 0.394 | 0.579 | 0.626 | 0.373 | 0.497 | 0.670
Table 4. Effect of the value of α. The proportion of F_area has an impact on the detection.
α | 0 | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9
mAP | 0.523 | 0.528 | 0.526 | 0.524 | 0.530 | 0.532 | 0.530 | 0.527 | 0.527 | 0.526
AP50 | 0.835 | 0.836 | 0.841 | 0.838 | 0.840 | 0.842 | 0.843 | 0.838 | 0.841 | 0.838
AP75 | 0.574 | 0.576 | 0.568 | 0.570 | 0.580 | 0.585 | 0.577 | 0.577 | 0.576 | 0.575
APS | 0.224 | 0.261 | 0.222 | 0.249 | 0.230 | 0.250 | 0.265 | 0.260 | 0.242 | 0.246
APM | 0.387 | 0.390 | 0.395 | 0.387 | 0.386 | 0.394 | 0.401 | 0.395 | 0.400 | 0.393
APL | 0.571 | 0.576 | 0.574 | 0.570 | 0.578 | 0.579 | 0.576 | 0.574 | 0.573 | 0.574
Table 5. Effect of each component. Results are reported on KITTI. FPN: Feature Pyramid Networks. SOFE: Small Object Feature Enhancement. VRoIF: Variance-guided RoI Fusion.
FPN | SOFE | VRoIF | mAP | Car | Van | Truck | Pedestrian | Cyclist | Tram | Misc | mAR
✓ | | | 0.388 | 0.879 | 0.310 | 0.128 | 0.597 | 0.513 | 0.194 | 0.093 | 0.560
✓ | ✓ | | 0.409 | 0.882 | 0.307 | 0.202 | 0.604 | 0.522 | 0.240 | 0.106 | 0.592
✓ | | ✓ | 0.412 | 0.876 | 0.311 | 0.207 | 0.597 | 0.513 | 0.277 | 0.100 | 0.587
✓ | ✓ | ✓ | 0.415 | 0.882 | 0.310 | 0.212 | 0.599 | 0.501 | 0.284 | 0.116 | 0.606
Table 6. Effect of each component. Results are reported on PASCAL VOC 07+12. FPN: Feature Pyramid Networks. SOFE: Small Object Feature Enhancement. VRoIF: Variance-guided RoI Fusion.
FPN | SOFE | VRoIF | mAP | AP50 | AP75 | APS | APM | APL | mAR | ARS | ARM | ARL
✓ | | | 0.523 | 0.835 | 0.574 | 0.224 | 0.387 | 0.571 | 0.621 | 0.359 | 0.490 | 0.667
✓ | ✓ | | 0.528 | 0.838 | 0.576 | 0.244 | 0.393 | 0.576 | 0.626 | 0.341 | 0.501 | 0.671
✓ | | ✓ | 0.530 | 0.843 | 0.577 | 0.265 | 0.401 | 0.576 | 0.626 | 0.389 | 0.498 | 0.670
✓ | ✓ | ✓ | 0.539 | 0.841 | 0.593 | 0.237 | 0.398 | 0.587 | 0.638 | 0.397 | 0.504 | 0.684
Table 7. Effect of each component. Results are reported on COCO 2017. FPN: Feature Pyramid Networks. SOFE: Small Object Feature Enhancement. VRoIF: Variance-guided RoI Fusion.
FPN | SOFE | VRoIF | mAP | AP50 | AP75 | APS | APM | APL | mAR | ARS | ARM | ARL
✓ | | | 0.375 | 0.583 | 0.406 | 0.213 | 0.410 | 0.489 | 0.514 | 0.322 | 0.552 | 0.652
✓ | ✓ | | 0.376 | 0.584 | 0.407 | 0.221 | 0.412 | 0.485 | 0.516 | 0.323 | 0.557 | 0.653
✓ | | ✓ | 0.377 | 0.588 | 0.405 | 0.223 | 0.412 | 0.490 | 0.518 | 0.325 | 0.560 | 0.658
✓ | ✓ | ✓ | 0.383 | 0.587 | 0.414 | 0.217 | 0.420 | 0.495 | 0.524 | 0.329 | 0.565 | 0.661
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
