#### *3.3. Feature Fusion*

Faster R-CNN extracts features through a convolutional neural network to generate the feature maps used by the RPN and by the object classification and bounding-box regression heads. The quality of the feature map greatly affects the performance of the whole object detection algorithm. Faster R-CNN uses the VGG-16 network as its base network and produces the final feature map after the fifth convolution layer. Although this feature map captures high-level semantic information through the deep convolution layers, its high degree of abstraction discards a lot of detailed texture information, resulting in inaccurate object localization and difficulty in detecting small objects.

To address these problems, the AF R-CNN model proposed in this paper fuses features from different convolutional layers. Given an image, AF R-CNN uses the pre-trained VGG-16 model to compute feature maps. Shallow and deep CNN features are complementary for object detection. Figure 5 shows the detailed process of feature fusion. We fused the feature maps from the third, fourth, and fifth convolution layers (conv\_3, conv\_4, conv\_5). These feature maps have different resolutions because of the subsampling and pooling operations. To obtain feature maps of the same resolution, we adopted different sampling strategies: for the shallow layer, we applied max pooling to subsample; for the deep layer, we applied a deconvolution to upsample; and the fourth convolutional layer was kept at its original size. We then normalized the feature maps using local response normalization (LRN) [32] and passed each one through an additional convolution layer. Finally, the feature maps were concatenated to form a new feature map. The new feature map contains both shallow and deep layer information, which is effective for object detection, and its resolution is better suited to detection.

**Figure 5.** Detailed process of feature fusion. We fused feature maps from the third, fourth, and fifth convolution layers (conv\_3, conv\_4, conv\_5).
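To make the fusion step concrete, the following is a minimal sketch in TensorFlow/Keras (the experiments in this paper were implemented in TensorFlow); the function and argument names, the 2× pooling/deconvolution factors, and the 1×1 output channel count are illustrative assumptions rather than the authors' exact configuration.

```python
import tensorflow as tf

def fuse_features(conv3, conv4, conv5, out_channels=256):
    """Fuse VGG-16 conv_3/conv_4/conv_5 feature maps at the conv_4 resolution."""
    # Shallow layer: max pooling subsamples conv_3 to conv_4's spatial size.
    c3 = tf.keras.layers.MaxPool2D(pool_size=2, strides=2, padding='same')(conv3)

    # Deep layer: a deconvolution (transposed convolution) upsamples conv_5.
    c5 = tf.keras.layers.Conv2DTranspose(
        filters=conv5.shape[-1], kernel_size=4, strides=2, padding='same')(conv5)

    # conv_4 keeps its original size; all three maps are normalized with LRN
    # so their activation scales are comparable before fusion.
    c3 = tf.nn.local_response_normalization(c3)
    c4 = tf.nn.local_response_normalization(conv4)
    c5 = tf.nn.local_response_normalization(c5)

    # One additional convolution per feature map, then channel-wise concatenation.
    fused = [tf.keras.layers.Conv2D(out_channels, 1, padding='same')(x)
             for x in (c3, c4, c5)]
    return tf.keras.layers.Concatenate(axis=-1)(fused)
```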

#### *3.4. Visual Attention Mechanism*

AF R-CNN uses two parallel branches, a trunk branch and an attention branch, to generate attention-aware features and then selects regions on the feature map after the attention mask has been applied. Our attention module is constructed with a mask branch. Given the feature map, *T*(*x*), as input, the mask branch learns a mask, *M*(*x*), using a bottom-up top-down structure [7,33–35]. The mask, *M*(*x*), acts as a set of control gates for *T*(*x*). The attention module output, *H*(*x*), is defined as:

$$H\_{i,c}(x) = M\_{i,c}(x) \times T\_{i,c}(x) \tag{2}$$

where *i* ranges over all spatial positions and *c* represents channels.

Figure 6 displays the architecture of the mask branch. The mask branch consists of two steps: a feed-forward sweep and top-down feedback. Together, the two steps behave as a bottom-up top-down fully convolutional structure. The feed-forward sweep quickly extracts global information, and the top-down feedback feeds this global information back into the feature map. First, max pooling was performed on the fused feature map to enlarge the receptive field. After reaching the lowest resolution, the global information, expanded by the top-down structure, guides the features. The expansion structure upsamples the global prior information by bilinear interpolation. To make the output feature map the same size as the input feature map, the output dimension of the bilinear interpolation was matched to that of the max pooling. The attention module then used a sigmoid layer to normalize the output, giving *M*(*x*) values ranging from 0 to 1, which serve as the attention weight of the feature map at each location. This weight matrix, *M*(*x*), and the feature map matrix, *T*(*x*), were multiplied element-wise to obtain the desired attention-weighted feature map.

This encoder-decoder (bottom-up top-down) structure is often used in image segmentation, for example in FCN [7]. We applied a bottom-up top-down structure to the object detection network. The structure first extracts high-level features and increases the receptive field of the model through a series of convolution and pooling operations. Pixels activated in the high-level feature map reflect the areas where attention is located. The feature map was then enlarged by upsampling to the same size as the original input, so that the attended area was mapped onto each pixel of the input; we call this the attention map, *M*(*x*). Each pixel value in the attention map *M*(*x*) output by the soft mask branch is the weight of the corresponding pixel in the original feature map, which enhances meaningful features and suppresses meaningless information. Multiplying this weight matrix, *M*(*x*), with the feature map matrix, *T*(*x*), yields the desired attention-weighted feature map, *H*(*x*).

**Figure 6.** The architecture of the mask branch. The mask branch contains two steps: a feed-forward sweep and top-down feedback. The two steps behave as a bottom-up top-down fully convolutional structure.
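The following is a minimal sketch of this soft-mask attention step in the same TensorFlow/Keras style; the pooling depth (a single 2× pooling) and the convolutions inside the branch are illustrative assumptions, not the authors' exact mask branch.

```python
import tensorflow as tf

def attention_module(feature_map):
    """Soft-mask attention: H(x) = M(x) * T(x), applied element-wise."""
    in_size = tf.shape(feature_map)[1:3]
    channels = feature_map.shape[-1]

    # Feed-forward sweep (bottom-up): max pooling enlarges the receptive
    # field and gathers global information at a lower resolution.
    m = tf.keras.layers.MaxPool2D(pool_size=2, strides=2, padding='same')(feature_map)
    m = tf.keras.layers.Conv2D(channels, 3, padding='same', activation='relu')(m)

    # Top-down feedback: bilinear interpolation expands the global
    # information back to the size of the input feature map.
    m = tf.image.resize(m, in_size, method='bilinear')

    # Sigmoid normalizes the mask M(x) to [0, 1] at every location.
    mask = tf.keras.layers.Conv2D(channels, 1, activation='sigmoid')(m)

    # Element-wise product gives the attention-weighted feature map H(x).
    return mask * feature_map
```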

The object detection module shares its convolution parameters with the attention module to achieve an integrated, end-to-end object detection network. The mask branch aims to improve the features rather than solve the complex detection problem directly; it works as a feature selector that enhances good features and suppresses noise in the features.

#### *3.5. AF R-CNN Loss Function*

We minimize an objective function following the multi-task loss used in Faster R-CNN. The loss function of AF R-CNN is defined as:

$$L(\{p\_i\}, \{t\_i\}) = \frac{1}{N\_{cls}} \sum\_i L\_{cls}(p\_i, p\_i^\*) + \lambda \frac{1}{N\_{reg}} \sum\_i p\_i^\* L\_{reg}(t\_i, t\_i^\*) \tag{3}$$

Here, *i* is the index of an anchor, *pi* is the predicted probability that anchor *i* is an object, and *ti* is a vector of the four parameters of the predicted bounding box. The classification loss, *Lcls*, is the loss over two classes (object vs. not object). The two terms are normalized by *Ncls* for classification (cls) and *Nreg* for regression (reg).

The ground-truth label is defined as follows:

$$p\_i^\* = \begin{cases} 1, & \text{the anchor is positive} \\ 0, & \text{the anchor is negative} \end{cases} \tag{4}$$

The regression loss is defined as:

$$L\_{reg}(t\_i, t\_i^\*) = R(t\_i - t\_i^\*) \tag{5}$$

where *R* is the robust loss (smooth L1), defined as follows:

$$R(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{6}$$
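As a quick numeric check of this robust (smooth L1) loss, the short NumPy snippet below reproduces Equation (6); the function name is ours.

```python
import numpy as np

def smooth_l1(x):
    """Robust loss R(x) of Eq. (6): quadratic near zero, linear elsewhere."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

# Small errors are penalized quadratically, large errors linearly.
print(smooth_l1([-2.0, -0.5, 0.0, 0.5, 2.0]))  # values: 1.5, 0.125, 0.0, 0.125, 1.5
```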

We set *λ* = 10 so that the classification and regression terms were roughly equally weighted. The parameterizations *ti* and *ti*∗ are defined as follows:

$$t\_x = \frac{(x - x\_a)}{w\_a}, \quad t\_y = \frac{(y - y\_a)}{h\_a} \tag{7}$$

$$t\_w = \log\left(\frac{w}{w\_a}\right), \quad t\_h = \log\left(\frac{h}{h\_a}\right) \tag{8}$$

$$t\_x^\* = \frac{(x^\* - x\_a)}{w\_a}, \quad t\_y^\* = \frac{(y^\* - y\_a)}{h\_a} \tag{9}$$

$$t\_w^\* = \log\left(\frac{w^\*}{w\_a}\right), \quad t\_h^\* = \log\left(\frac{h^\*}{h\_a}\right) \tag{10}$$

where *x*, *y*, *w*, and *h* denote the center coordinates, width, and height of the box. *xa* (*ya*, *wa*, *ha*) denotes the same quantities for the anchor box, and *x*∗ (*y*∗, *w*∗, *h*∗) those of the ground-truth box.
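A minimal NumPy sketch of this parameterization (Equations (7)–(10)) is given below; the function and argument names are ours, and the example coordinates are arbitrary.

```python
import numpy as np

def encode_box(box, anchor):
    """Encode a box (x, y, w, h) relative to an anchor (xa, ya, wa, ha).

    Applying this to the predicted box gives (tx, ty, tw, th) of Eqs. (7)-(8);
    applying it to the ground-truth box gives the targets of Eqs. (9)-(10).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa          # horizontal shift, normalized by anchor width
    ty = (y - ya) / ha          # vertical shift, normalized by anchor height
    tw = np.log(w / wa)         # log-space width ratio
    th = np.log(h / ha)         # log-space height ratio
    return tx, ty, tw, th

# Example: a box slightly shifted and rescaled relative to its anchor.
print(encode_box((55.0, 42.0, 120.0, 80.0), (50.0, 40.0, 100.0, 100.0)))
```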

#### **4. Experiments**

We performed a series of comprehensive experiments on the PASCAL VOC 2007 [36] and 2012 datasets. The PASCAL VOC 2007 dataset contains 20 object categories. The image sizes vary, typically 500 × 375 pixels (landscape) or 375 × 500 pixels (portrait). Mean average precision (mAP) was used to measure the performance of the object detection network. All experiments were performed on a ThinkStation P510 PC with an Intel(R) Xeon(R) E5-2623 2.6 GHz CPU and a Quadro M5000 GPU with 8 GB of memory. Our experiments were implemented in Python with TensorFlow.

In our experiments, AF R-CNN was trained on a training set and the detection results were obtained on the PASCAL VOC 2007 test set. Training data "07" denotes VOC 2007 trainval and "07+12" denotes the union of VOC 2007 trainval and VOC 2012 trainval. AF R-CNN and Faster R-CNN use the same pre-trained model, VGG-16, which has 13 convolutional layers and three fully-connected layers. All networks were trained using stochastic gradient descent (SGD) [37] with an initial learning rate of 0.001. For AF R-CNN, we re-scaled the images so that the short side was 600 pixels. The initial image also needed to be preprocessed: its size was fixed to 224 × 224 by mapping, and the fixed-size image was then input into the VGG network for feature extraction. We compared AF R-CNN and Faster R-CNN for object detection on the PASCAL VOC 2007 dataset.
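For reference, the training settings stated above can be collected into a small configuration dictionary; this is only a summary of the hyperparameters given in the text, not the authors' actual training script.

```python
# Summary of the training settings described in the text (not the original script).
train_config = {
    "backbone": "VGG-16",            # 13 conv layers + 3 fully-connected layers
    "optimizer": "SGD",              # stochastic gradient descent
    "initial_learning_rate": 0.001,
    "short_side_resize": 600,        # images re-scaled so the short side is 600
    "fixed_input_size": (224, 224),  # image mapped to a fixed size before VGG
    "train_sets": {"07": "VOC2007 trainval",
                   "07+12": "VOC2007 trainval + VOC2012 trainval"},
    "test_set": "VOC2007 test",
}
```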

Our experiments consisted of two parts. In the first part, we analyzed the effectiveness of each component of AF R-CNN, including feature fusion and the soft mask branch in the attention module. In the second part, we compared our network with state-of-the-art results on the PASCAL VOC 2007 dataset.

Table 1 shows the results of the different object detection networks. The first and second columns represent the results of Faster R-CNN on the PASCAL VOC 2007 dataset. The third (fourth) column shows the results when only the attention module (feature fusion) was added to Faster R-CNN. The last two columns show the results of the proposed AF R-CNN, which contains both feature fusion and the attention module.

To understand the benefit of the attention mechanism, we calculated the mean average precision (mAP) over the 20 object categories. The attention module in AF R-CNN is designed to suppress noise while keeping useful information by applying the element-wise product between the feature map and the soft mask. From Table 1, we can see that adding the attention module improved the performance of the object detection network: Faster R-CNN+attention achieved an mAP of 70.3%, 0.4 points higher than Faster R-CNN. The attention mechanism enhances feature contrast, and the attention-aware features were more effective for salient objects, such as the car, boat, and person. However, for small targets, such as the bottle, bird, and plant, the detection performance was slightly reduced. This is because the attention mechanism is more effective for salient objects; for less salient targets, such as smaller and shallower targets, it did not achieve the desired results.


**Table 1.** Results of the Faster R-CNN and AF R-CNN on the dataset, PASCAL VOC 2007.

The mAP of the detection network Faster R-CNN+fusion was 75.0%, which was 5.1 points higher than Faster R-CNN on the PASCAL VOC 2007 dataset. As shown above, this is because the fused features are more informative than the deep convolution features alone, and their resolution is better suited to detection, especially when the object size is small. Our detection network outperformed Faster R-CNN on small objects: for the plant, Faster R-CNN+fusion achieved an AP of 53.8%, an improvement of 14.7 points, and for the bottle, it achieved an AP of 60.2%, 10.3 points higher than Faster R-CNN. This shows that multi-layer feature fusion can effectively improve the detection of small targets.

Our AF R-CNN outperformed Faster R-CNN regardless of the object size. We compared AF R-CNN with state-of-the-art methods, including Faster R-CNN [5], A-Fast-RCNN [19], CRAFT [38], and SSD [39], on the PASCAL VOC 2007 dataset. The results are shown in Table 2. AF R-CNN outperformed all of these methods on PASCAL VOC 2007: the object detection network proposed in this article achieved an mAP of 75.9%, 6.0 points higher than Faster R-CNN, owing to feature fusion and the attention module. These results suggest that our method offers both high efficiency and good performance.

**Table 2.** A comparison between the proposed method with the existing models on the dataset, PASCAL VOC 2007.


Figure 7 shows visualizations of the object detection results. Left: the input images. Middle: the detections of Faster R-CNN. Right: the detections of AF R-CNN. We selected several representative images. First, to see the effect of our method with respect to object size, we selected a large plant (the fourth row) and a small bird (the second row). To assess the effectiveness of our approach on multiple objects and object defects, we chose the images in the first, third, and fifth rows. Some of these images cover several cases at once; for example, the last image contains many objects and the objects are very small. As can be seen from the results in Figure 7, the overall detection quality of the network was improved. The attention module helped detect salient targets, while feature fusion effectively detected small targets. The experiments demonstrate that our proposed method is more effective than Faster R-CNN.

**Figure 7.** Results of object detection. (**a**) shows the input images of the object detection networks; (**b**) shows the visualization of the detection results of the Faster R-CNN; (**c**) shows the visualization of the detection results of the AF R-CNN.
