3.2. Mask-Refined Region-Convolutional Neural Network (MR R-CNN)
Mask R-CNN benefits from the light weight of its mask head. However, letting the mask head "gain weight" in a suitable way can still have a substantial impact on accuracy. The structure of MR R-CNN is illustrated in Figure 2. The details of our framework are as follows.
Mask R-CNN is the basic network of this article; its pipeline is shown in Figure 3 and is briefly introduced below.
Backbone: For each input image, Mask R-CNN uses a Residual Network (ResNet) as the backbone for feature extraction. An FPN is added to the backbone; it comprises three parts: a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway is the ResNet itself, which is divided into five stages according to the size of the feature map. Except for conv1 of stage 1, which is not used, the outputs of the last layers of stage 2 to stage 5 are defined as {C2, C3, C4, C5}, respectively; their strides with respect to the original image are {4, 8, 16, 32}. The top-down pathway up-samples starting from the highest layer, and the lateral connections fuse the up-sampled results with the feature maps of the same size generated by the bottom-up pathway. Each layer in {C2, C3, C4, C5} undergoes a 1 × 1 convolution that sets all output channels to 256, and the result is then fused with the up-sampled feature map, as sketched below.
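The following is a minimal PyTorch sketch of this top-down pathway. The class name, the ResNet-50 channel counts (256, 512, 1024, 2048) and the use of nearest-neighbor up-sampling with element-wise addition follow the standard FPN design and are illustrative rather than taken verbatim from our implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Minimal FPN top-down pathway sketch (names are illustrative).

    Takes bottom-up features {C2, C3, C4, C5} with strides {4, 8, 16, 32}
    and produces merged maps {P2, P3, P4, P5}, each with 256 channels.
    """
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1 x 1 lateral convolutions unify all channel counts to 256.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, c2, c3, c4, c5):
        p5 = self.lateral[3](c5)
        # Up-sample the coarser map by 2x and fuse it with the lateral connection.
        p4 = self.lateral[2](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[1](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        p2 = self.lateral[0](c2) + F.interpolate(p3, scale_factor=2, mode="nearest")
        return p2, p3, p4, p5
```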
RPN: After the backbone, each feature map is input to the RPN, which comprises two parallel paths. The first path classifies anchors as positive or negative through a softmax. The second path computes the bounding-box regression offsets of each anchor to obtain accurate proposals. The final proposal layer synthesizes the positive anchors and the corresponding bounding-box regression offsets to generate proposals, while discarding proposals that are too small or out of boundary. It is worth noting that the RPN selects the most appropriate scale $P_k$ (for $k = 2, 3, 4, 5$) from the outputs of the backbone feature pyramid to extract each region of interest (ROI). The formula that decides from which $P_k$ an ROI of width $w$ and height $h$ should be cut is:
$$ k = \left\lfloor k_0 + \log_2\!\left(\sqrt{wh}/224\right) \right\rfloor $$
Here, 224 represents the size of the images in ImageNet used for pre-training, and an ROI with an area of $224 \times 224$ should be assigned to the $k_0$-th level. $k_0$ is 4, the optimized value experimentally obtained in Mask R-CNN, which means that an ROI of $224 \times 224$ should be extracted from $P_4$. If the scale of the ROI is less than 224 (112, for instance), then $k = 3$, and the ROI will be extracted from the higher-resolution $P_3$. This approach is reasonable: large-scale ROIs should be extracted from low-resolution feature maps, which are good at detecting large objects, and small-scale ROIs should be extracted from high-resolution feature maps, which are good at detecting small objects. After this series of operations, the object is finally localized.
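For concreteness, the level-assignment formula can be written as a small Python function; the clamping to the available levels P2 through P5 is an assumption of this sketch.

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Select the pyramid level P_k for an ROI of width w and height h.

    Implements k = floor(k0 + log2(sqrt(w*h) / canonical)), clamped to
    the available levels P2-P5 (the clamping is this sketch's assumption).
    """
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k, k_max))

# A 224 x 224 ROI maps to P4; a 112 x 112 ROI maps to the
# higher-resolution P3.
assert fpn_level(224, 224) == 4
assert fpn_level(112, 112) == 3
```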
ROIAlign: Mask R-CNN proposes ROIAlign to replace the ROIPooling of Faster R-CNN. ROIPooling involves two rounding steps. First, the boundary of each region proposal usually does not fall on integer coordinates, but it is rounded to integers for convenience. Second, the rounded region is evenly divided into cells, and the boundary of each cell is rounded again. After these two roundings, each region proposal has deviated from its original position. To solve this problem, ROIAlign cancels the rounding operations and retains the decimals: bilinear interpolation is used to obtain values at floating-point coordinates, as in the sketch below.
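A simplified sketch of the bilinear sampling that ROIAlign performs at floating-point coordinates is shown below. It assumes the sampled point lies strictly inside the feature map; the real operator additionally averages several such samples per output cell.

```python
import numpy as np

def bilinear_sample(feature, x, y):
    """Sample a 2-D feature map at float coordinates (x, y).

    ROIAlign keeps fractional coordinates instead of rounding them and
    reads values via bilinear interpolation of the 4 nearest cells.
    Assumes (x, y) lies strictly inside the map.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    # Interpolation weights come from the fractional parts.
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * feature[y0, x0]
            + wx * (1 - wy) * feature[y0, x1]
            + (1 - wx) * wy * feature[y1, x0]
            + wx * wy * feature[y1, x1])
```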
Classification and bounding-box regression: As in Faster R-CNN, Mask R-CNN uses a softmax to classify each acquired ROI, while a parallel path performs bounding-box regression; non-maximum suppression is then applied to remove the bounding boxes that mark the same object multiple times (see the usage example below).
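As a usage example, torchvision's non-maximum suppression operator can be called as follows; the boxes, scores and the 0.5 IoU threshold are illustrative values, not ours.

```python
import torch
from torchvision.ops import nms

# Three boxes in (x1, y1, x2, y2) format; the first two overlap heavily
# and mark the same object.
boxes = torch.tensor([[10., 10., 110., 110.],
                      [12., 12., 112., 112.],
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.95, 0.80, 0.90])

# Keep the highest-scoring box of every overlapping group (IoU > 0.5).
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)  # tensor([0, 2]): the duplicate box at index 1 is suppressed
```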
Mask: A new head network, in which feature maps are classified at the pixel level by a simple fully convolutional network consisting of convolutional and deconvolutional layers. Since each proposal contains only one foreground object, semantic segmentation of each proposal is equivalent to instance segmentation of the original image.
The segmentation masks of Mask R-CNN are not accurate enough. For semantic segmentation, improving the quality of the mask is always a challenge, because the receptive fields corresponding to adjacent pixels in the image often carry very similar information. This "similarity" has both advantages and disadvantages. If adjacent pixels are located inside the foreground object or the background, the "similarity" is advantageous, and these interior pixels are usually predicted correctly. However, if the adjacent pixels are located at an object edge, the "similarity" has a negative impact. The network structure of the mask head does not take the receptive field into account, so the network considers the global context information incompletely and ignores the relationships between pixels.
In order to solve this problem, a feature pyramid network is established in the mask head in this paper. By fusing the information of feature maps of different scales, the network simultaneously takes into account feature information from receptive fields of different sizes and has sufficient contextual information to assist classification during segmentation. To achieve this feature fusion, we must study the structure of the mask head network and adjust the scale of its input (transmitted from ROIAlign) accordingly.
The essence of ROIAlign is "resize": it converts a large number of feature maps of different scales to the same size, thereby facilitating the subsequent head-network operations. The output size of ROIAlign is worth studying. Because the number of feature maps is much larger than the number of original images, if the feature maps are too large, the subsequent head network will be overwhelmed and time-consuming. Although a small feature-map scale helps the prediction speed of the network, it also degrades the prediction accuracy, because too much image information is lost.
In Mask R-CNN, the maps are transferred through ROIAlign with strd = 32, the stride optimized according to experiments; for mask prediction, however, there is still substantial room for improvement. The creation of a feature pyramid requires scaling the map multiple times and fusing the feature maps of various scales [20]. To create a feature pyramid in the mask head, the original input size is not sufficient, since the model could perform the add operation only once. It would be feasible to preserve the stride of ROIAlign and magnify the map via bilinear interpolation inside the mask head; however, the interpolation algorithm damages the map information. Therefore, we keep the input sizes of the other heads constant, halve the stride of the ROIAlign that feeds the mask head, and neutralize the cost of this operation via the enhancement brought by the feature pyramid. Strd = 16 is used for ROIAlign because it is found to yield high accuracy, as illustrated below.
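One possible reading of this setting, sketched with torchvision's RoIAlign operator, is that halving the stride doubles the mask branch's input resolution (e.g., from 14 × 14 to 28 × 28) while the box/class branch keeps its input size; the concrete sizes and the spatial_scale below are assumptions for illustration.

```python
import torch
from torchvision.ops import RoIAlign

# Hypothetical configuration: the mask branch receives a 28 x 28 crop
# sampled from a feature map whose stride relative to the image is 16.
mask_roi_align = RoIAlign(output_size=(28, 28), spatial_scale=1 / 16,
                          sampling_ratio=2)

features = torch.randn(1, 256, 50, 50)             # one pyramid-level feature map
rois = torch.tensor([[0., 64., 64., 320., 320.]])  # (batch_idx, x1, y1, x2, y2)
mask_input = mask_roi_align(features, rois)
print(mask_input.shape)  # torch.Size([1, 256, 28, 28])
```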
The refinement framework is the core of the proposed algorithm. To obtain the best experimental results, three factors must be considered simultaneously: the size of the image passed into the mask head network, the number of feature fusion operations, and the manner of feature fusion. The size of the input image greatly affects the computation time; the number of feature fusion operations affects the degree to which the network considers the relationships between pixels and contextual information; and the manner of feature fusion affects both the calculation speed and the segmentation accuracy. The optimized refinement framework finally arrived at is described below.
As shown in Figure 2, the feature map that passes through the ROIAlign of strd = 16 is input into our mask head and is repeatedly halved by the forward network's convolution-pooling-batch normalization operations until the resulting feature maps reach 7 × 7, beyond which they can no longer be shrunk. Then, in the reverse network, the map is enlarged using 2 × 2 deconvolutional layers with ReLU, and the add operation is performed with the forward-propagating feature maps of the same resolution after they pass through a 3 × 3 convolutional layer and a ReLU. Next, the merged maps pass through another 3 × 3 convolutional layer and a ReLU. If the maps in the forward network were transmitted directly, their amount of information would be large, and the back-propagated signal would likely be flooded. Therefore, we reduce the amount of information of the forward maps via convolution and ReLU while maintaining the map resolution, which also decreases the computational cost of the network. After two such operations, the map size is restored to 28 × 28, namely the size used in [21], and the map is output by a 1 × 1 deconvolution. We also tried replacing the add operation with a concatenate operation, but its effect was poor in comparison. A sketch of this head is given below.
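The following is an illustrative PyTorch rendition of this head under stated assumptions: the channel width (256), the number of classes, and the exact placement of ReLU and batch normalization are not fixed by the description above and are chosen here for concreteness; the final 1 × 1 deconvolution is rendered as a 1 × 1 convolution, which is equivalent at stride 1.

```python
import torch.nn as nn
import torch.nn.functional as F

def down_block(cin, cout):
    # Forward-network stage: convolution, pooling, batch normalization
    # (halves the resolution).
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.MaxPool2d(2),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class RefinedMaskHead(nn.Module):
    """Sketch of the mask-head feature pyramid; channel counts and the
    number of classes are assumptions, not values from this paper."""
    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        self.down1 = down_block(channels, channels)   # 28 -> 14
        self.down2 = down_block(channels, channels)   # 14 -> 7
        # Reverse network: 2 x 2 deconvolutions enlarge the map again.
        self.up1 = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # 7 -> 14
        self.up2 = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # 14 -> 28
        # 3 x 3 convolutions thin out the forward maps before the add,
        # so the back-propagated signal is not flooded.
        self.lat1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.lat2 = nn.Conv2d(channels, channels, 3, padding=1)
        # 3 x 3 convolutions applied after each fusion.
        self.post1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.post2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.out = nn.Conv2d(channels, num_classes, 1)  # 1 x 1 output layer

    def forward(self, x):                      # x: (N, 256, 28, 28)
        f14 = self.down1(x)                    # (N, 256, 14, 14)
        f7 = self.down2(f14)                   # (N, 256, 7, 7)
        y = F.relu(self.up1(f7))               # (N, 256, 14, 14)
        y = F.relu(self.post1(y + F.relu(self.lat1(f14))))
        y = F.relu(self.up2(y))                # (N, 256, 28, 28)
        y = F.relu(self.post2(y + F.relu(self.lat2(x))))
        return self.out(y)                     # per-class 28 x 28 masks
```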
In order to further improve network performance, we conduct further research on the manner of feature fusion. The fusion of the feature maps propagated through the forward network with the back-propagated maps is critical [20]. The amount of information in the feature maps from the forward network is much larger than that from the back-propagation network. Directly adding these maps raises two issues: the amount of computation is too large, and the former's signal floods the latter's, thereby degrading the impact of the latter [45,46]. Therefore, a 3 × 3 convolutional layer with a ReLU is applied to the forward network; this step merely reduces the amount of information in the feature map without changing its size. The feature maps are then added to the back-propagated map and, finally, propagated forward through a 3 × 3 convolutional layer with a ReLU to reduce the amount of information. This lateral connection is highly effective, and the deeper the FPN, the greater the impact of this step.
The head of the network is trained on RPN proposals. The training samples must have an intersection over union (IoU) larger than 0.5 between the proposal and the corresponding ground truth, which is consistent with Mask R-CNN. To generate a regression target for each training sample, the prediction mask of the object class is obtained and binarized with a threshold of 0.5. In training, pre-trained parameters are used: first, the backbone and ROI parameters are held constant and the mask head is trained separately; after the optimal head has been determined, fine-tuning is conducted on the complete network, as sketched below.
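A minimal sketch of this two-stage schedule is given below, using torchvision's Mask R-CNN as a stand-in for our network; the module names, learning rates and the choice of SGD are assumptions for illustration.

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# torchvision's Mask R-CNN stands in for the proposed network here.
model = maskrcnn_resnet50_fpn()

def set_trainable(module, flag):
    # Enable or disable gradient updates for a sub-network.
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: hold the backbone and box branch constant and train the
# mask head alone.
set_trainable(model.backbone, False)
set_trainable(model.roi_heads.box_head, False)
set_trainable(model.roi_heads.box_predictor, False)
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.02, momentum=0.9, weight_decay=1e-4)

# Stage 2: once the optimal head is determined, unfreeze everything and
# fine-tune the complete network (typically with a lower learning rate).
set_trainable(model, True)
fine_tune_opt = torch.optim.SGD(model.parameters(), lr=0.002,
                                momentum=0.9, weight_decay=1e-4)
```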