**2. Related Work**

In [6], a deep neural network based object detection approach called RCNN was first introduced. It consists of three parts. First, using the selective search algorithm, RCNN generates a series of candidate regions, each of which may be responsible for detecting a specific object in the image. Second, a CNN extracts features from the candidate regions; the network is trained in a supervised manner by measuring the error between predictions and ground truths. Finally, an SVM and bounding box regression are adopted to refine the predicted results. The framework of RCNN is similar to traditional approaches, except that a CNN is adopted to extract image features instead of handcrafted features. In [7], an approach called Fast RCNN was proposed. It takes the whole image as input, crops the features corresponding to each candidate region, maps them to a uniform size by region of interest (ROI) pooling, and finally feeds the mapped features into classification and regression networks to obtain each region's category and location. Fast RCNN integrates the coarse and fine-tuning processes of RCNN, which improves both detection performance and efficiency. In [5], an end-to-end CNN based object detection framework called Faster RCNN was introduced. Instead of generating candidates by selective search, this approach proposes a region proposal network (RPN). It first generates a series of anchor boxes with different scales and aspect ratios; whether each anchor is responsible for detecting an object depends on the intersection over union (IOU) between its coordinates and the annotated ground truth. As the first end-to-end object detection approach, Faster RCNN laid the foundation for subsequent deep neural network based object detection approaches. In [3], an approach called SSD was introduced, which integrates the functions of the RPN and the classification/regression network into a single series of convolution layers.
At present, approaches whose frameworks are similar to Faster RCNN are categorized as two-stage approaches, whereas approaches like SSD are called one-stage approaches.
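
The anchor mechanism described above can be illustrated with a minimal sketch. It generates anchors of several scales and aspect ratios at one location and labels each by its IOU with the ground truth. The function names (`iou`, `make_anchors`, `label_anchor`), the corner-format box representation, and the thresholds 0.7 and 0.3 are assumptions made for illustration; they are not taken from the implementations in [5].

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def make_anchors(cx: float, cy: float,
                 scales: List[float], ratios: List[float]) -> List[Box]:
    """Anchors centered at (cx, cy); each has area scale**2 and ratio = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5
            h = s * r ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def label_anchor(anchor: Box, gts: List[Box],
                 pos_thr: float = 0.7, neg_thr: float = 0.3) -> int:
    """Return 1 (object), 0 (background), or -1 (ignored during training).

    The thresholds are illustrative assumptions, not values from the source.
    """
    best = max((iou(anchor, g) for g in gts), default=0.0)
    if best >= pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1


anchors = make_anchors(50.0, 50.0, scales=[32.0, 64.0], ratios=[0.5, 1.0, 2.0])
ground_truths = [(34.0, 34.0, 66.0, 66.0)]
labels = [label_anchor(a, ground_truths) for a in anchors]
print(labels)
```

Anchors whose best IOU falls between the two thresholds are left out of training in this sketch, which is one common way to avoid ambiguous samples.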

Building on frameworks such as Faster RCNN and SSD, a variety of approaches have been proposed to enhance detection performance. For example, Feature Pyramid Networks (FPN) proposed a pyramid-architecture image feature extractor in which objects of different resolutions are assigned to corresponding layers; compared with the original Faster RCNN, FPN provides more suitable features for a given object [8]. Cascade RCNN proposed a training process in which the discriminating IOU threshold is gradually increased, so that the classification and regression networks are trained in an easy-to-hard manner [9]. RetinaNet proposed the focal loss, which increases the relative weight of hard samples so that training focuses on them [10]. Libra RCNN proposed balanced sampling, feature extraction, and loss function for the training process [11]. SRetinaNet proposed an anchor optimization method that helps detect small objects under specific parameter settings [12]. GA-RPN proposed an anchor optimization method that combines anchor boxes with semantic features [13].
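
To make the effect of the focal loss concrete, the following sketch evaluates its commonly stated binary form, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), for a single prediction. The function name `focal_loss`, the clamping constant, and the default values γ = 2 and α = 0.25 are assumptions for illustration. This is a scalar example, not the training code of [10].

```python
import math


def focal_loss(p: float, y: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Binary focal loss for one predicted probability p and label y in {0, 1}.

    gamma and alpha defaults are illustrative assumptions.
    """
    p = min(max(p, 1e-7), 1.0 - 1e-7)      # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p          # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def cross_entropy(p: float, y: int) -> float:
    """Plain binary cross-entropy, for comparison."""
    p = min(max(p, 1e-7), 1.0 - 1e-7)
    return -math.log(p if y == 1 else 1.0 - p)


easy, hard = 0.9, 0.1  # predicted probabilities for a positive sample
print("CE ratio hard/easy:", cross_entropy(hard, 1) / cross_entropy(easy, 1))
print("FL ratio hard/easy:", focal_loss(hard, 1) / focal_loss(easy, 1))
```

The factor (1 - p_t)^γ shrinks the loss of well-classified samples far more than that of misclassified ones, so the hard-to-easy loss ratio is larger under the focal loss than under cross-entropy. With γ = 0 the expression reduces to α-weighted cross-entropy.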

Regardless of whether an approach is one-stage or two-stage, the foundation of object detection is the matching mechanism between anchors and ground truths, which determines how many samples can be included in network training. Therefore, it is important to design a proper matching mechanism to enhance detection performance. However, the randomness of object scales poses challenges to the matching mechanism [12,13].
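
The dependence of the number of training samples on object scale can be seen in a small sketch. It places square anchors of one fixed size on a regular grid and counts how many reach an IOU threshold with ground-truth boxes of different sizes. The grid stride, anchor size, image size, threshold of 0.5, and all function names are hypothetical choices for illustration; they do not reproduce the settings of [12] or [13].

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def centered_box(cx: float, cy: float, size: float) -> Box:
    """Square box of side `size` centered at (cx, cy)."""
    half = size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)


def grid_anchors(image_size: int, stride: int, anchor_size: float) -> List[Box]:
    """One square anchor per grid cell, centered in the cell."""
    centers = range(stride // 2, image_size, stride)
    return [centered_box(cx, cy, anchor_size) for cy in centers for cx in centers]


def count_positives(anchors: List[Box], gt: Box, thr: float = 0.5) -> int:
    """Number of anchors whose IOU with the ground truth reaches thr."""
    return sum(1 for a in anchors if iou(a, gt) >= thr)


anchors = grid_anchors(image_size=128, stride=8, anchor_size=32.0)
for gt_size in (8.0, 32.0, 64.0):
    gt = centered_box(64.0, 64.0, gt_size)
    print(gt_size, count_positives(anchors, gt))
```

In this setting, an object whose size matches the anchor is matched by several anchors, while the much smaller and much larger objects are matched by none. This is the kind of imbalance that a fixed matching rule produces when object scales vary.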
