*3.3. Balanced Sampling*

Based on the division method described above, a balanced sampling approach is proposed. Positive samples are balanced by scale to form the positive sample set. Negative samples, which are determined by the anchor boxes, are balanced by jointly considering difficulty and scale to form the negative sample set. For a sample set with an upper limit of *N*, the sample collection procedure designed in this paper is given in Algorithm 1.

```
Algorithm 1 Balanced Sample Algorithm.
Inputs:
    Positive/Negative Sample Sets;
    Number of Select Samples N;
Outputs:
    Sample Set U;

divide_num = N / set_num        # initial per-interval quota
U = []
sort(Sets)                      # ascending by number of samples
for set in Sets:
    if n_set > divide_num:
        U.append(sample(set, divide_num))   # random subset of size divide_num
    else:
        U.append(set)                       # keep every sample of a short interval
        reshape(divide_num)                 # update the quota for the remaining intervals
return U
```

Ideally, the total numbers of positive and negative samples should be equal, so the approach initializes *divide\_num* to the average number of samples per interval, *N* / *set\_num*. If every interval satisfies *n<sub>set</sub>* > *divide\_num*, it suffices to randomly sample *divide\_num* samples from each interval to generate the set *U*. This ideal condition rarely holds in practice, however, so the imbalanced case must be handled explicitly. If the total number of samples is less than the upper sampling limit *N*, all samples are included in the sample set. Otherwise, the number of samples to draw uniformly from each interval, *divide\_num*, is calculated from the number of intervals, *set\_num*. Sampling then proceeds through the intervals in ascending order of sample count. If the number of samples in the current interval satisfies *n<sub>set</sub>* > *divide\_num*, then *divide\_num* samples are randomly selected from that interval; otherwise, all *n<sub>set</sub>* samples are included, and *divide\_num* is adjusted for the subsequent intervals using the reshape method.
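
The procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this paper: the function name `balanced_sample`, the dictionary-of-intervals input, the use of floor division for rounding, and the choice to recompute the quota before every interval (instead of only after a short interval) are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def balanced_sample(
    sets: Dict[str, Sequence[T]],
    n: int,
    rng: Optional[random.Random] = None,
) -> Dict[str, List[T]]:
    """Draw at most n samples, spread as evenly as possible over the intervals.

    Hypothetical helper: `sets` maps an interval name to its samples.
    """
    rng = rng or random.Random()

    # Fewer samples than the upper limit: keep everything.
    total = sum(len(samples) for samples in sets.values())
    if total <= n:
        return {name: list(samples) for name, samples in sets.items()}

    # Visit the intervals from the fewest samples to the most.
    ordered = sorted(sets.items(), key=lambda item: len(item[1]))

    selected: Dict[str, List[T]] = {}
    remaining = n  # part of the budget that is still unfilled
    for idx, (name, samples) in enumerate(ordered):
        sets_left = len(ordered) - idx
        # Per-interval quota: remaining budget over remaining intervals.
        # Floor division is an assumption; the text does not specify rounding.
        divide_num = remaining // sets_left
        if len(samples) > divide_num:
            selected[name] = rng.sample(list(samples), divide_num)
        else:
            # Short interval: keep all of its samples. The shortfall is
            # redistributed when the quota is recomputed on the next pass.
            selected[name] = list(samples)
        remaining -= len(selected[name])
    return selected
```

With hypothetical intervals of 20, 80, and 200 samples and `n = 120`, the initial quota is 40, the 20-sample interval is kept whole, and the quota for the two remaining intervals becomes 50, so the call returns 20, 50, and 50 samples respectively.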

The key point of the balanced sampling method is the reshape method for the case *n<sub>set</sub>* < *divide\_num*. All samples in such an interval should be retained, since the number of samples required exceeds the number actually available. Because the intervals are sampled in ascending order of sample count, all subsequent intervals hold surplus samples, which means that they satisfy the following condition:

$$\sum_{i=j+1}^{\text{set\_num}} n_i > (\text{set\_num} - j) \cdot \text{divide\_num} + (\text{divide\_num} - n_{\text{set}}) \tag{4}$$

In Equation (4), *j* is the index of the current interval among the sorted intervals. Since the shortfall can be made up from the surplus samples in the subsequent sampling process, a sufficient number of samples can still be collected. Therefore, as many samples as possible should be collected from the remaining intervals while maintaining the balance. The reshape method for updating *divide\_num* is designed as follows:

$$\text{divide\_num} = \frac{N - \sum_{i=1}^{j} n_i}{\text{set\_num}_{\text{left}}} \tag{5}$$

In Equation (5), *set\_num<sub>left</sub>* is the number of remaining intervals and *n<sub>i</sub>* is the number of samples in the *i*-th sorted interval. The number of samples drawn from each subsequent interval is updated accordingly. Take the collection of positive samples as an example, and suppose that the number of samples in the small\_positive interval, *n<sub>small</sub>*, is the lowest and is less than *divide\_num*. Then *divide\_num* is updated to (*N* − *n<sub>small</sub>*)/2 for sampling the subsequent intervals. If the numbers of samples in the medium\_positive and big\_positive intervals are greater than the updated value of *divide\_num*, they are uniformly sampled.

Through the balanced sampling method, both scale and difficulty are taken into account when generating the sample set, which can effectively increase the number of small-object samples and ensure sample diversity.
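
To make the positive-sample example concrete, the update can be traced with hypothetical numbers. The values *N* = 120, *n<sub>small</sub>* = 20, *n<sub>medium</sub>* = 80, and *n<sub>big</sub>* = 200 are assumptions chosen purely for illustration and are not taken from the experiments:

$$\text{divide\_num} = \frac{N}{\text{set\_num}} = \frac{120}{3} = 40, \qquad n_{\text{small}} = 20 < 40$$

$$\text{divide\_num} \leftarrow \frac{N - n_{\text{small}}}{2} = \frac{120 - 20}{2} = 50$$

Under these assumed counts, all 20 small samples are kept and 50 samples are drawn from each of the medium and big intervals (80 > 50 and 200 > 50), giving 20 + 50 + 50 = 120 = *N* samples in total.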
