*3.6. Detection*

Runtime operation of the network is performed through shared map execution of the CNNs. This allows detection predictions to be inferred from a full image frame far more efficiently than with the traditional sliding-window approach.

The process requires the input image to first be prepared as a multi-scale pyramid. This allows ears of all possible sizes relative to the image frame to be detected, so that detection succeeds regardless of the subject's distance from the camera.
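As a rough illustration, the pyramid construction can be sketched as follows. The scale factor of 1.25 and the 24-pixel minimum size are illustrative assumptions, not values from this work, and only the level geometry is computed here:

```python
def pyramid_sizes(height, width, scale_factor=1.25, min_size=24):
    """List the (height, width) of each pyramid level, shrinking by
    `scale_factor` until a level falls below the detector's minimum
    window size. Both parameters are illustrative assumptions."""
    sizes = []
    h, w = height, width
    while h >= min_size and w >= min_size:
        sizes.append((h, w))
        h = int(h / scale_factor)
        w = int(w / scale_factor)
    return sizes

# The 274 x 366 level mentioned in Figure 6 would sit somewhere
# inside such a pyramid; here it is used as the starting size.
levels = pyramid_sizes(274, 366)
```

Each level is then processed independently, so a single fixed-size detection window effectively scans the original image at every scale.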

Each pyramid level is given to each of the three networks and analyzed independently. Each network thus creates three output maps per level, one for each of the trained target classes: LE, RE, and BG. Figure 6 depicts the shared map execution of one of the networks for a particular pyramid level of size 274 × 366.

**Figure 6.** Shared map execution of one of the CNNs over a sample input image.

Every pixel in each of these output maps corresponds to that class's predicted likelihood at a window whose location can be traced back to the input image according to the shared map's alignment and position configuration. Figure 7 shows how windows can be reconstructed from these shared maps; they correspond precisely to the multiple detections that a traditional sliding-window approach would produce, but at a fraction of the computing time.
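The back-projection from an output-map pixel to an input-image window can be sketched as below; the stride and window size are hypothetical stand-ins for the network's cumulative subsampling and receptive-field size, which depend on the actual architecture:

```python
def map_to_window(px, py, stride=4, window=32, scale=1.0):
    """Trace an output-map pixel (px, py) back to a detection window
    (x, y, w, h) in the original image. `stride` approximates the
    network's cumulative subsampling, `window` its receptive field,
    and `scale` the pyramid level's factor relative to the original
    image. All three values here are illustrative assumptions."""
    x = int(px * stride * scale)
    y = int(py * stride * scale)
    size = int(window * scale)
    return (x, y, size, size)
```

Under this scheme, neighbouring output-map pixels yield heavily overlapping windows, which is exactly the multiple-detection pattern shown in Figure 7.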

**Figure 7.** Sample of multiple overlapping detections cast as individual detection windows on an input image.

In order to collapse these multiple detections into a single final result, a partitioning algorithm based on disjoint-set data structures is used. It is very similar to the *groupRectangles* and *partition* functions of OpenCV [35], but customized in a few particular ways. The algorithm groups similarly positioned and scaled windows as all belonging to a single object detection. Figure 8 shows a diagram of how the grouping algorithm behaves on various sample window clusters.
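A minimal sketch of such a disjoint-set (union-find) partitioning is shown below. The similarity predicate and its `eps` tolerance are illustrative, loosely mirroring the behaviour of OpenCV's *groupRectangles* rather than the exact customized rules used in this work:

```python
class DisjointSet:
    """Union-find structure over window indices."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def similar(w1, w2, eps=0.2):
    """Two windows (x, y, size) are merged when position and size differ
    by less than a fraction `eps` of their mean size (illustrative rule)."""
    x1, y1, s1 = w1
    x2, y2, s2 = w2
    delta = eps * 0.5 * (s1 + s2)
    return abs(x1 - x2) <= delta and abs(y1 - y2) <= delta and abs(s1 - s2) <= delta

def partition(windows, eps=0.2):
    """Group windows into clusters of mutually similar detections."""
    ds = DisjointSet(len(windows))
    for i in range(len(windows)):
        for j in range(i + 1, len(windows)):
            if similar(windows[i], windows[j], eps):
                ds.union(i, j)
    clusters = {}
    for i in range(len(windows)):
        clusters.setdefault(ds.find(i), []).append(i)
    return list(clusters.values())
```

Two nearly coincident windows end up in one cluster, while a distant window forms its own, matching the behaviour diagrammed in Figure 8.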

**Figure 8.** Sample of how the partitioning and grouping algorithm cleans up multiple overlapping detection windows.

This is a very common post-processing cleanup procedure in many computer vision tasks. For this particular work, however, a special grouping rule is introduced in order to weigh the grouping allowance.

For each of the two positive classes, LE and RE, the following procedure is performed:

Every window *i* has a value assigned to it corresponding to the neural network's output prediction at that window, denoted by *O<sub>i</sub>*. Each window's value is weighted by squaring it. Therefore, windows with a low prediction value have their overall importance reduced, whereas windows with a large output value to begin with maintain their standing in the grouping.

For a potential grouping cluster *j* composed of *N* windows, each with a weighted output value *O<sub>i</sub>*<sup>2</sup>, the final output value *G<sub>j</sub>* for the group is then given by:

$$G\_{j} = \sqrt{\frac{\sum\_{i}^{N} O\_{i}^{2}}{N}} \tag{5}$$

This corresponds to the RMS of all composing window output values in that cluster. As a result, the process favors clusters composed of windows with large, significant confidence outputs, whereas clusters containing windows with low confidence (as in the case of false positives) end up with a lower value.
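Equation (5) reduces to a few lines of code:

```python
import math

def group_score(outputs):
    """RMS of the per-window network outputs in a cluster, per Eq. (5).
    `outputs` is the list of O_i values for the N windows in cluster j."""
    return math.sqrt(sum(o * o for o in outputs) / len(outputs))
```

A cluster of uniformly confident windows keeps its score, while mixing in low-confidence windows drags the RMS down, which is the intended penalty on likely false positives.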

As each cluster has a single final numerical value assigned to it corresponding to its overall significance, a thresholding operation can be applied to all final clusters in order to reject those with low confidence.

In order to find a suitable threshold value, an experiment was performed over the full UBEAR test dataset. All final clusters generated in this process were manually classified as either true positive or false positive. Figure 9 shows the distributions of cluster output values for the true positive and false positive classes. Analysis of these distributions shows that the chosen threshold value of 0.224 best separates them, striking a balance between rejecting the large majority of false positive hits and keeping as many true positives as possible above the threshold.
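Applying the empirically determined threshold of 0.224 to the scored clusters is then a simple filter; the `(cluster, score)` pair layout used here is an assumption for illustration:

```python
THRESHOLD = 0.224  # value found empirically on the UBEAR test set

def filter_clusters(cluster_scores, threshold=THRESHOLD):
    """Keep only the clusters whose RMS score (Eq. 5) clears the
    threshold; `cluster_scores` is assumed to be a list of
    (cluster, score) pairs."""
    return [(c, s) for c, s in cluster_scores if s > threshold]
```

Raising the threshold trades recall for precision, which is exactly the sensitivity examined in Figure 10.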

Note that this process, although similar to traditional non-maximum suppression (NMS), has the added advantage of better filtering out detections that are likely to be false positives. NMS simply clusters boxes together and keeps the highest-confidence box per cluster, regardless of the distribution of confidence values among the remaining boxes. The proposed method, by comparison, takes the weighted distribution of all contributing detections into account, requiring every detection in a clustered set to carry a high confidence value for the cluster to survive, and thereby makes a more informed filtering decision.

**Figure 9.** Response of CNN outputs for true positive (TP) and false positive (FP) groups.

A summary of this whole process, starting from the inference, continuing through the grouping, and ending in the thresholding operations, is listed in Algorithm 1:

**Algorithm 1** The proposed process including steps for inference, grouping, and applying the threshold.

```
for all Z ∈ {PyramidScales} do
    for all A ∈ {L, M, S} do
        O_A^LE, O_A^RE, O_A^BG ← SharedMap(Image_Z, Network_A)
    end for
    for all K ∈ {LE, RE, BG} do
        O_{F,Z}^K ← Ensemble(O_S^K, O_M^K, O_L^K)
    end for
end for
for all K ∈ {LE, RE} do
    G^K ← Group(O_{F,Z}^K)
    if G^K > Threshold then
        Keep(G^K)
    else
        Discard(G^K)
    end if
end for
```
The threshold to use should be chosen carefully depending on the type of data being analyzed. In the case of the AMI database, where images are already prepared as cropped ears, the system detects no false positives whatsoever, and thus the threshold value does not affect the false positive rate in any way. In this case, a very low (or zero) threshold can be chosen in order to maximize the number of correctly detected ears. This can be seen in the results shown in Figure 10, which depicts the accuracy rate for varying threshold values.

In the case of natural images in non-cooperative environments, as with the UBEAR dataset, the effect of false positives is much more important. As Figure 10 shows, small variations in the threshold value lead to a drastic drop in the false positive rate, while not significantly affecting the accuracy of detected ears.

**Figure 10.** Threshold sensitivity on ear detections: (**Left**) AMI Dataset Detections, (**Right**) UBEAR Dataset Detections.
