*3.4. Discussion*

In the literature, several results concerning the UBEAR dataset can be found. In [10], the authors reported an accuracy of 98.22%, a significantly better result than the other techniques mentioned in that paper: the traditional Faster R-CNN reached 65.56%, whereas AdaBoost achieved 51.74%. An ensemble of classic CNNs, described in [12], achieved an accuracy of 75.08%. Our results cannot be directly compared with these values for several reasons. Firstly, in the mentioned papers, the models were trained on datasets other than UBEAR. Secondly, different methodologies were used to indicate detections (bounding boxes), and consequently, different evaluation measures had to be used to verify their correctness. Thirdly, we took into account only left ears, assuming that our approach would behave similarly for right ones. Finally, our network was trained using only one pose of the head.
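To illustrate the second point, bounding-box detectors are conventionally scored with an intersection-over-union (IoU) criterion, which has no direct counterpart for our superpixel-based output. The sketch below shows this standard measure; the 0.5 threshold and the sample boxes are conventional illustrative values, not taken from the cited papers.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A bounding-box detection is typically counted as correct when IoU >= 0.5;
# our method instead marks a single superpixel, so this measure does not apply.
predicted, ground_truth = (12, 40, 60, 110), (10, 38, 58, 105)
print(iou(predicted, ground_truth) >= 0.5)
```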

That is why, to demonstrate the quality of our method convincingly and objectively, we decided to train our model on the set $D^{TR}_{ALL}$ containing all the poses. The obtained results are included in Table 3. The network $\Phi_{ALL}$ trained in this way recognized images in the training, validation, and test sets with accuracies of 98.75%, 92.91%, and 94.15%, respectively, which was very satisfactory. Nevertheless, since the training data were more complex, to improve these results further, we extended our basic architecture with an additional layer at the input of the network, consisting of 20 filter groups. The results of that network, denoted as $\Phi^{+}_{ALL}$, are also shown in Table 3. One can observe an improvement, in particular for images in the validation set $D^{VA}_{ALL}$.
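The text does not spell out how the extra input layer is implemented, so the following PyTorch sketch is only a schematic illustration of the idea: a four-layer baseline extended with one additional layer prepended at the input. The names `GraphConv` and `build_network`, the feature width, and the stand-in aggregation rule are all assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Minimal graph convolution: aggregate neighbor features via the
    adjacency matrix, then apply a linear map. A loose stand-in for the
    oriented superpixel filters used in the paper."""
    def __init__(self, in_feats, out_feats):
        super().__init__()
        self.lin = nn.Linear(in_feats, out_feats)

    def forward(self, x, adj):
        # adj: (n, n) normalized adjacency of the superpixel graph
        return torch.relu(self.lin(adj @ x))

def build_network(extra_input_layer=False, feats=8):
    """Phi_ALL is modeled as 4 layers; Phi+_ALL prepends one more layer.
    The width `feats` is an illustrative guess."""
    layers = [GraphConv(feats, feats) for _ in range(4)]
    if extra_input_layer:
        # In the paper this layer consists of 20 filter groups; here it is
        # modeled loosely as one more graph convolution of the same width.
        layers.insert(0, GraphConv(feats, feats))
    return layers

# Toy forward pass over a random 50-node superpixel graph.
x = torch.randn(50, 8)
adj = torch.eye(50)
for layer in build_network(extra_input_layer=True):
    x = layer(x, adj)
print(x.shape)
```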

**Table 3.** Comparison of the detection accuracy for different subsets of the UBEAR dataset and different detection models. The reference solution is network $\Phi_0$, the best network trained using only samples from $D^{TR}_{M}$ (basic head orientation). The second column represents the approach where the maximum of the outputs of networks $\Phi_\theta$ indicates the ear position. In the third column, detection was considered successful if at least one $\Phi_\theta$ output showed the correct ear position; it presents the maximum possible accuracy if the correct $\theta$ can be found. The fifth and sixth columns contain the results of models trained using images with all head poses gathered in set $D^{TR}_{ALL}$. Network $\Phi_{ALL}$ has the same architecture as $\Phi_0$ (4 layers), whereas $\Phi^{+}_{ALL}$ has an additional layer at the beginning (5 layers). The last column contains the cardinality of every considered set.

**Figure 13.** Pictures (**a**–**k**) present the outputs of successive networks $\Phi_\theta$. The green rectangle shows the expected ear location. Yellow and red dots with a line indicate superpixels with the maximum value (yellow identifies the maximum among all the outputs; the line shows the orientation of the filters in the network). The last picture (**l**) shows the input image. Starting from angle $\theta = \pi/5$, the networks are able to detect the ear correctly. In (**b**,**k**), unexpected artifacts can be noticed. Picture (**f**) demonstrates an alternative detection region which, when observed locally, can indeed be mistakenly recognized as an ear.

The presented results also revealed, indirectly, that image pixels carry redundant information: in order to detect ears, it was not necessary to operate on millions of pixels, which usually leads to very complex models. Classic CNNs typically used for semantic segmentation (e.g., FCN [27], DeepLab [28], or SegNet [29]) require a large computational effort because they have tens of millions of trainable parameters. Our network had only several thousand parameters, which, in turn, sped up both training and the processing of single images. On the same CPU, the forward pass for one graph took 0.14 s on average, while FCN processed one scaled-down image about 15.6 times slower, i.e., in 2.19 s. Even taking into account the time required to transform the image into a graph (superpixel generation with the SLIC algorithm took 1.65 s), our approach produced very good results significantly faster.
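As a rough illustration of this preprocessing cost, the snippet below times SLIC superpixel generation with scikit-image. The image size, segment count, and compactness are arbitrary assumptions, since the parameters used in the experiments are not restated here.

```python
import time

import numpy as np
from skimage.segmentation import slic

# Synthetic stand-in for an input frame; a real image would be loaded instead.
image = np.random.rand(960, 1280, 3)

start = time.perf_counter()
# n_segments and compactness are illustrative guesses, not the paper's settings.
segments = slic(image, n_segments=500, compactness=10, start_label=0)
elapsed = time.perf_counter() - start

print(f"{segments.max() + 1} superpixels in {elapsed:.2f} s")
```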

**Figure 14.** Selected outputs of networks $\Phi_\theta$ for images where the maximum node value does not allow detecting the ear correctly: (**a**,**d**) input image; (**c**,**e**) output with correct detection; (**b**,**f**) output with the node having the maximum value, i.e., wrong detection. Unexpected local maxima can be observed in areas of uniform color. The convention for presenting the results is described in the caption of Figure 13.
