*3.3. Experiment II*

In the second experiment, the best network, trained to detect ears in their basic orientation (pose M), was applied to detect ears when the head was rotated in the image plane (poses U and D). This network, trained only on samples from D<sub>TR</sub><sup>M</sup>, will be further denoted as Φ<sup>0</sup> to indicate that it was not rotated. Table 2 presents the initial detection results obtained with this network for the D<sup>U</sup> and D<sup>D</sup> subsets. These results, between 80% and 90%, were surprisingly good. Three explanations seem possible. Firstly, in the UBEAR dataset, most of the cases with the U and D poses were apparently similar to pose M (the head rotation was not large). Secondly, ear orientation relative to the head is an individual feature; slightly rotated ears in D<sub>TR</sub><sup>M</sup> could have allowed Φ<sup>0</sup> to learn how to recognize them in D<sup>U</sup> and D<sup>D</sup>. Finally, and this is an interesting hypothesis in general, it may be sufficient to observe only the configuration of the image regions containing the head in order to detect ears. Humans do not need to see ear details to make correct detections. Perhaps our CNN, working on reduced graph representations, did the same thing.

To check whether network rotations can help in the detection of rotated ears, we prepared a set of 11 networks Φ<sup>*θ*</sup>. They were created from network Φ<sup>0</sup> by rotating its filters *ϕ* by angles *θ* equally distributed over the interval [−*π*, *π*]; the original network Φ<sup>0</sup> was, of course, a member of this set. Next, every image was processed by all of these networks, and the output for which the maximum node value was observed was taken as the final result (Figure 13). It was expected that such a system would possess the rotation equivariance property, i.e., the network Φ<sup>*θ*</sup> whose angle *θ* matched the ear orientation should give a rotated version of the response produced by Φ<sup>0</sup> for the basic ear orientation.
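
To make the procedure concrete, the following is a minimal sketch of the rotated-network ensemble in Python. All names are illustrative rather than the authors' code: it assumes the filters can be treated as small 2-D kernels rotated by bilinear interpolation, and that a user-supplied `forward` function returns one response value per graph node, abstracting away the actual graph-CNN machinery.

```python
import numpy as np
from scipy.ndimage import rotate

def make_rotated_networks(base_filters, n_networks=11):
    """Create one filter bank per angle, equally spaced in [-pi, pi].

    `base_filters` is assumed to be a list of small 2-D kernels of
    the trained network Phi_0; rotation uses bilinear interpolation.
    """
    angles = np.linspace(-np.pi, np.pi, n_networks)
    banks = [[rotate(f, np.degrees(theta), reshape=False, order=1)
              for f in base_filters]
             for theta in angles]
    return angles, banks

def detect_with_max_fusion(image_graph, forward, banks):
    """Run every rotated network and keep the per-node maximum.

    `forward(image_graph, filters)` is assumed to return one response
    per graph node; the node with the highest fused response is
    reported as the detected ear location.
    """
    responses = np.stack([forward(image_graph, f) for f in banks])
    fused = responses.max(axis=0)           # max over networks, per node
    best_network = responses.max(axis=1).argmax()
    return fused.argmax(), best_network     # detected node, winning angle
```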

The results obtained in this way are presented in Table 3. To our surprise, they were worse than the results of the single network Φ<sup>0</sup>. This means that the network rotations introduced additional maxima in wrong regions of the image. They can be observed in Figures 13 and 14, which illustrate the two typical causes of incorrect detections. Firstly, areas that locally, at a certain angle, look similar to an ear were indicated (Figure 14f). This behavior could be expected and cannot be avoided at this level of image representation. Secondly, there were maxima in completely unexpected locations (Figure 14b). We suspect that these artifacts are caused by the characteristics of the superpixels generated by the SLIC algorithm in regions of uniform color: such superpixels are very regular, and since the network Φ<sup>0</sup> was trained for only one orientation, it could not give correct responses when the graph was rotated.
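
This suspicion about the superpixel lattice can be illustrated directly: on a synthetic uniform image, SLIC falls back to its initial grid of seeds and produces an almost perfectly regular tessellation. The snippet below uses scikit-image; the parameter values are assumed for illustration only.

```python
import numpy as np
from skimage.segmentation import slic

# On a constant-color image there are no gradients to attract the
# seeds, so the resulting superpixels form a near-regular grid.
uniform = np.full((200, 200, 3), 0.5)         # uniform gray image
segments = slic(uniform, n_segments=100, compactness=10,
                start_label=1, channel_axis=-1)
print(len(np.unique(segments)))               # ~100 near-square cells
```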

After further analysis of the results, we also noticed that, even though taking the output with the maximum node value did not improve the detection results, the ear location was frequently indicated correctly by one of the networks Φ<sup>*θ*</sup> (Figure 14). To check whether this was a general rule, we conducted an additional experiment in which a result was accepted (counted as a correct detection) if any network Φ<sup>*θ*</sup> solved the task. These results are also shown in Table 3. The accuracy calculated in this way was above 96% for all poses. This proved that the required information was not lost and that rotations of the trained filters allow a satisfactory solution to be constructed.
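
For clarity, the acceptance criterion of this additional experiment can be expressed as the following oracle-style measurement. This is only a sketch: `is_correct` stands for the paper's detection criterion (e.g., sufficient overlap with the ground-truth ear region), which is not reproduced here.

```python
def oracle_accuracy(detections, ground_truth, is_correct):
    """Upper-bound accuracy over the rotated-network set.

    A sample counts as detected if ANY network Phi_theta localizes
    the ear correctly. `detections[i]` is the list of predictions of
    all networks on sample i; `is_correct(pred, gt)` encodes the
    (assumed) acceptance criterion.
    """
    hits = sum(any(is_correct(p, gt) for p in preds)
               for preds, gt in zip(detections, ground_truth))
    return hits / len(ground_truth)
```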
