**4. Summary and Future Work**

In this work, we demonstrated that a CNN operating on graphs and applied to semantic segmentation can effectively solve the biometric task of ear detection. The best trained model achieved more than 94% accuracy (depending on the considered subset) on the UBEAR dataset, which contains images with varying head orientations and illumination conditions, as well as occlusions and motion artifacts. This result (Table 3) is comparable with the best results reported in the literature on the same dataset. Moreover, the reduced, superpixel-based image representation (hundreds of superpixels instead of millions of pixels) allowed us to construct a relatively simple model with fewer parameters, which processes data faster than classic CNNs.
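As a rough illustration of the scale reduction, the sketch below replaces an actual superpixel algorithm (such as SLIC) with a uniform grid, describing each region by its mean intensity; this is a toy stand-in, not the segmentation method used in the paper:

```python
import numpy as np

def grid_superpixels(img, cell=16):
    """Toy stand-in for a superpixel algorithm: partition the image
    into square cells and describe each cell (graph node) by its
    mean intensity."""
    h, w = img.shape[:2]
    feats = np.array([
        img[r:r + cell, c:c + cell].mean()
        for r in range(0, h, cell)
        for c in range(0, w, cell)
    ])
    return feats

img = np.random.default_rng(1).random((256, 256))
feats = grid_superpixels(img, cell=16)
print(img.size, "->", feats.size)  # prints: 65536 -> 256
```

A network then operates on a few hundred node features instead of tens of thousands (or millions) of pixels, which is what keeps the model small and fast.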

We also showed that the specific network with GMM filters used in this paper can be employed to construct a system with the rotation equivariance property (Figure 2). Such a system can be trained with a limited amount of data, where structures are available in only one orientation and no augmentation is used. The experiments revealed that such a model can potentially achieve results even better than a model trained on structures in all their possible orientations (Table 3). Additional investigation is, however, required to understand and filter out the artifacts that appear.
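The equivariance property itself can be illustrated with a toy example: an isotropic operation (here a plain 3×3 mean filter, not the paper's GMM filters) commutes with 90° rotations of its input, so the response to a rotated image is the rotated response:

```python
import numpy as np

def mean_filter3x3(img):
    """3x3 mean filter with zero padding -- an isotropic operation,
    hence equivariant to 90-degree rotations."""
    p = np.pad(img, 1)
    h, w = img.shape
    return sum(p[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0

x = np.random.default_rng(0).random((8, 8))

# Equivariance check: filter-then-rotate equals rotate-then-filter.
lhs = np.rot90(mean_filter3x3(x))
rhs = mean_filter3x3(np.rot90(x))
assert np.allclose(lhs, rhs)
```

A model built from such orientation-symmetric filters needs to see a structure in only one orientation to respond consistently to all of them.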

The superpixel-based image representation used in this work is not the only option. We are currently exploring alternatives, focusing not only on the character of the elements describing image content but also on their faster generation. This, combined with the short processing time of our networks, should allow constructing systems able to detect ears efficiently on devices where only a CPU is available.

Another interesting research direction is further theoretical analysis of the presented approach. It was shown that even a very coarse representation of image content (a relatively small number of superpixels) allowed satisfactory results. This is not entirely surprising, since humans can perform this task with 100% accuracy on images from the UBEAR dataset; apparently, ear details are not required to detect ears correctly. We suspect that for humans it is enough to identify only the region containing the head. Further research should show whether such a mechanism also takes place in our networks. Moreover, thanks to the reduced image content representation, such an analysis should be easier than the analysis of classic CNNs. Firstly, because we do not operate on a huge set of pixels, the analyzed network is simpler (only a few layers are enough to cover a large visual field). Secondly, because humans do not consciously operate directly on pixels, explaining the algorithm's behavior in terms of easily understandable, small, and homogeneous regions will be more natural and convincing. This potential for explainability can be considered an additional advantage of the presented technique.

**Author Contributions:** Conceptualization, A.T. and P.S.S.; methodology, A.T. and P.S.S.; software, A.T.; validation, A.T. and P.S.S.; formal analysis, A.T.; investigation, A.T.; resources, A.T.; data curation, A.T.; writing, original draft preparation, A.T. and P.S.S.; writing, review and editing, P.S.S.; visualization, A.T.; supervision, P.S.S.; project administration, P.S.S.; funding acquisition, A.T.

**Funding:** This project has been partly funded with support from the National Science Centre, Republic of Poland, Decision Number DEC-2012/05/D/ST6/03091.

**Conflicts of Interest:** The authors declare no conflict of interest.
