**1. Introduction**

A biometric can be defined as a measurable physical characteristic that can be used to identify individuals. Various types of biometrics are used in practical applications: voice recordings, fingerprints, signatures, DNA, hand geometry, iris and face images, or even keystroke dynamics, to mention only a few. A good biometric should have several properties [1]. It should be universal (everyone should possess the characteristic), distinctive (it should allow discriminating between people) and permanent (ideally, it should not change over time). Moreover, the acquisition process should be inexpensive, generally acceptable, and not troublesome (in some applications, it should even be discreet). Finally, an identification system based on such a biometric should be hard to circumvent. The biometrics mentioned above meet these expectations to varying degrees. It is relatively easy to forge a signature, whereas a DNA test is usually hard to falsify. Similarly, collecting face images is, as a rule, treated as a violation of privacy, while taking fingerprints seems natural.

In this work, ear images are considered [2,3]. Several factors make research on ear recognition important and attractive: people can be identified on this basis; the biometric does not change over time; and its collection raises little controversy. Furthermore, technology enables the acquisition of ear images from a distance, which may be of great importance for police investigations and in security systems. This, however, has at least two consequences. Firstly, before an individual can be identified [4,5], the ear must be precisely localized in the image. Secondly, since the acquisition process is not controlled, the orientation of the head can vary significantly. As a result, detection methods are needed that can cope with such transformations of the ear.

Convolutional neural networks (CNNs) have become the state-of-the-art solution for many image analysis problems [6,7]. Their main component is the convolutional layer, designed to apply the same trainable filters (represented by rectangular masks) locally to every part of the image. This extracts the spatial distribution of characteristic image features (so-called feature maps). Such layers, combined with downsampling/upsampling mechanisms and classic fully connected layers, allow solving many typical tasks, among others image classification, object localization and detection, and semantic and instance segmentation. However, despite their unquestionable advantages, convolutional neural networks are not free of drawbacks. First of all, they operate on pixels. Given the dimensions of currently processed images, the structure of such networks must be quite complex (deep architectures) to solve practical problems. This results in a large number of parameters that need to be trained and, consequently, huge training datasets that need to be prepared. The second problem is their sensitivity to object rotations. If rotation invariance/equivariance is required, the complexity of the trained model must be further increased. Both of these problems can be overcome by the geometric deep learning (GDL) technique presented in this paper.
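The two points above, local application of a shared filter and sensitivity to rotation, can be illustrated with a minimal plain-Python sketch. The toy image and the hand-crafted vertical-edge kernel below are illustrative assumptions (in a real CNN the kernel weights are learned, and deep-learning frameworks implement this operation far more efficiently):

```python
def conv2d(image, kernel):
    """Slide the same kernel over every position of the image
    (valid padding, stride 1) and return the feature map."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(iw - kw + 1)
        ]
        for i in range(ih - kh + 1)
    ]

# Toy 4x4 "image" containing a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-crafted vertical-edge detector (illustrative; trained in practice).
kernel = [
    [-1, 1],
    [-1, 1],
]

# The feature map responds strongly exactly where the edge is located.
feature_map = conv2d(image, kernel)   # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]

# Rotating the image by 90 degrees turns the vertical edge into a
# horizontal one, and the same kernel no longer responds at all:
rotated = [list(row) for row in zip(*image[::-1])]
rotated_map = conv2d(rotated, kernel)  # all zeros
```

The second computation makes the rotation problem concrete: the filter is shared across spatial positions (translation equivariance), but nothing relates it to its own rotated versions, so a rotated input pattern simply goes undetected unless extra filters or augmented data are added.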
