## *2.1. Dataset*

There are several publicly available benchmark datasets dedicated to ear biometric tasks; they differ, however, significantly. The CP dataset [33] and the IITD dataset [34,35] are very similar: both contain grayscale, tightly cropped, and aligned images. The AMI dataset [36] and the WPUT dataset [37] comprise color images showing ears together with surrounding head fragments. In all these sets, images were acquired in controlled, laboratory conditions, and they can be used for training and evaluation of ear identification systems. The last two sets described below were prepared differently. The UBEAR dataset [32] is composed of grayscale profile images taken from video sequences (selected frames). Images in the AWE dataset [38] were collected from the web. These two sets are better suited for ear detection problems, as they also contain original, uncropped pictures. It should also be emphasized that, except for the first two, all these datasets contain, to varying extents, additional difficulties. To mention only a few: images were acquired under different illumination conditions and against different backgrounds, ears were occluded, heads were rotated (leading to ear transformations), etc.

The UBEAR and AWE datasets, together with sets dedicated to other problems (e.g., face recognition), are exploited in the ear detection literature. In this work, only the UBEAR dataset was used in all the experiments. There are several reasons for that choice. First of all, we did not have access to the original, uncropped AWE images. Secondly, UBEAR corresponded to our initial idea of an ear detection system in which people could be identified discreetly from video sequences. Thirdly, this set is relatively large: it contains 4429 images taken from 126 persons. Fourthly, binary masks indicating precise ear positions are available in this dataset, which is rare in these kinds of datasets (usually only bounding boxes are annotated). What is more, it contains information about head poses, which allows identifying images with a specific head orientation. Five poses were identified, each assigned a unique letter: M means that the person was stepping ahead (normal head orientation), whereas U, D, O, and T indicate that the head was rotated upwards, downwards, outwards, and towards, respectively. Finally, it is quite a challenging dataset for analysis: not only are all the above-mentioned difficulties present, but motion artifacts can also be observed. Sample images from the UBEAR dataset are presented in Figure 3.

## *2.2. Method*

In this work, the existing GDL method presented in [24] was further developed. That approach defines convolutional filters using a Gaussian mixture model (GMM) in a pseudo-coordinate space. Assuming that the nodes of the input and output graphs are described with vectors of *N* and *M* dimensions (channels), respectively, the operation of a single convolutional layer *φ* can be expressed in the following way (Figure 4):

$$h^{m}(s) = \Psi \left( b^{m} + \sum_{n=1}^{N} \sum_{t \in \mathcal{N}(s)} \phi^{n,m}(\mathbf{u}(s,t))\, f^{n}(t) \right) \tag{1}$$

where:

$$\phi^{n,m}(\mathbf{u}) = \sum_{j=1}^{J} g_{j}^{n,m} \exp\left(-\frac{1}{2} (\mathbf{u} - \boldsymbol{\mu}_{j}^{n,m})^{T} (\mathbf{K}_{j}^{n,m})^{-1} (\mathbf{u} - \boldsymbol{\mu}_{j}^{n,m})\right) \tag{2}$$

and *m* = 1, ... , *M*. In the above equations, *s* and *t* are node indices, *f* and *h* represent the feature vectors of the input and output graphs' nodes, respectively, *J* denotes the number of Gaussians, and 𝒩(*s*) is the neighborhood function identifying the nodes adjacent to *s*. The mapping **u** calculates the pseudo-coordinates of node *t* relative to a given node *s*; these coordinates are *d*-dimensional vectors. Finally, Ψ is an activation function applied element-wise to every graph node, and *b* represents an additional, optional bias. Every convolutional layer defined in this way contains *M* groups of *N* filters *ϕ*. The trainable parameters of those filters are: real numbers *g*, vectors *μ* of size *d*, and diagonal *d* × *d* matrices **K** (only *d* non-zero elements). This gives *J*(2*d* + 1) parameters per filter and *MN*(*J*(2*d* + 1) + 1) in the whole layer.
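To make the filter definition concrete, the following is a minimal PyTorch sketch of a single filter from Equation (2), evaluated for all edges of a graph at once. The function name `gmm_filter` and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import torch

def gmm_filter(u, g, mu, k_diag):
    """Evaluate one GMM filter (Equation (2)) at pseudo-coordinates u.

    u:      (E, d)  pseudo-coordinates u(s, t), one row per graph edge
    g:      (J,)    mixture weights g_j
    mu:     (J, d)  Gaussian means mu_j
    k_diag: (J, d)  diagonals of the (diagonal) covariance matrices K_j
    Returns (E,) filter responses, one per edge.
    """
    diff = u.unsqueeze(1) - mu.unsqueeze(0)             # (E, J, d)
    maha = (diff ** 2 / k_diag.unsqueeze(0)).sum(-1)    # Mahalanobis terms
    return (g * torch.exp(-0.5 * maha)).sum(-1)         # mix the J Gaussians

# Trainable parameters per filter: J * (2d + 1), as stated above.
J, d = 4, 2
g, mu, k = torch.randn(J), torch.randn(J, d), torch.rand(J, d) + 0.1
u = torch.randn(10, d)                                  # 10 edges
print(gmm_filter(u, g, mu, k).shape)                    # torch.Size([10])
```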

**Figure 3.** Sample images from the UBEAR dataset (pose M). Their content was described using superpixels with different average areas *A* of a single superpixel (here and throughout this work, the original images are deliberately not shown in order to protect the identity of the depicted people): (**a**) original image, (**b**) binary mask with ear localization. The color assigned to every superpixel is the average color of the pixels it covers.
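For reference, a reduced representation like the one shown in Figure 3 can be produced along the following lines. The use of scikit-image's SLIC algorithm and the file name `sample.png` are assumptions for illustration; this section does not specify which superpixel algorithm was applied.

```python
from skimage import img_as_float, io
from skimage.color import label2rgb
from skimage.segmentation import slic

image = img_as_float(io.imread('sample.png'))   # hypothetical profile image
h, w = image.shape[:2]
A = 500                                         # target average superpixel area

# Partition the image so that a single superpixel covers about A pixels.
segments = slic(image, n_segments=(h * w) // A, start_label=0)

# Replace every superpixel with the average color of the pixels it covers,
# as in Figure 3.
reduced = label2rgb(segments, image, kind='avg')
```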

The above formulation differs slightly from the original one presented in [24], where it was not clear whether every pair of input and output channels had its own fully trainable filter. In our experiments, we used the PyTorch Geometric library [39], in whose earlier versions some of the filter parameters were shared. The presented extension was added to the library by the authors of this work and is available in PyTorch Geometric starting from version 1.3.1.
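In PyTorch Geometric, such a layer corresponds to `GMMConv`; setting `separate_gaussians=True` enables the extension mentioned above, so that every pair of input and output channels gets its own fully trainable GMM. The sizes below are arbitrary illustration values, not those used in the experiments.

```python
import torch
from torch_geometric.nn import GMMConv

# N = 3 input channels, M = 2 output channels, d = 2-dimensional
# pseudo-coordinates, J = 4 Gaussians per filter (kernel_size).
conv = GMMConv(in_channels=3, out_channels=2, dim=2, kernel_size=4,
               separate_gaussians=True)

x = torch.randn(5, 3)                           # features f of 5 graph nodes
edge_index = torch.tensor([[0, 1, 2, 3, 4],     # source nodes t
                           [1, 2, 3, 4, 0]])    # target nodes s
edge_attr = torch.rand(5, 2)                    # pseudo-coordinates u(s, t)
h = conv(x, edge_index, edge_attr)              # output features, shape (5, 2)
```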

To construct a network that operates on graphs and is useful for semantic segmentation tasks, we must ensure that the input and output graphs have the same size and structure. In classic CNNs, additional downsampling/upsampling blocks are used to achieve that goal without an excessive growth in the number of parameters. Here, thanks to the reduced representation of the image content, it was sufficient to consider only a sequential composition of the above layers:

$$h = \Phi(f) = (\varphi^{L} \circ \dots \circ \varphi^{1})(f) \tag{3}$$

where *L* denotes the number of layers. Naturally, the number of input dimensions *N* and output dimensions *M* (as long as they are consistent across successive layers), as well as the activation functions Ψ, may vary between layers.
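A sketch of such a composition Φ built from `GMMConv` layers is given below. The channel sizes, ReLU activations, and final sigmoid (producing a per-node ear/background score) are placeholder choices for illustration, not the exact architecture used in the experiments.

```python
import torch
from torch_geometric.nn import GMMConv

class GMMNet(torch.nn.Module):
    """Sequential composition of GMM convolutional layers, Equation (3)."""

    def __init__(self, channels=(3, 16, 16, 1), dim=2, kernel_size=4):
        super().__init__()
        # Consecutive layers must agree on their channel counts
        # (the M of one layer equals the N of the next).
        self.layers = torch.nn.ModuleList(
            GMMConv(n, m, dim=dim, kernel_size=kernel_size)
            for n, m in zip(channels[:-1], channels[1:]))

    def forward(self, x, edge_index, edge_attr):
        # The graph structure is never modified, so the output graph has
        # exactly the same nodes and edges as the input one.
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x, edge_index, edge_attr))
        return torch.sigmoid(self.layers[-1](x, edge_index, edge_attr))
```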

**Figure 4.** The processing scheme of a single layer *φ*. Here, the input and output graphs are described with vectors of *N* = 3 and *M* = 2 dimensions, respectively. For illustration purposes, every dimension (channel) is shown separately, but of course, the graph structure is always the same. It can be observed that every pair of input *f<sup>n</sup>* and output *h<sup>m</sup>* channels has its own fully trainable GMM filter *ϕ<sup>n,m</sup>*.

Every filter *ϕ* is described by the corresponding GMM in the pseudo-coordinate space defined by the mapping **u**. When Cartesian coordinates are used in the image plane (*d* = 2), the GMM can be rotated around the origin of the coordinate system (0, 0) by an angle *θ*, resulting in a new filter *ϕ<sup>θ</sup>*. If the original filter *ϕ* detects some specific node configuration in a graph, the filter *ϕ<sup>θ</sup>* should have a high response for those nodes whose neighborhoods exhibit the same but rotated characteristics. Consequently, if all the filters in the convolutional layers *φ<sup>θ</sup>* are rotated, the whole network Φ*<sup>θ</sup>* should possess the same property. To confirm this hypothesis, a verification experiment, described in the next subsection, was proposed. This confirmation is crucial, as it facilitates the construction of a system possessing the rotation equivariance property shown in Figure 2: we can train a network Φ<sup>0</sup> capable of recognizing ears for the basic head pose (a smaller training set is required) and, after successive rotations of its filters, use it to detect rotated ears as well.
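Note that Φ*<sup>θ</sup>* need not be obtained by modifying the trained parameters at all: rotating every GMM filter by *θ* (replacing *μ<sub>j</sub>* with R*<sub>θ</sub>μ<sub>j</sub>* and **K**<sub>*j*</sub> with R*<sub>θ</sub>***K**<sub>*j*</sub>R*<sub>θ</sub><sup>T</sup>*, which is generally no longer diagonal) is equivalent to evaluating the original network on pseudo-coordinates rotated by −*θ*, since *ϕ<sup>θ</sup>*(**u**) = *ϕ*(R<sub>−*θ*</sub>**u**). The sketch below follows this simpler route; it is our illustration, not necessarily the authors' implementation.

```python
import math
import torch

def rotate_pseudo_coordinates(edge_attr, theta):
    """Rotate 2D pseudo-coordinates u(s, t) by -theta.

    Feeding these coordinates to the trained network Phi^0 realizes
    Phi^theta, because phi^theta(u) = phi(R(-theta) u) for every filter.
    """
    c, s = math.cos(theta), math.sin(theta)
    rot = edge_attr.new_tensor([[c, s], [-s, c]])  # R(-theta)
    return edge_attr @ rot.t()                     # applied row-wise

# Hypothetical usage: detect ears on heads rotated by 30 degrees with a
# network trained only on the basic pose M.
# h = net(x, edge_index, rotate_pseudo_coordinates(edge_attr, math.radians(30)))
```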
