## *3.1. Assumptions*

To apply the proposed approach, the content of every image in the UBEAR dataset had to be represented as a graph. For that purpose, the images and the binary masks with precise ear localization were first scaled down to 0.25 of their original size. Next, superpixel detection was performed with the SLIC algorithm [26]. The expected number of superpixels, which is a parameter of the SLIC algorithm, was determined by their expected average area *A*. Two configurations were considered, with *A* = 256 and *A* = 128, leading to around 300 and 600 superpixels per image, respectively.
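The relation between the average superpixel area *A* and the expected superpixel count can be sketched as below. This is an illustrative helper, not part of the original method; the assumed original resolution of 1280 × 960 pixels is a hypothetical example, and the 0.25 factor is taken here to apply per dimension.

```python
def expected_superpixels(height, width, A, scale=0.25):
    """Expected number of SLIC superpixels for a down-scaled image.

    `A` is the expected average superpixel area in pixels; the image
    is scaled by `scale` along each dimension before segmentation.
    """
    scaled_area = round(height * scale) * round(width * scale)
    return round(scaled_area / A)

# Assuming, for illustration, an original resolution of 960 x 1280:
print(expected_superpixels(960, 1280, A=256))  # around 300
print(expected_superpixels(960, 1280, A=128))  # around 600
```

The returned value would be passed to SLIC as its `n_segments` parameter.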

Having generated the superpixels, two graphs were created: an input graph and an expected (output) graph. The nodes of both graphs corresponded to superpixels. In the input graph, the feature vector assigned to a node contained the average intensity of the image pixels covered by the given superpixel (Figure 3a,b). In the expected output graph, it was the average intensity of the pixels taken from the binary masks, multiplied by a scaling constant *W* > 0. It should be noted that in the latter case, the values assigned to nodes need not be equal to either zero or *W*, as the borders of superpixels need not coincide with the borders of the ear region (Figure 3c,d). Two values of the scaling constant were considered: *W* = 1 and *W* = 100. The constant was introduced based on our earlier experience with classic CNN applications; it allowed the network to avoid, during training, local minima with all responses equal to zero.
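The node features described above can be sketched as follows. This is a minimal illustration assuming a superpixel label map with consecutive integer labels and a binary mask with values in [0, 1]; the function names are ours, not the authors'.

```python
import numpy as np

def node_features(labels, image):
    # Mean intensity of the image pixels covered by each superpixel.
    sums = np.bincount(labels.ravel(), weights=image.ravel().astype(float))
    counts = np.bincount(labels.ravel())
    return sums / counts

def node_targets(labels, mask, W=100.0):
    # Expected output: per-superpixel mean of the binary mask,
    # multiplied by the scaling constant W.  Values strictly between
    # 0 and W appear where a superpixel straddles the ear border.
    return node_features(labels, mask) * W
```

For a superpixel fully inside the ear region the target equals *W*; for one fully outside, it equals zero.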

Nodes in the considered graphs must be connected with directed edges. To determine which pairs of nodes should be connected, first, the adjacency of all superpixels was examined. Two nodes were connected with an edge if there existed a path (in this context, a sequence of superpixels) of length shorter than or equal to a given number *D* connecting the corresponding superpixels. Here, two configurations were analyzed as well, with *D* = 1 and *D* = 2 (Figure 8). Selecting a higher value of *D* increases the size of the visual field, i.e., the number of input nodes that influence a single output node. Self-loops were allowed so that a node could also influence the output assigned to itself.
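The edge construction can be sketched as below: superpixel adjacency is read off the label map, and nodes are connected when a path of length at most *D* exists between the corresponding superpixels, with self-loops added. This is our illustrative reconstruction, assuming consecutive integer labels.

```python
import numpy as np

def superpixel_adjacency(labels):
    # Two superpixels are adjacent when they share a pixel border.
    n = labels.max() + 1
    A = np.zeros((n, n), dtype=bool)
    h = labels[:, :-1] != labels[:, 1:]          # horizontal neighbours
    A[labels[:, :-1][h], labels[:, 1:][h]] = True
    v = labels[:-1, :] != labels[1:, :]          # vertical neighbours
    A[labels[:-1, :][v], labels[1:, :][v]] = True
    return A | A.T

def edges_within(labels, D):
    # Connect nodes joined by a superpixel path of length <= D;
    # self-loops are kept so a node influences its own output.
    A = superpixel_adjacency(labels)
    reach, power = A.copy(), A.astype(int)
    for _ in range(D - 1):
        power = power @ A.astype(int)
        reach |= power > 0
    np.fill_diagonal(reach, True)
    return reach
```

With *D* = 2, two superpixels separated by one intermediate superpixel become connected, which enlarges the effective visual field.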

After a series of initial trials, a network architecture with *L* = 4 layers *φ* was found to be optimal. The numbers of filter groups in those layers, and hence the numbers of output graphs, were equal to 20, 10, 5, and 1, respectively. The number of filters in a given group corresponded to the number of layer inputs: in the first layer, it was one, and in the subsequent layers, it depended on the output of the previous layer. In all layers except the last one, the ReLU activation function was used; the last layer was assigned an identity activation function. The number of GMM components in all filters *ϕ* was equal to *J* = 4. As before, the MSE loss and the Adam optimizer were used during training. This time, however, a smaller learning rate, equal to 10<sup>−4</sup>, was considered. For weight initialization, the Glorot scheme was used [40].
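The resulting filter counts per layer can be tallied as follows. This is bookkeeping only, not an implementation of the GMM filters themselves; the function name is ours.

```python
def filters_per_layer(groups=(20, 10, 5, 1), n_inputs=1):
    # Each layer has `g` filter groups (one per output graph); every
    # group contains one filter per layer input, so the layer holds
    # g * n_inputs filters.  The group count becomes the next layer's
    # input count.
    counts = []
    for g in groups:
        counts.append(g * n_inputs)
        n_inputs = g
    return counts

print(filters_per_layer())  # [20, 200, 50, 5]
```

Under this reading, the four layers contain 20, 200, 50, and 5 filters, respectively.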

To detect ears based on the output of the network, simply the node (superpixel) with the highest response was sought (Figure 9a,b). To evaluate whether this detection was correct, it was checked whether the superpixel (its centroid) lay inside the bounding box surrounding the ear region. That rectangle was found using the original binary masks provided in the UBEAR dataset and was slightly enlarged to account for the size of the superpixels (Figure 9c,d).
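The detection and evaluation step can be sketched as below, assuming the network responses are given per node and the bounding box is already enlarged; the function name and the (y0, x0, y1, x1) box convention are our assumptions.

```python
import numpy as np

def detection_correct(labels, response, bbox):
    # Pick the superpixel with the highest network response and test
    # whether its centroid lies inside the (enlarged) ear bounding box.
    best = int(np.argmax(response))
    ys, xs = np.nonzero(labels == best)
    cy, cx = ys.mean(), xs.mean()
    y0, x0, y1, x1 = bbox
    return y0 <= cy <= y1 and x0 <= cx <= x1
```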

In all the experiments, images from the UBEAR dataset were split into three sets: training D*TR*, validation D*VA*, and test D*TE*. The split was made based on person identifiers, i.e., images of the same person were always assigned to the same set. This allowed checking whether the trained models were able to generalize the acquired knowledge and respond correctly for new people. The validation set was used to prevent overfitting through the selection of the optimal model from among the models created in the training phase. The numbers of people in the discussed sets were 75, 25, and 26, respectively. Only left ears were considered, since the proposed approach should work in the same way for right ears. What is more, a mirror transformation of the GMM filters should also allow a network trained on left ears to work for right ears.
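A person-disjoint split of this kind can be sketched as follows; the function, its arguments, and the deterministic ordering of people are illustrative assumptions (the actual assignment procedure is not detailed in the text).

```python
def split_by_person(image_ids, person_of, n_train=75, n_valid=25):
    # Person-disjoint split: all images of one person land in the same
    # set, so the test measures generalization to unseen people.
    people = sorted({person_of[i] for i in image_ids})
    train_p = set(people[:n_train])
    valid_p = set(people[n_train:n_train + n_valid])
    split = {"TR": [], "VA": [], "TE": []}
    for i in image_ids:
        p = person_of[i]
        key = "TR" if p in train_p else "VA" if p in valid_p else "TE"
        split[key].append(i)
    return split
```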

**Figure 8.** Graphs generated for the images shown in Figure 3 for different numbers of superpixels and different node neighborhoods (parameters *A* and *D*, respectively): (**a**,**b**) full graph with all edges; (**c**,**e**) superpixels with the local neighborhood of the selected node; (**d**,**f**) the selected graph node with its neighborhood. As the image scale was preserved, it can be observed that when *A* was smaller, a smaller image region was processed by the CNN. Consequently, not only *D* but also *A* influenced the size of the effective visual field.

**Figure 9.** Detection results for the images shown in Figure 3: (**a**,**b**) network output (scaled back using *W* and clipped to the [0, 255] interval); (**c**,**d**) detection visualization (the green rectangle represents the expected bounding box, and the red dot indicates the selected superpixel; the red line shows the orientation of the filters in the network, with the vertical line corresponding to the basic orientation).

## *3.2. Experiment I*

In the first experiment, the CNN was trained on a UBEAR subset D*TR*<sup>M</sup> containing only heads in their standard orientation (pose M in the UBEAR dataset). Every combination of the parameters *A*, *D*, and *W* was tested to select the optimal one. The obtained results and the cardinalities of the considered training, validation, and test sets are gathered in Table 2. In all cases, the results were satisfactory, with a correct detection rate greater than 90%. It is worth noting that in three cases, the detection accuracy for the training set was equal to 100%. These models, however, seemed to be slightly overfitted, and the configuration with *A* = 256, *D* = 2, and *W* = 100 was indicated as the best one.

A closer analysis of the results revealed that, surprisingly, the representation with *A* = 256 was not worse than the representation with a bigger number of superpixels obtained when *A* = 128. It could be expected that in the latter case, when more details are given, the accuracy would increase. The explanation can be the fact that in both cases, the same network architecture was used, and consequently, for smaller *A*, the effective visual field was also smaller. What is more, in both cases, the same number of training iterations, 2000, was used, and perhaps more details required longer training. Nevertheless, since for *A* = 256 the results were satisfactory and, thanks to the simpler representation, graph processing was faster, this seemed to be a reasonable choice for the discussed problem. In the case of the other parameters, the configurations with *D* = 2 and *W* = 100 seemed to lead to models with better generalization abilities: for them, the detection accuracy was higher on the validation and test sets. Those observations were also confirmed by the training characteristics depicted in Figure 10. For the optimal values of the parameters (Figure 10a), the best model, i.e., the one with the smallest validation error, could easily be selected in an early stage of training. In the other cases (Figure 10b), both errors seemed to decrease slowly, and perhaps further training could provide a better solution. The random initialization of the network weights is also not without significance here.

**Table 2.** The ear detection accuracy of the networks trained using the D*TR*<sup>M</sup> set for different combinations of the parameters *A*, *D*, and *W*. Additionally, the last column contains the cardinality of every considered set. It is worth noticing that networks trained using only samples with the basic head orientation (pose M) can successfully detect ears not only for other people, but also for different orientations (poses U and D). Naturally, in the latter case, the detection accuracy was significantly smaller.


In Figure 11, additional samples of correct and wrong detections of the optimal network are presented. It can be observed that correct detections were possible under different illumination conditions and with different backgrounds, as well as in situations where additional objects were located in the ear neighborhood. The typical reasons for detection mistakes were: questionable annotations (head orientation slightly different than expected), the position of the ear close to the image border, and highly occluded ears.

To gain better insight into the processing of a single graph, selected outputs of the convolutional layers *φ* are presented in Figure 12. In classic CNNs, the first layers are usually responsible for the detection of some local image characteristics. Although here the interpretation was not that obvious, that kind of behavior could also be observed to a certain extent. In Figure 12a, for example, the detection of vertical edges seemed to take place.

**Figure 10.** Training characteristics of two models. The plots present the errors for the training D*TR*<sup>M</sup> and validation D*VA*<sup>M</sup> sets, calculated every 50 epochs. On the left, the training run for the best parameter combination is depicted; it can be observed that to select the optimal network, the model generated after 600 epochs should be chosen. On the right, another combination of parameters was used. This time, however, no epoch can be indicated at which model overfitting seemed to take place; probably, the training should be continued further.

**Figure 11.** Examples of detections of the best trained network for images in D*VA*<sup>M</sup> and D*TE*<sup>M</sup>: (**a**–**c**) correct detections; (**d**–**f**) wrong detections. The network is able to respond correctly under different illumination conditions and for different backgrounds. Problems can be observed when ears are not fully visible (image border or occlusion) and when the head orientation is different than expected (wrong annotation). The convention of the result presentation was described in the caption of Figure 9.

**Figure 12.** Selected raw outputs of the convolutional layers for the image shown in Figure 3a and the optimal network. In the first two layers (*φ*<sup>1</sup> and *φ*<sup>2</sup>), the person's outline can still be observed, so probably some local image characteristics are extracted here; this behavior is also typical for classic CNNs. In the final layer *φ*<sup>4</sup>, the output allows finding the ear position. For visualization purposes, every output was scaled separately (the outputs cannot be compared with each other). Red denotes negative and green positive values.
