*3.3. 3-CNN Inference*

Training a single neural network and expecting it to reliably tell ears apart from background noise in real-world imagery is quite a leap of faith.

In practice, a neural network of this type will be quite capable of correctly recognizing the large majority of ear-shaped objects presented to it. Thus, when tested against a set of cut-out ear images specifically prepared for such a recognition task, its true positive inference performance will be quite good. However, it will be prone to many mistakes when presented with background images or noise. The network is trained with a BG class to help it learn the difference between an ear and background noise, but no matter how the training for this class is prepared, a CNN will always be prone to false detections simply due to the internal functionality of neural networks. There will always be patterns or combinations of features, easily found in natural imagery, that randomly trigger internal neural paths and thus produce a large false positive rate as well: a type of artificial pareidolia. For real-world image detection over large input frames, this results in a large number of false hits. Table 4 describes this effect in more detail. A single CNN will very often detect the ear correctly (Ears Detected metric), both in close-up images such as those of the AMI dataset (99.70%) and in the more challenging full image frames of the UBEAR dataset (93.90%). However, this metric disregards the effect of false positives. The F1 metric is useful to uncover the great performance disparity that occurs in reality: while in the AMI dataset the F1 value remains high (99.86%), in the UBEAR dataset it drops abysmally (41.46%) due to the very large number of false positives introduced.
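The collapse of F1 under heavy false positives can be illustrated with a short sketch; the raw detection counts below are hypothetical, chosen only to mimic the high-recall/low-precision regime described above:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Standard F1: harmonic mean of precision and recall, from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Near-perfect precision and recall (AMI-like conditions): F1 stays high.
print(f1_score(997, 1, 3))

# Same high recall, but a flood of background false positives
# (UBEAR-like conditions): F1 collapses even though most ears are found.
print(f1_score(939, 2500, 61))
```

This is why a recall-only metric such as Ears Detected can look excellent while the detector is unusable in practice.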


**Table 4.** Single vs. 3-Convolutional Neural Network (CNN) inference performance, showing how both systems vary greatly when tested against different data types.

This problem can usually be addressed by creating ensemble systems consisting of multiple classifiers, each one differing in a specific manner. They all analyze the same data input, and their different outputs are then combined to create a final result whose accuracy will usually be higher than that of any single classifier running by itself [4].

We apply a variation on this idea, in that we do not process all classifiers in the ensemble with the exact same input data, but rather we present different data to each component of the ensemble. Therefore, each of the classifiers must then be trained to specialize in the kind of data which will be presented to it. The different data inputs are carefully constructed so that each one carries meaning specific to that component according to its own specialization.

The main idea, then, is to feed three different images to three neural networks, each image corresponding to the same region being analyzed but cropped at a different scale. Figure 3 depicts the three scales ingested by the triple classifier ensemble. We label the three networks used to analyze them S, M, and L, after their corresponding crop sizes.

**Figure 3.** The three scales that are used for every data point in the training dataset.

The purpose of the three scales is to train specialized networks for the specific purposes of (i) recognizing the tubular features of the inner ear, (ii) framing the correct coordinates of the ear, and (iii) inferring ear context within the surrounding head region. Training a network on any single one of these scales would specialize it in that particular data, but the network would remain oblivious to other natural image data that share a similar structure without belonging to a true ear, leading it to produce a large number of false positives that would degrade the overall detection accuracy. The three networks working together as a committee of classifiers, however, produce a much more robust result that is far more resilient against noise: since a true positive requires the activation of all three networks, contextual information is integrated into the system.
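The multi-scale cropping can be sketched as follows; the scale factors and crop sizes here are illustrative assumptions, not the paper's exact values:

```python
import numpy as np

def three_scale_crops(frame, cx, cy, base):
    """Crop the same region of `frame` at three scales around center (cx, cy).

    `base` is the side length of the tight (S) crop; the M and L scale
    factors below are illustrative assumptions, not the paper's values.
    """
    crops = {}
    for label, factor in (("S", 1.0), ("M", 1.5), ("L", 2.5)):
        half = int(base * factor / 2)
        # Clamp to the frame so crops near the border remain valid.
        y0, y1 = max(0, cy - half), min(frame.shape[0], cy + half)
        x0, x1 = max(0, cx - half), min(frame.shape[1], cx + half)
        crops[label] = frame[y0:y1, x0:x1]
    return crops

frame = np.zeros((480, 640), dtype=np.uint8)
crops = three_scale_crops(frame, cx=320, cy=240, base=64)
print({k: v.shape for k, v in crops.items()})
```

Each candidate region thus yields three concentric views of increasing context, one per network.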

*Processes* **2019**, *7*, 457

Each of the three neural networks produces three output values, which correspond to the likelihood of each target class having been perceived in that network's input. We denote the output values as $O_A^K$, where $A \in \{S, M, L\}$ represents the network index denoted by its size, and $K \in \{LE, RE, BG\}$ represents the output class index of each network, for each of the possible detection outcomes. Each of these outputs lies in the $[-1, +1]$ range, as the neural networks have been trained with those ideal values.

To combine the outputs of all three networks into a unified ensemble, we multiply each class output by the corresponding outputs across all three networks, after each one has been linearly rectified. The final outputs of the ensemble are defined by:

$$O\_F^{LE} = [O\_S^{LE}]^+ \cdot [O\_M^{LE}]^+ \cdot [O\_L^{LE}]^+ \tag{2}$$

$$O\_F^{RE} = [O\_S^{RE}]^+ \cdot [O\_M^{RE}]^+ \cdot [O\_L^{RE}]^+ \tag{3}$$

$$O\_F^{BG} = [O\_S^{BG}]^+ \cdot [O\_M^{BG}]^+ \cdot [O\_L^{BG}]^+ \tag{4}$$

where $[x]^+ \equiv \max(0, x)$ is the linear rectification operation. By passing through only the positive values of each interim output, we prevent two negative values from multiplying into a spuriously positive result; instead, any negative interim output zeroes the corresponding final output. Figure 4 depicts the process visually.

**Figure 4.** Data flow in the inference process of 3-CNN detections.

The net effect of this process, then, is to have all three networks work in tandem, where only the regions for which all three networks are in full agreement will survive. Furthermore, the final output is weighted by the individual network certainties, so regions where all three networks produce high likelihood outputs will outweigh regions where the output distribution is more disparate.
