*Article* **Unsupervised Cyclic Siamese Networks Automating Cell Imagery Analysis**

**Dominik Stallmann \* and Barbara Hammer \***

Faculty of Technology, University of Bielefeld, Universitätsstraße 25, 33615 Bielefeld, Germany

**\*** Correspondence: dstallmann@techfak.uni-bielefeld.de (D.S.); bhammer@techfak.uni-bielefeld.de (B.H.)

**Abstract:** Novel neural network models that can handle complex tasks with fewer examples than before are being developed for a wide range of applications. In some fields, even the creation of a few labels is a laborious task and impractical, especially for data that require more than a few seconds to generate each label. In the biotechnological domain, cell cultivation experiments are usually done by varying the circumstances of the experiments, seldom in such a way that hand-labeled data of one experiment can be reused in others. In this field, exact cell counts are required for analysis, and even by modern standards, semi-supervised models typically need hundreds of labels to achieve acceptable accuracy on this task, while classical image processing yields unsatisfactory results. We investigate whether an unsupervised learning scheme is able to accomplish this task without manual labeling of the given data. We present a VAE-based Siamese architecture that is expanded in a cyclic fashion to allow the use of labeled synthetic data. In particular, we focus on generating pseudonatural images from synthetic images for which the target variable is known, to mimic the existence of labeled natural data. We show that this learning scheme provides reliable estimates for multiple microscopy technologies and for unseen data sets without manual labeling. We provide the source code as well as the data we use. The code package is open source and free to use (MIT licensed).

**Keywords:** Siamese networks; synthetic data; cyclic learning; unsupervised learning; deep learning; data augmentation; single cell cultivation; bioimage analysis

#### **1. Introduction**

Single cell cultivation is one of the most important steps in single cell analysis [1] and represents an essential means to better understand cell functionality from cellular and subcellular perspectives for diagnosis and therapy; microfluidic devices constitute fast-rising systems for efficient single cell cultivation. However, the analysis of microfluidic single cell cultivation (MSCC) microscopic images is usually performed manually or supported by technological aiding systems, and it requires the work of human experts because of the high spatial and temporal resolution and a variety of visual characteristics that make automation difficult. Flexible image processing pipelines have proven their relevance for certain setups, but they are limited to specific scenarios and partially interactive, as the fully automated analysis of non-adhesive cells in the presence of the varying light conditions and various artifacts of microscopic images is challenging [2].

In recent years, the potential of deep convolutional architectures for automated and flexible image analysis has been demonstrated in this area, but training procedures for current deep architectures rely, at least partially, on manually labeled training data [3,4]. A manual procedure is not practical in many applications, creating a demand for effective, fully automated solutions [5]. Therefore, the particular focus of this work is to eliminate the need for human expert annotations entirely.

**Citation:** Stallmann, D.; Hammer, B. Unsupervised Cyclic Siamese Networks Automating Cell Imagery Analysis. *Algorithms* **2023**, *16*, 205. https://doi.org/10.3390/a16040205

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 27 February 2023 Revised: 29 March 2023 Accepted: 4 April 2023 Published: 12 April 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Henceforth, we will focus on a relevant generic learning task for MSCC image analysis: the cell count is used as the target variable, which has to be estimated reliably at any point in time of the experiment. It is chosen mainly for two reasons: (1) it allows for the extrapolation of other important attributes of the experiment, such as the growth delta over the last few time segments, as well as the overall growth rate, and (2) as a regression task, it is known to be especially difficult to estimate accurately with unsupervised training methods, i.e., it can be inferred that tasks that are generally considered simpler, such as classification or segmentation, can also be solved with the methodology presented in Section 3.

In the following, we will aim for a solution that does not rely on any manually generated labels. Instead, we will rely on automatically generated artificial labels, i.e., use "fully automated supervision". To prevent misunderstandings: unsupervised deep learning would, in its strictest definition, not be able to solve the addressed task, since the lack of labels means that the regression loss cannot be calculated. Therefore, for this task, we use "unsupervised learning" to mean the absence of manually curated labels for the experimental data. A computable loss on the target variable is needed to achieve actual training, which, in our case, can be defined explicitly and efficiently, based on the available symbolic semantics for auxiliary synthetic data.

Even self-training architectures such as Generative Adversarial Networks (GAN) and Variational Autoencoders (VAE) can only generate losses on predictions and reconstructions of the data, not on the target variable. The Siamese-like architecture described later will therefore not only train on natural data, created by the biotechnological experiments, but also on a collection of synthetic auxiliary data with automatically generated labels and therefore known ground truth. By training this architecture with a special learning scheme, it is not only possible to perform regression learning on the target variable, but also to achieve accuracy that approaches or, in some cases, exceeds the state of the art (see Section 4).

While our own previous work [4] will serve as a basis for the later comparison of results, we would like to clarify the differences between that work and this one in terms of approaches and goals. The novelty of [4] is state-of-the-art accuracy in the domain of semi-supervised cell counting, achieved by transferring a pre-trained model to another type of microscopy data. Due to optimizations in the transfer process, the architecture presented there also slightly outperformed the previous state of the art. In this work, we instead focus on unsupervised training with the modification of generating pseudosynthetic images from natural images (and vice versa) in order to use the well-trained regressor that is accustomed to synthetic data representations. The earlier work would not be able to achieve meaningful regression for the fully unlabeled natural data used in this work because the loss of the regressor would not be defined for natural data.

Figure 1 shows examples from the MSCC experiments that we address in the following. It can be seen that lab-on-a-chip technology is used and that the data have a number of visual aspects that make them difficult for classical image processing solutions and nonspecialized machine learning models to process. Namely, these are as follows:


**Figure 1.** Samples from the data sets of CHO-K1 suspension growth. Bright-field microscopy image on the right, phase-contrast microscopy image on the left. Smudges on the chip can be seen in the form of faint, small circles within the fluid solution. The scale bars do not appear in the working data.

In this article, we propose a novel training scheme for a Siamese deep learning model that can optimally combine information provided by automatically generated synthetic data and real images such that no manual labeling of natural data is required. The contribution and novelty of this work are as follows:


In the following sections, we first give an overview of the current state of the art in this research field and take a brief look at previous works in this field of application. In Section 3, we address the underlying machine learning challenge and present our deep Siamese network architecture in detail. Then, the details of the proposed learning procedure are explained, and we analyze how the architecture affects it. Thereafter, Section 4 contains the evaluation for real data sets and ablation studies, as well as the comparison to state-of-the-art alternatives and baselines. Lastly, a discussion in Section 5 followed by a conclusion in Section 6 completes the contribution.

#### **2. Related Work**

In the last few years, convolutional deep neural networks have become the state of the art for image processing that does not require human labor and for the majority of other computer vision tasks [7]. Especially for the task of counting in images, solutions have been worked on for over a decade now (see [8]). Applications in the biomedical domain have become common [9] and cell tracking approaches in images have been an ongoing field of study in recent years [10]. However, the optimization of such methods is often time-consuming and remains prone to errors.

Ulman et al. [11] propose a benchmark suite to compare different imaging technologies and extrapolate the strengths and limitations of different approaches to cell tracking, none of which has emerged as a definitive solution to this task, even those involving interactions among bioimage analysis experts [12] or the distributed work of manual labeling [13]. Schmitz et al. [14] underline the demand for mature solutions by showing that the currently used state-of-the-art tools are insufficient for heterogeneity studies of the CHO-K1 mammalian cells that are present in the given data.

In addition, Brent et al. [15] used transfer learning to predict microscope images between different imaging technologies, but without sufficiently accounting for the wide variety of cell images and features. The approach by Falk et al. [16] provides one of the few toolboxes for cell tracking, albeit for adherent rather than suspension cells. It allows for transfer learning based on given models and novel data, whereby data set enrichment technologies limit the number of required samples.

In contrast to adherent cell lines, where previously reported single cell cultivation studies [17,18] have shown promise, we address the more complex scenario of suspension cells with all their visual characteristics listed above, rendering analysis tools for adherent cells deficient. Earlier works have overcome some of the challenges, such as sufficient counting accuracy, by interactive design [19], or by detecting overlapping instances in such imagery [20], but they are not yet sufficient for the unsupervised task at hand. Different contrast and light conditions have been addressed by Chen et al. [21]. The adherence of cells and overlaps have been addressed by Xie et al. [22], but additional visual features complicate the process and reduce the applicability of previous solutions.

Siamese networks have been used for a variety of tasks, as they can help to facilitate few-shot learning or clustering of the data space by generalizing from unlabeled data. This is done in [23] for genome sequencing and in [24] for text data. The presented architectures are, however, specific to their domains and not applicable to image processing.

There are also Siamese networks that do work in the image processing domain, such as [25], but they focus on change detection as a binary segmentation, suitable for tracking single cells but not for the regression task at hand. Ref. [26] uses Siamese networks and data augmentation, similar to our approach, but the training is supervised and addresses a four-class classification task. In [27], similar data augmentation and Siamese networks were used, and the 20-class classification there is closest to the regression task that we address, but the networks used are non-generative CNNs and the data are not used cyclically, rendering it inapplicable to our work.

Furthermore, there are no deep learning models that easily and efficiently solve the task: as shown in [3] by comparing the recent state-of-the-art EfficientNet [28] and classical image processing such as Watershed methods [29], and, as can be seen in our earlier work [4], transfer models such as BigTransfer [30] are not reliably able to generate good cell counts by transferring a pretrained model to this domain.

Deepak Babu et al. [31] achieved acceptable accuracy for the similar regression task of crowd counting; however, the training was semi-supervised. More generalized few-shot and even zero-shot learning has been done by Schönfeld et al. [32] using aligned VAEs, achieving high precision, but only on the few-shot tasks, not the zero-shot ones. In our approach, we fully focus on the integration of synthetic data, whose labels come for free from the semantically meaningful generation process, to avoid any additional manual labeling of natural data for training, rendering even these related results insufficient.

Synthetic data have already been used in [33,34], but for natural scene and text recognition, or computer vision tasks more generally: mostly natural domains where powerful deep generative models can build on massive amounts of publicly available data. In contrast, we are interested in synthetic data that are prone to a reality gap due to the limited availability of natural data. In semi-supervised learning, models are often enriched by easily available unlabeled data that describe the underlying input distribution [35]. Göpfert et al. [36] examined when unlabeled data can improve the learning rate, suggesting that the use of additional unlabeled data, be it synthetic or natural, is beneficial, which is confirmed for our case in Section 4. The impact of variability in auxiliary training data on convolutional networks specifically was tested in [37], but for 3D head reconstruction, which is not intrinsically usable in this domain.

The weight sharing used in our particular learning scheme was used previously to decrease network sizes and improve test and verification performance [38]. In Section 3.3, we show details on the specialized usage of this technique for our architecture.

Lastly, Uniform Manifold Approximation and Projection (UMAP) [39] is used to project the inner state of the network into a two-dimensional representation, giving us a glimpse of the latent representation and insight into how the data are processed. In Section 6, such a UMAP is discussed for interpretation.

#### **3. Methodology**

#### *3.1. Natural Data*

Image data applied in this study were obtained by MSCC of mammalian suspension cells, as introduced before in the literature [40]. The CHO-K1 cells were cultivated in polydimethylsiloxane (PDMS) glass chips. Perfusion of the device constantly provided the cultures with nutrients. An automated inverted microscope performed the live cell imaging, taking images of the relevant positions on-chip every 20 min. The data used in this work are split into two major parts according to the two microscopy technologies, namely bright-field microscopy and phase-contrast microscopy, abbreviated as BF and PC, which were used for the analysis of the architecture. Figure 2 shows example data from both microscopy technologies after the application of the preprocessing described below.

**Figure 2.** Samples from the natural data sets after application of various data enrichment techniques, described below. Phase-contrast technology on the left, bright-field technology on the right. The image resolution equals the working resolution.

Around 10,000 images were taken over the course of the experiments per microscopy technology; then, images of empty and fully filled cell chambers were removed, since, for these, the experiment had not started yet or the outcome of the experiment was already determined, respectively. In total, 2983 BF images and 3944 PC images remained relevant for the machine learning task. Around 20% of the data were labeled by hand exclusively for testing and will from here on be called Nat-L-Te (natural, labeled test data); the remaining 80% stay unlabeled, are used for training, and are called Nat-U-Tr (natural, unlabeled training data). The test data were split in half to obtain a verification data set and to prevent accidental specialized training on the test data over the course of the hyperparameter optimization. During the test data selection process, we ensured that full experimental runs as well as randomly picked images from the various experiment series were part of the test and verification data. Table 1 gives an overview of the different types of data sets used in our work.
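A minimal sketch of such a seeded hold-out split is shown below. The function name and the purely random selection are illustrative; as described above, the actual selection additionally keeps full experimental runs together:

```python
import random

def split_dataset(image_ids, test_frac=0.2, seed=0):
    """Seeded split sketch: ~20% of the images are held out and halved
    into test and verification sets; the remainder is used as unlabeled
    training data."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    n_held = int(len(ids) * test_frac)
    held = ids[:n_held]
    return {
        "train": ids[n_held:],           # Nat-U-Tr (unlabeled)
        "test": held[: n_held // 2],     # Nat-L-Te
        "verify": held[n_held // 2:],    # verification half of Nat-L-Te
    }

splits = split_dataset(range(3944))      # e.g. the PC image count
print({k: len(v) for k, v in splits.items()})
```

The fixed seed makes the split reproducible across runs, mirroring the seed consistency used for the data augmentation described in Section 3.1.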

**Table 1.** Overview of all data sets used. Nat-U-Tr contains natural, unlabeled training data; Nat-L-Te natural, labeled test data; Syn-L-Tr synthetic, labeled training data, and Syn-L-Te contains synthetic, labeled test data. For reasons described in Section 3.2, synthetic data have been generated in a 1:1 ratio to natural data. Nat-PC refers to all natural phase-contrast images (i.e., Nat-L-Te and Nat-U-Tr) and Nat-BF refers to all natural bright-field images, respectively. Syn-PC and Syn-BF denote the groups of training and test data for phase-contrast and bright-field data accordingly, and, lastly, Nat and Syn denote the full set of natural and synthetic data.


We crop and rotate all images to center the cultivation chamber. Further data augmentation beyond this preprocessing is described in Section 3.2. From here on, we place our focus on the larger data set, called Nat-PC: it contains more experimental samples, the biological processes covered are more diverse, and phase-contrast microscopy is more popular. We will nonetheless show that our method also works reliably on the smaller Nat-BF data set, although the variations in cell positions, numbers, and sizes are lower there, and these images are therefore of lower quality in terms of machine learning, similar to what might be the case for entirely different types of cells, such as plant cells.

To make the best use of the limited amount of experimental data, the following enrichment techniques are applied. Flips along both image axes are followed by a random crop up to the edges of the cell chamber that does not crop away cells, except for the entrance tunnel, where precise cell detection is not required. The crop does complicate cell detection, as cells may end up in positions where, without the crop, only the chamber rim and the outside of the chamber would be, but it proved necessary to let cells appear anywhere in the images to ensure uniform detection success that is barely affected by the position of the cells within an image. Then, a random rotation of 90° is performed, and a randomly generated noise map, multiplied by a small weighting factor, is applied to the image to simulate more fluctuation in the cells' visuals, since occasionally there are dead cells in the experimental data that do not change in appearance for multiple images. All augmentations are reapplied to the original data for every epoch of training with seed consistency to ensure reproducibility.
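The augmentation steps above can be sketched roughly as follows. All parameter values, the padding strategy, and the additive application of the noise map are our assumptions for illustration, not the paper's exact settings:

```python
import numpy as np

def augment(img, epoch, noise_weight=0.05, max_crop=8):
    """Seeded augmentation sketch: flips along both axes, a small random
    crop (padded back to the working size), a random multiple-of-90-degree
    rotation, and a weighted random noise map."""
    rng = np.random.default_rng(epoch)          # seed consistency per epoch
    if rng.random() < 0.5:
        img = img[:, ::-1]                      # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]                      # vertical flip
    dy, dx = rng.integers(0, max_crop, size=2)
    img = img[dy:, dx:]                         # crop towards one corner
    img = np.pad(img, ((0, dy), (0, dx)), mode="edge")
    img = np.rot90(img, k=rng.integers(0, 4))   # random 90° rotation
    noise = rng.random(img.shape)
    return np.clip(img + noise_weight * noise, 0.0, 1.0)

out = augment(np.zeros((128, 128)), epoch=0)
print(out.shape)  # (128, 128)
```

Seeding the generator with the epoch number reproduces the same augmentation sequence on every run, as the text requires for reproducibility.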

#### *3.2. Synthetic Data*

We propose a novel learning scheme in Section 3.3.2 that deals with synthetic data with known ground truth (i.e., the cell count) and a Siamese architecture that can abstract from the fact that the auxiliary data are synthetic. In addition to the common data set enrichment, generating proxy data allows us to create a wide variety of synthetic samples, which are inspired by the natural data, but not limited by their amount or variety.

By enriching the training procedure with synthetic data, we eliminate the need for natural labeled data. Synthetic data are easily obtained in this setting because the architecture does not require that the images are rendered realistically in all respects, such as morphological details. The 128 × 128 working resolution of the architecture makes the synthetic data generation undemanding, while maintaining sufficient intricacy of visual features such as overlapping (see Figure 3 left). For the specialized training procedure described below, we do not need to synthetically create images that are indistinguishable from natural ones, unlike current data augmentation schemes such as that proposed in [41]. This would require a considerable amount of engineering [37], i.e., human expert labor, exactly what we aim to mitigate. We rely merely on modeling simple ellipsoidal shapes to embody cells, ignoring details of the texture and the intricate morphology of real suspension cells. We imposed this limitation on ourselves to suggest that the learning procedure presented below should also work with other types of image data and is neither tailor-made for exactly these microscopy technologies nor requires extensive manual work to generate the most realistic synthetic data possible. In Section 3.3, we show that this approach is adequate for training the architecture described there.

**Figure 3.** Examples of synthetic data. Syn-PC imagery on the left, Syn-BF imagery on the right. Backgrounds were generated by averaging over natural, nearly empty chamber images (including smudges) and cells are approximated by simple geometric ellipses, but given some of the intricate visual characteristics of natural cells, such as overlapping and differing luminosity, while factors that explicitly only hinder the architecture, such as cells escaping through the chamber funnels and complex visual features such as the inner organelles of cells, have not been recreated.

We ensured that the distribution of cell counts in the auxiliary data was sufficiently close, but not necessarily identical, to that of the natural data sets. This allows for an unlimited amount of labeled training data, with processing time rather than availability being the only limit on the amount of proxy data. One problem remains, however: how to actually improve the regression performance on natural data. Using a large ratio of synthetic to natural data would entail a separation of the two types of data in the inner representation of the network, resulting in high accuracy on the synthetic data but low accuracy on the natural data (see Section 4). To prevent this separation, two major functionalities are proposed and have been implemented, described in more detail in the following.

The auxiliary data generator is highly adjustable and produces imagery with a given distribution of cells. As background images, we calculate the mean of the first 20% of data from the experimental series, expecting cell counts to be low and cells to be scattered, so that the background has no visible natural cells in it. The generator takes as parameters the overlaps, brightness, and blurriness of the cells' inner organelles as well as their membranes, the contrast with the background, a range of possible cell sizes, counts, and crop values, and the ellipsoidal deformation range. All of these can be chosen by hand within the code package, or the default values can be used. Combined, these operations can imitate most of the intricate features of the real data, such as ongoing cell divisions, by requesting a small overlap along with noisy cell boundaries. Smudges, as in Figure 2, are not included because they are a confounding factor and are assumed to only hinder the training process. The cells have been given a roughly circular shape to approximately match the shape of the natural cells. To generate cells, positions are sampled randomly from the valid space, taking the overlap parameter into account, and are then randomly stretched, deformed, made noisy, and so on according to the chosen parameters; then, brightness fluctuation and Gaussian filters of varying strengths are added to increase the variety of cells in the data. This geometric form can easily be adjusted if natural cells in other data sets have different shape characteristics or when other camera setups produce different ambiences.
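A heavily simplified sketch of this generation principle is given below: random ellipses placed on an averaged background, with the cell count returned as a free label. Deformation, overlap control, per-cell noise, and blur from the full generator are omitted, and all parameter values are illustrative:

```python
import numpy as np

def make_synthetic(background, n_cells, cell_radius=(4, 7), seed=0):
    """Sketch of the synthetic-image generator: ellipsoidal 'cells' are
    placed at random positions on an averaged background image, and the
    cell count is returned as the automatically known label."""
    rng = np.random.default_rng(seed)
    img = background.astype(float).copy()
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(n_cells):
        cy, cx = rng.integers(10, h - 10), rng.integers(10, w - 10)
        ry, rx = rng.integers(*cell_radius, size=2)   # ellipse radii
        mask = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
        img[mask] = 0.8 + 0.2 * rng.random()          # brighter cell body
    return img, n_cells                               # image plus label

img, label = make_synthetic(np.full((128, 128), 0.3), n_cells=5)
```

Because the label is a by-product of the drawing loop, arbitrary amounts of labeled training data can be produced with seed consistency.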

These data are generated fully automatically based on simple algorithmic principles, and, as a baseline, a 1:1 ratio of synthetic to natural data is used, since larger amounts increase the training time almost linearly, while the performance improves only with diminishing returns in our experiments. More details on this are given in Section 4. The imagery is produced algorithmically with seed consistency and can therefore be reproduced, similarly to the data enrichment on the natural data, and can be generated in arbitrary amounts.

#### *3.3. Architecture and Learning Scheme*

#### 3.3.1. Architecture

Our aim is to provide reliable cell counting for the microscopic imaging of suspension cells, and since the experimental data are limited in their amount and without annotations, we assemble a novel learning scheme for the Twin-VAE architecture previously introduced by us to overcome these limitations.

The architecture circumvents the problem of differences in the appearance of auxiliary and real data by separating the data input for training according to its origin, but requires that the model create a tightly coupled joint inner representation to avoid high training losses. This is realized by modifying a Variational Autoencoder (VAE), duplicating the outer layers of the encoder and decoder to account for the two data sets. The weights of the inner layers of both encoder and decoder are therefore shared, as well as the semantic bottleneck in between (see Figure 4). We chose this architecture for the reasons mentioned in Section 2.

The specialized encoders consist of four two-dimensional convolutional layers with kernel sizes of 5 and strides of 2. They are initialized with an orthogonal basis [42]. In between layers, leaky rectified linear units (LReLUs) with a leakiness of 0.2 and a dropout of 0.1 have been added. The channels used for the convolutions in the encoders in order are 32, 64, 128, and 256. The weight-shared encoder contains a single two-dimensional convolutional layer with the same remaining attributes but 512 channels. It is followed by the bottleneck, consisting of three layers of fully connected neurons. The layer sizes are 512, 256, and 512, each with the same dropout as before. The weight-shared decoder therefore also has 512 channels and uses a two-dimensional transposed convolutional operator layer with identical strides and kernel sizes as above, followed by a batch normalization over a four-dimensional input and another LReLU with the same leakiness. The decoders designed for specific data each consist of a total of five layers with kernel sizes 5, 5, 5, 2, 6, and strides of 2, 2, 2, 1, 2, following the convention of a smaller second to last kernel followed by a large last kernel. Then, we include the same LReLUs and a sigmoidal activation function at the end.
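The spatial dimensions implied by these layer specifications can be checked with standard convolution arithmetic. The padding of 2 in the sketch below is our assumption, as it is not stated explicitly in the text:

```python
def conv_out(n, k, s, p):
    """Spatial size after a 2-D convolution on a square n x n input."""
    return (n + 2 * p - k) // s + 1

# Four specialized encoder layers plus the weight-shared layer,
# each with kernel 5 and stride 2 (padding 2 assumed).
size, channels = 128, [32, 64, 128, 256, 512]
for c in channels:
    size = conv_out(size, k=5, s=2, p=2)
    print(f"{c:4d} channels -> {size}x{size}")
# 32 channels -> 64x64, ..., 512 channels -> 4x4
```

Under this assumption, the 128 × 128 input is reduced to a 4 × 4 × 512 tensor before the fully connected bottleneck of sizes 512, 256, and 512.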

The representation in the latent space is not only fed to the weight-shared decoder, but also to a three-layer fully connected network of neurons acting as a regressor. The sizes of the layers are 256, 128, and lastly 1. Linear layers and a dropout of 0.2 are used for the regressor. The rectified Adam (RAdam) [43] optimizer worked best for the training procedure.
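This regression head can be sketched as a plain forward pass. The 256-dimensional latent input and the random placeholder weights are assumptions for illustration only:

```python
import numpy as np

def regressor_forward(z, weights, train=False, drop=0.2, seed=0):
    """Sketch of the fully connected regression head (layer sizes 256,
    128, 1; linear layers with dropout 0.2 during training)."""
    rng = np.random.default_rng(seed)
    h = z
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if train and i < len(weights) - 1:
            h = h * (rng.random(h.shape) > drop) / (1 - drop)  # inverted dropout
    return h

rng = np.random.default_rng(0)
weights = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
           for m, n in [(256, 256), (256, 128), (128, 1)]]
count = regressor_forward(rng.standard_normal(256), weights)
print(count.shape)  # (1,)
```

The single output neuron yields the estimated cell count, which is compared against the known label of synthetic (or translated) data to form the regression loss.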

**Figure 4.** Visualization of the Siamese-Cycle-VAE (SC-VAE) architecture. The blue elements represent synthetic data handling, yellow elements depict natural data handling. Green elements are shared by the two VAEs and contain the inner representation of the cell imagery; purple elements result in an estimation for the cell count. The example images that are outlined are samples from the data sets on the left, with their respective results shown on the right. The translated images outlined with color transitioning have been generated from the opposite data type and are of particular interest, as well as the blue and yellow arrows pointing from right to left that indicate the reuse of decoded images. The examples at the very top left and bottom right are of the utmost importance, since they show the conversion of a synthetic image to a natural-looking one, which can then be used as a labeled pseudo-natural image for training of the regressor with natural-looking images.

One of the VAEs works on proxy data, and we will refer to it as VAE-syn, while the other one processes natural data (VAE-nat). The differing visual features of proxy and real data are accounted for in the separated layers, while the weight-shared encoder and decoder rely on and enforce a similar representation of the determinant image characteristics. In addition to auto-encoding, the architecture works on data with known labels in a supervised manner by the addition of a three-layer fully connected neural network regression model for the actual cell counting, based on the shared representation of the VAEs.

#### 3.3.2. Learning Scheme

For images $x$ of either natural or synthetic type $t \in \{n, s\}$, the VAEs are able to generate reconstruction losses Rec($x$, $y$) from reconstructed images $y$ of their decoder. $C_{\mathrm{Rec}}^{t}$ are handcrafted weighting factors to balance the different reconstruction costs. Choosing these weights to be large results in better reconstruction but worse regression. However, proper reconstruction quality is required to obtain well-trained encoders, thus demanding that the factors not be too low. The loss for the reconstructions is defined as

$$\mathrm{REC}\_{\mathrm{loss}}(x, t) = C\_{\mathrm{Rec}}^{n} \cdot \mathrm{Rec}(x\_{n}, y\_{n}) + C\_{\mathrm{Rec}}^{s} \cdot \mathrm{Rec}(x\_{s}, y\_{s}) \tag{1}$$
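In code, Equation (1) is a simple weighted sum. Mean squared error is used here as a stand-in for Rec, and the weighting factors are illustrative defaults, both our assumptions:

```python
import numpy as np

def rec(x, y):
    """Stand-in reconstruction cost: mean squared error."""
    return float(np.mean((x - y) ** 2))

def rec_loss(x_n, y_n, x_s, y_s, c_n=1.0, c_s=1.0):
    """Weighted sum of the natural and synthetic reconstruction costs."""
    return c_n * rec(x_n, y_n) + c_s * rec(x_s, y_s)

x = np.zeros((128, 128))
print(rec_loss(x, x + 0.1, x, x))   # only the natural term contributes here
```

Raising `c_n` or `c_s` trades regression accuracy for reconstruction quality, as described above.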

For synthetic data with cell counts $l$ from 1 to 30, we can also generate a regression loss Reg($x_s$, $l$). However, Reg($x_n$, $l$) usually cannot be calculated, since $l$ is not known for natural images. In Section 4, our ablation studies show that this is insufficient for effective regression on natural data. The internal representations of the two types of images are naturally separated in the bottleneck, precisely what VAEs are usually known and used for, resulting in high precision for synthetic data, but nearly arbitrary cell counts for natural data.

The specialized architecture allows an additional learning scheme to generate a loss for pseudo-natural data. This is done by encoding synthetic data with their specialized encoder, but decoding them with the decoder designed for natural data. This translation works both ways and results in images $x_{s \to n}$ and $x_{n \to s}$.

This new type of data can now be used in the natural pipeline, creating new reconstruction losses Rec($x_{s \to n}$, $y_{s \to n}$), which can be used to train the corresponding encoder and decoder, greatly enriching the data available for the natural pipeline.

These images will be called translated or cycled images from here on, and they expand the usable image types to $t \in \{n, s, s \to n, n \to s\}$. Examples of translated images and a pipeline of their generation can be seen in Figure 4. Cycled images also generate reconstruction losses, which are defined as

$$\mathrm{REC\text{-}T}\_{\mathrm{loss}}(x, t) = C\_{\mathrm{Rec}}^{n \to s} \cdot \mathrm{Rec}(x\_{n \to s}, y\_{n \to s}) + C\_{\mathrm{Rec}}^{s \to n} \cdot \mathrm{Rec}(x\_{s \to n}, y\_{s \to n}) \tag{2}$$

These images do not exactly resemble natural images and are distinguishable from them by the human eye. However, they are close enough to natural images in their relevant characteristics that, with the learning scheme described below, the architecture does not treat them as fake natural images. This beneficial circumstance allows the simulation of labeled natural data and shared representations, which becomes clearer in the UMAP of the internal representation in Section 4.3.

This process also yields pseudo-natural images *x*<sub>s→n</sub>, for which the label is known, and therefore allows for the generation of the regression loss Reg(*x*<sub>s→n</sub>, *l*). This way, we can train the full regression pipeline for natural data without any labeled natural data at all. Henceforth, we refer to this process as translation learning.

Furthermore, we can translate the same images again, leading to yet two more types of images, *t* ∈ {s → n → s, n → s → n}, which should appear near-identical to the original reconstructions *y*. We first designed this difference to be a loss as well, but later omitted this training step during hyperparameter optimization, as it did not improve the accuracy of cell counts significantly while adding another step of the more demanding image backpropagation to the pipeline. However, we still create these bilateral translations for specialized top-performing models (see Table 2) and for the reasons mentioned below. Since the cycling of data through the different types is what allows the architecture to perform a regression task on unlabeled natural data, we call it Siamese-Cycle-VAE, or SC-VAE for short; variants with enabled bilateral learning cycles will be called SC-VAE-B from here on.

**Table 2.** Evaluation of all baselines and SC-VAE on the data sets Nat-PC, Nat-BF, Syn-PC and Syn-BF. For each method and data set, we report the mean absolute error (MAE), the mean relative error (MRE), and the accuracy. Ultimately, only performance on natural data (Nat) is important, but we also report the performance on synthetic data (Syn) to provide further context. We use an upward arrow ↑ to indicate that higher is better; a downward arrow ↓ means lower is better. The best results achieved per category are marked in bold; *(ss)* denotes a semi-supervised method, *(u)* an unsupervised method.




As mentioned above, we also generate pseudo-synthetic data *x*<sub>n→s</sub> from natural data (blue arrow in Figure 4). Since annotations are unknown for these pseudo-data, they cannot be used to train the regression process, but they can be used for two different purposes.

The first is balancing the encoders and decoders: with the learning scheme described above, the synthetic pipeline goes through more training steps than the natural one, although the latter is the one that should be especially well-trained, as the minimization of regression losses on natural data is the actual goal of this learning scheme. In this way, the natural pipeline can also be trained on many more cell arrangements than the few that natural images provide; even with a multitude of data augmentation techniques, the generalization of encoding and decoding can be improved by this step (see Section 4).

Secondly, the decodings *y*<sub>n→s</sub> of natural images translated to the synthetic domain can be used as stability checks of the latent space for the different types of data. Badly decoded pseudo-synthetic images imply a larger-than-wanted separation of natural and synthetic images in the bottleneck. More on this is given in Section 4.1.

Considering the loss functions, let *r*(*x*) be the estimated cell count and *l* remain the label. The mean-squared error (MSE) ‖*r*(*x*) − *l*‖<sup>2</sup> and the binary cross-entropy (BCE) −(*l* · log(*r*(*x*)) + (1 − *l*) · log(1 − *r*(*x*))) yielded similar results as in our previous works, and both resulted in more precise cell counts than common alternatives. Extensive testing has therefore been done with both, but ultimately the MSE was chosen as the default, since its diminishing nature makes it easier to find appropriate coefficients for the different types of losses. The weight factors determine the importance of the counting accuracy and change over the course of the training procedure, since deriving accurate cell counts on natural data from synthetic and translated data requires preceding training of the encoders and decoders. The associated REG<sub>loss</sub>(*x*, *l*, *t*) term is defined as

$$\text{REG}\_{\text{loss}}(x, l, t) = C\_{\text{Reg}}^{s, l} \cdot \text{Reg}(x\_{s}, l) + C\_{\text{Reg}}^{s \to n, l} \cdot \text{Reg}(x\_{s \to n}, l) \tag{3}$$
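The two candidate regression losses can be written as a minimal sketch; the function names are illustrative, and rescaling cell counts into (0, 1) before applying the BCE is an assumption not detailed above:

```python
import math

def mse(r, l):
    # mean-squared error, the default regression loss Reg(x, l):
    # penalizes the squared deviation of the estimate r from the label l
    return (r - l) ** 2

def bce(r, l):
    # binary cross-entropy alternative; r must lie strictly in (0, 1),
    # so cell counts would have to be rescaled to this range first
    return -(l * math.log(r) + (1 - l) * math.log(1 - r))
```

For example, a prediction of 9.65 for a true count of 10 gives an MSE contribution of 0.1225, which shrinks rapidly as the estimate improves, the "diminishing nature" mentioned above.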

When using BCE, the decoder loss factors decay over time at a rate of 3 × 10<sup>−5</sup> per epoch. This is necessary because the BCE does not decrease significantly during training, but needs to diminish over time to increase the relative importance of low regression losses Reg(*x*, *l*).
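A minimal sketch of this decay, assuming a linear schedule with a floor at zero (the exact schedule form is not specified above):

```python
def decoder_loss_factor(c0, epoch, rate=3e-5):
    """Decoder (reconstruction) loss weight after `epoch` epochs.

    Under the BCE regression loss, the weight decays by `rate` per epoch
    so that the regression losses Reg(x, l) gain relative importance over
    time; the linear form and the floor at 0 are assumptions."""
    return max(0.0, c0 - rate * epoch)
```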

Since latent vectors sufficiently close to a normal distribution help to prevent overfitting, we aim for homogeneous representations of synthetic and natural data in the embedding space of the architecture by applying a regularization cost *D*<sub>KL</sub>, the Kullback–Leibler divergence (KLD) of the standard VAE [44]. This loss also ensures that the inner representations of natural, synthetic, and both types of cycled data stay similar, allowing us to use the special training procedure described above. The cost is applied for natural, synthetic, and both types of translated data and is defined as follows:

$$\text{KLD}\_{\text{loss}}(x, t) = C\_{D\_{\text{KL}}}^{n, n \to s} \cdot D\_{\text{KL}}(x\_{n, n \to s}) + C\_{D\_{\text{KL}}}^{s, s \to n} \cdot D\_{\text{KL}}(x\_{s, s \to n}) \tag{4}$$
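For a diagonal Gaussian posterior, the per-sample KLD term of Equation (4) reduces to the closed form of the standard VAE regularizer; a sketch:

```python
import math

def kl_divergence(mu, log_var):
    """D_KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    `mu` and `log_var` are the mean and log-variance vectors produced by
    the encoder; this is the standard VAE closed-form regularizer."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )
```

A posterior equal to the prior (zero mean, unit variance) incurs no cost, while any deviation is penalized, pulling all four data types toward the same region of the latent space.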

All coefficient factors have to be chosen mindfully, balancing the main target of penalizing incorrect cell counts on natural data against the importance of details in visual reconstruction, without undervaluing the KLD at the same time. Undervaluing the KLD can make the training procedure unstable, while applying very large regularization costs hinders and slows down the learning process. To minimize the number of hyperparameters that have to be optimized by hand, the weighting factors for the *D*<sub>KL</sub> losses were grouped, and Bayesian optimization [45] in the form of a Gaussian process regressor [46] was used to quickly find baseline values for the most important hyperparameters, such as the learning rate and the loss weight factors.

We combine these losses to form our overall SCVAE<sub>loss</sub>(*x*, *l*, *t*), use the coefficients of the different terms to balance the impacts between natural, synthetic, and both types of translated images, and handle input images with missing cell counts by fixing *C*<sup>n,*l*</sup><sub>Reg</sub> = *C*<sup>n→s,*l*</sup><sub>Reg</sub> = 0:

$$\text{SCVAE}\_{\text{loss}}(x, l, t) = \text{REC}\_{\text{loss}}(x, t) + \text{REC-T}\_{\text{loss}}(x, t) + \text{REG}\_{\text{loss}}(x, l, t) + \text{KLD}\_{\text{loss}}(x, t) \tag{5}$$
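Equation (5) is a plain weighted sum of the four loss groups; a sketch with illustrative numbers (the dict-based interface is hypothetical, and the coefficient values below are placeholders, not the tuned ones from Section 4):

```python
def scvae_loss(rec, rec_t, reg, kld, c_rec, c_rec_t, c_reg, c_kld):
    """Weighted combination of the four loss groups in Equation (5).

    Each argument maps an image type to a scalar loss or weight; types
    without labels (n and n->s) get zero regression weight."""
    return (
        sum(c_rec[t] * rec[t] for t in rec)          # REC_loss,   Eq. (1)
        + sum(c_rec_t[t] * rec_t[t] for t in rec_t)  # REC-T_loss, Eq. (2)
        + sum(c_reg[t] * reg[t] for t in reg)        # REG_loss,   Eq. (3)
        + sum(c_kld[t] * kld[t] for t in kld)        # KLD_loss,   Eq. (4)
    )

# Illustrative values; the regression weights for unlabeled types are 0:
total = scvae_loss(
    rec={"n": 0.1, "s": 0.2},
    rec_t={"n->s": 0.3, "s->n": 0.4},
    reg={"s": 0.5, "s->n": 0.6, "n": 0.0, "n->s": 0.0},
    kld={"n": 0.01, "s": 0.02, "n->s": 0.03, "s->n": 0.04},
    c_rec={"n": 100.0, "s": 200.0},
    c_rec_t={"n->s": 50.0, "s->n": 50.0},
    c_reg={"s": 5.0, "s->n": 5.0, "n": 0.0, "n->s": 0.0},
    c_kld={"n": 1.0, "s": 1.0, "n->s": 1.0, "s->n": 1.0},
)
```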

#### 3.3.3. Baselines

For the evaluation in the upcoming section, several baselines have been gathered to enable a meaningful comparison with the state of the art. The first baseline is a widely practiced classical computer vision pipeline. The input images are first cropped to only contain the cell chamber and are then blurred with an averaging kernel-based filter; a thresholding filter is then applied, followed by a watershed segmentation [29]. The regions of the segmented image are counted and used as a cell count estimate. In order to find suitable parameters for this pipeline, an exhaustive grid search was performed for each data set, BF and PC. The code repository contains the best hyperparameters found. We refer to this pipeline as Watershed in the following.
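A greatly simplified stand-in for this pipeline, assuming only thresholding and 4-connected region counting (the real baseline additionally crops, blurs, and applies a watershed segmentation to split touching cells):

```python
def count_regions(img, threshold):
    """Threshold a 2D intensity grid and count 4-connected foreground
    regions via flood fill; each region is taken as one cell."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for i in range(h):
        for j in range(w):
            if img[i][j] > threshold and not seen[i][j]:
                regions += 1
                stack = [(i, j)]  # flood-fill this region
                while stack:
                    a, b = stack.pop()
                    if (0 <= a < h and 0 <= b < w
                            and img[a][b] > threshold and not seen[a][b]):
                        seen[a][b] = True
                        stack += [(a + 1, b), (a - 1, b),
                                  (a, b + 1), (a, b - 1)]
    return regions
```

Without the blur and watershed steps, touching cells merge into one region, which is exactly the failure mode that motivates the full pipeline and, ultimately, the learned approach.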

As a second baseline, we fine-tuned a pre-trained state of the art deep convolutional neural network, specifically a variant of EfficientNet [28]. We replace the last layer of the pre-trained network with a fully connected layer that outputs a single value, and train it to predict the cell count for a given input image. We apply the same hyperparameter optimization as for our own method and generate the same data augmentation. Since EfficientNet is a variable architecture that comes in different sizes, referred to as EfficientNet-B0, EfficientNet-B1, and so on, we evaluated EfficientNet-B0 through EfficientNet-B3 and found that the smallest variant, EfficientNet-B0, performed best, while larger variants performed progressively worse. We considered using EfficientNetV2 [47] instead, but our preliminary results showed that the same performance degradation applies to its larger variants as well, and since EfficientNet-B0 outperformed the smallest EfficientNetV2-S variant, we retained it and refer to this fine-tuned convolutional neural network as EfficientNet hereafter.

As a third baseline, we compare a state of the art transfer learning model from Kolesnikov et al. called BiT, which produces highly accurate classification results on CIFAR-100 and similar data sets in few-shot learning settings of 1 to 10 examples per class. BiT consists of the classical ResNet [48] architecture, but with very long pre-training on large image sets and a custom hyperrule that determines the training time and learning rate during transfer depending on the size of the new data set. Changes to the hyperrule were tested, but did not cause any significant improvement in accuracy; therefore, the values provided by the authors were used. BiT is given all the natural and synthetic training data per epoch, so it can come up with meaningful cell counts on natural data by abstracting from the labeled synthetic data. We treated the possible cell counts from 1 to 30 as classes, to account for the difference in training methodology.

In addition, we compare our own previous work Twin-VAE (see [3]) and its alterations Transfer Twin-VAE and Dual Transfer Twin-VAE (see [4]). These are based on the same architecture, but perform semi-supervised learning techniques for which the same data are used, albeit with partial annotations on the natural training data of 5–10%. Although this circumstance should allow for higher accuracy on the counting task, the optimized pipeline and cyclic data reuse of the new Siamese-Cycle-VAE is able to keep up with and in some cases even outperform its predecessors, despite not being given any manual labels at all. More on this is given below.

Lastly, ablation studies are done to show that the specific architectural details and principles of the learning scheme are helpful and improve the training procedure and therefore the accuracy on the regression task. The first study will be called C-VAE from here on. In this alteration of the network, there are no specialized Siamese encoders and decoders, but the cyclic structure is kept. C-VAE should still be able to make meaningful cell predictions, although the abstraction between natural and synthetic data has to happen in the inner layers of the VAE. The cyclic structure and the difference between original and reconstructed images can still help the architecture to enrich the data in a more extensive way than classical data augmentation alone can. The second study is called S-VAE. Here, the Siamese architecture is kept, but we omit the cycling and do not use the reconstructed image data as new input, but merely for the reconstruction loss, as in the standard VAE. As there are no labels on the natural data and no translated pseudo-natural imagery with labels either, the regressor lacks a loss to meaningfully train for this type of data directly, but could possibly abstract from the differentiation between natural and synthetic data in the latent space and still achieve adequate accuracy on cell counting.

#### **4. Results**

As for the hyperparameter choices, the best results were achieved with decoder loss factors *C*<sup>n</sup><sub>Rec</sub> = 1 × 10<sup>2</sup> and *C*<sup>s</sup><sub>Rec</sub> = 2 × 10<sup>2</sup>, with the higher loss on synthetic data accounting for the higher image variety of these images, while *C*<sup>n→s</sup><sub>Rec</sub> = *C*<sup>s→n</sup><sub>Rec</sub> = 5 × 10<sup>1</sup> resulted in the lowest reconstruction losses. While not mandatory to minimize, a reduction in the reconstruction loss of translated images is almost always coupled with lower regression losses. The regressor loss factors for synthetic data *C*<sup>s</sup><sub>Reg</sub> and pseudo-natural data *C*<sup>s→n</sup><sub>Reg</sub> are both set to 5 and should inversely account for the ratio between the corresponding types of data. The KLD factor *C*<sub>DKL</sub> = 1 yields the best results for the larger data set Nat-PC, while slightly larger factors work better for Nat-BF, constraining the inner representations of synthetic, natural, and translated images to be coupled tightly. Faster convergence was observed for smaller KLD factors, but the learning scheme tended to separate more between data types, resulting in better reconstructions but poorer regressions. Figure 5 shows the combined losses and indicates convergence.

In addition, a soft weight decay of 2 × 10<sup>−5</sup> per epoch, a constant learning rate of 0.75 × 10<sup>−5</sup>, and a delay of the regressor start by 25 epochs are used to achieve the following results. Batch sizes of 128 work best for both types of microscopy imagery, and the training runs for up to 20,000 epochs, as there are no significant improvements after this. Ablation studies with more synthetic data relative to natural data have been done as well. In general, the architecture appears to converge faster when measured by epochs, but when taking the increase in training batches per epoch into account and therefore measuring by the number of computations, the training speed is marginally lower in all cases, so we retain the 1:1 ratio.
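These schedule values can be collected in a small configuration sketch; the names are illustrative and not taken from the code repository:

```python
# Training schedule as reported above (illustrative naming).
SCHEDULE = {
    "weight_decay": 2e-5,      # soft weight decay per epoch
    "learning_rate": 0.75e-5,  # constant learning rate
    "regressor_delay": 25,     # epochs before the regressor starts
    "batch_size": 128,         # for both microscopy types
    "max_epochs": 20_000,      # no significant improvement afterwards
}

def regressor_active(epoch, schedule=SCHEDULE):
    # the regressor trains only after the encoders/decoders have had a
    # head start of `regressor_delay` epochs
    return epoch >= schedule["regressor_delay"]
```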

**Figure 5.** Visualization of the combined losses of the SC-VAE top-performing model during training with regularly applied tests, in this case on Nat-PC. It can be seen that after 20,000 epochs, convergence is imminent but has not fully been reached. Accuracy on cell counts does not improve significantly after this point; only image reconstruction quality does. Since the primary goal is not to diminish the reconstruction and normalization losses to zero, but rather to balance the different losses, the combined loss can only indirectly be interpreted as a convergence indicator. Nevertheless, larger and faster descents in the combined loss still indicate well-trained models, even if this is insufficient as a sole indicator.

#### *4.1. Comparison*

We present the results of our method and the comparative baselines in Table 2. The mean relative error (MRE) is a normalized error that takes the ground truth into account, i.e., in high cell count images, small absolute deviations do not increase the error as much as they do in low cell count images. When interpreting experimental results as a biological expert, this is in most cases a more meaningful indicator than the mean absolute error (MAE), which serves as the typical metric for a regression task. The bilateral alteration SC-VAE-B, which uses fully cycled images (back and forth), results in marginal but reliable improvements by assimilating representations in the latent space, and should be considered our top candidate.
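The three reported metrics can be sketched as follows; treating a prediction as correct when it rounds to the true count is our assumption about the accuracy convention:

```python
def evaluate(predictions, labels):
    """Return (MAE, MRE, accuracy) for real-valued cell count predictions.

    MRE normalizes each absolute deviation by the true count, so the same
    miss weighs less on crowded images; accuracy counts a prediction as
    correct when it rounds to the true count (an assumed convention)."""
    n = len(labels)
    mae = sum(abs(p - l) for p, l in zip(predictions, labels)) / n
    mre = sum(abs(p - l) / l for p, l in zip(predictions, labels)) / n
    acc = sum(round(p) == l for p, l in zip(predictions, labels)) / n
    return mae, mre, acc
```

For instance, predictions of 9.65 and 4.0 against true counts 10 and 4 give an MAE of 0.175, an MRE of 1.75 %, and 100 % accuracy after rounding.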

Our SC-VAE consistently outperforms the other state of the art methods Watershed, BiT, and EfficientNet by a wide margin. SC-VAE and its alteration SC-VAE-B correctly estimate around 62 % of the cell counts for the Nat-PC data set, their predictions differ on average by only 0.5 cells from the true cell counts of the images, and they achieve approximately 5.1 % MRE. For the smaller Nat-BF data set, SC-VAE-B accomplishes 0.56 MAE, 6.5 % MRE, and 58.7 % accuracy. While Dual Transfer Twin-VAE achieves slightly better results for these data, they are attained by semi-supervised training, which is commonly not even compared with unsupervised methods. As such, Siamese-Cycle-VAE holds up against semi-supervised training methods and even exceeds them in the case of the larger Nat-PC data set, making it suitable for reliable cell counting with various microscopy techniques.

Moreover, we see that Siamese-Cycle-VAE performs well across the entire range of cell counts in Nat-PC and Nat-BF. By contrast, Watershed and EfficientNet struggle with images that contain few cells, which is the most important range of cell counts for biological tasks, such as estimating the growth rate.

The ablation C-VAE, which feeds all data through the same encoder and decoder, results in accuracy on synthetic data that is inferior to the other methods, and even more so on the important natural data. By using the reconstructed images as new input, the learning scheme resembles the optimized scheme of SC-VAE in that visual intricacy on natural data is simplified, but not to the same degree as with SC-VAE.

S-VAE, on the other hand, worked best on synthetic data, especially for Nat-BF, but for both types of microscopy data, the MRE and accuracy on natural data are far from the results of SC-VAE. No translated natural data are generated by S-VAE, so it is missing the regression loss for natural data completely. Cell counts on natural data are not random, since the shared encoder could still unify the two types of data, but the vastly different accuracies between natural and synthetic data show that the S-VAE's encoder fails to do so because of a missing incentive.

#### *4.2. Image Reconstruction and Representation*

An analysis of the reconstruction abilities of Siamese-Cycle-VAE is useful to ensure that the shared representation is meaningful, even though our main aim is automatic cell counting, not perfect image reconstruction.

During the training of Siamese-Cycle-VAE, the image inputs are processed by their respective encoder, followed by the general, weight-shared encoder, represented in the bottleneck of the architecture; they are then processed by the shared decoder and finally reconstructed by their specialized decoder accordingly (see Figure 4). The same is true for auxiliary data and both types of translated pseudo-imagery. To ensure that the actual regressive task works as intended for natural images, it must be able to benefit from synthetic data representations in the latent space, so the learned representation must be shared by the four types of data.

This can be verified by encoding natural images with their appropriate encoder, but performing the decoding with the decoder designed and trained for auxiliary images, the counterpart to the opposite conversion, which is done in every epoch of training. Minimal changes across the stages of images that are converted back and forth indicate a close coupling of the representations. The closer together the different data types lie in the latent space, the greater the potential gain for regression on natural data. Moreover, the conversion makes this fact interpretable on a visual level.

We show examples of perfect translations in Figure 6. For these samples, a natural image is encoded and then decoded as a synthetic image. The number of cells remains unchanged, and the position and size of the cells are also maintained. However, the overall appearance is simplified: Siamese-Cycle-VAE learned to remove noise and to break down the reconstruction to the essentials. Even the very large smudge on the left natural image has not been reconstructed; although it will cause an increased loss in the reconstruction, the weighting of the loss factors makes it more acceptable to forfeit image reconstruction precision in favor of the regression. On the right side, it can be seen that the output does indeed appear more similar to natural data than the synthetic input does, while fine details such as the noisy borders are not recreated.

The ongoing cell division shown in Figure 7 is a prime example to understand how Siamese-Cycle-VAE works. The membrane of the bottom right cell is not fully enclosed and there is no overlap, since a fine bright border of the underlying cell would be seen through the top cell. However, two cell cores can clearly be seen and a human expert would presumably count this situation as two cells, which is exactly what Siamese-Cycle-VAE does. The prediction of 9.65 instead of 10 can be understood as uncertainty and a slightly earlier stage of the division would have arguably led to a slightly smaller prediction, which, when rounded, would be the correct cell count again. The effect of simplified visuals also happens in these non-translated reconstructions; the smudges on the Nat-BF sample are clearly fainter and, in the left image, even the high-contrast dead cell residue on the left is not recreated. This clearly indicates that even when Siamese-Cycle-VAE does not predict the cell count perfectly in an image, the comparison between the original and reconstructed image is useful to understand where an error occurs.

**Figure 6.** Examples of translations used for cycling. From left to right: natural image, corresponding translated image from natural to synthetic, synthetic image, corresponding translated image from synthetic to natural. Compositions stay the same, but the visual style has been transferred. The translated images can now be used in the encoder designed for the type of data that they are imitating and thereby serve a special purpose for each of the two translations: enriching the VAE's process of encoding and decoding with unseen data, which is especially helpful for the natural coders due to the limited availability of natural data (*syn* → *nat*), and allowing the regressor that is well trained to handle synthetic data to count cells in translated natural data (*nat* → *syn*).

**Figure 7.** Examples of synthetic-looking reconstructions of natural images. The reconstructions are to the right of their natural counterparts. The composition of cells stays the same, positions are near identical, cell sizes are preserved, and smudges are not recreated, or, if so, they are very faint, semantically not impacting the regression task much, since it learns to extract the encoding of large, high-contrast cell boundaries. When rounded to full numbers, the cell counts of 10 on the left and 4 on the right match exactly. Without rounding, the predicted cell count on the left side is too low by 0.35. This can be interpreted semantically as the ongoing cell division in the bottom right of the image.

#### *4.3. Shared Representation*

Siamese-Cycle-VAE's ability to translate back and forth between natural and synthetic images illustrates the semantically shared representation of all four types of data learned by the autoencoder. Below, we visualize this shared representation. Because each image is encoded as a 256-dimensional vector, we need to reduce the dimensionality to do so. Uniform Manifold Approximation and Projection (UMAP) [39] has established itself as the state of the art for nonlinear dimensionality reduction. It computes a topology-preserving embedding that can be used for semantic interpretations of representations. In the resulting embedding (see Figure 8), we see that synthetic and natural data occupy the same space, and we can even observe that both types of translated images also lie on the same projection space.

UMAP is therefore unable to separate the latent representations of the different types of data, which allows us to visually understand what is meant by tightly coupled representations. UMAP is non-parametric; therefore, axes and scale have no meaning beyond the preservation of relations. Since we can observe that the cell count varies along the main axis, it is the main determining factor in the latent space, providing perfect conditions for a well-functioning regressor, since images are represented very differently depending on the number of cells that they contain.

**Figure 8.** Embedding of the trained representations, determined via UMAP. Illustrated are natural, labeled test samples (red circles); unlabeled test samples (grey); synthetic samples (blue), and both types of translated images, *syn* → *nat* (green) and *nat* → *syn* (yellow). Cell counts are separated by brightness, with darker dots indicating low cell counts and brighter dots indicating a high cell count. Since UMAPs are non-parametric, axis and scale have no meaning, but relations are preserved. Since dots become visibly brighter from left to right and this is the main axis along which the dots are separated, UMAP has determined this direction to be the most important and it directly corresponds to cell counts. Simultaneously, natural and auxiliary images do not become separated. If this were the case, it would contradict a truly shared representation between the different types of data.

It can be seen that data that have been translated from synthetic to natural (green) tend to encapsulate the synthetic data (blue); this is more so the case for the natural data that are translated to synthetic (yellow), which encapsulate the original natural data (red and gray). This can be interpreted as semantic coverage, which means that, for every possible natural, unlabeled data point, there are labeled data points nearby, demanding only minor abstractions of the regressor to be able to achieve a meaningful cell prediction.

Another way to ensure meaningful representations and condensed information in the bottleneck of the Siamese-Cycle-VAE architecture is to sample images from noise vectors and check two aspects of them: first, they should show deceptive images that could be reconstructions from real data of their type, and, secondly, slight changes to the random vectors should result in similar but not identical images. Both behaviors can be observed in Figure 9; therefore, the latent representation contains information in a semantically meaningful way.
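A sketch of this sampling procedure, assuming a 256-dimensional latent space with a standard normal prior; the perturbation scale is illustrative:

```python
import random

random.seed(0)       # for reproducibility of this sketch
LATENT_DIM = 256     # bottleneck size, cf. Section 4.3

def sample_latent(dim=LATENT_DIM):
    # draw a noise vector from the standard normal prior of the VAE;
    # decoding it should yield a plausible cell image
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def perturb(z, eps=0.05):
    # slightly shift a latent vector; decoding z and perturb(z) should
    # yield similar but not identical images (cf. Figure 9, row 2)
    return [zi + random.gauss(0.0, eps) for zi in z]
```

Decoding `sample_latent()` with either the natural or the synthetic decoder corresponds to rows 1 and 3 of Figure 9, while decoding `perturb(z)` corresponds to row 2.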

The distribution of the UMAP also suggests that certain areas of the latent space represent a determinable number of cells. We tested this and found that there are indeed areas in the latent space that lead to reconstructions with low cell counts, with all reconstructions within the local area resulting in low cell counts, while other areas represent high cell counts, and this is exactly what is reconstructed by the decoders when the latent space is sampled there.

**Figure 9.** Samples generated from the latent space by inputting noise vectors and decoding them with the natural (row 1) and synthetic (row 3) decoders. Appropriate cell imagery can be reconstructed from these; consequently, the latent space meaningfully represents the important information of possible input images for this domain. Adding and subtracting tiny amounts to and from these vectors results in semantically similar images (row 2), often with only one cell more or less, where the cells are slightly larger or smaller and have slightly changed position, while samples from a completely different part of the latent space yield completely different images.

#### **5. Discussion**

We now discuss the limitations of this architecture and possible revisions to overcome them. During analysis, we found that only subpar precision is achieved for very small cells in the natural data. Since the working resolution of the architecture is 128 × 128 pixels, these cells are barely visible in the downscaled versions of the images and therefore cannot be counted with low error. In future work, the working resolution could be doubled per axis, which requires new layers in the specialized encoders and decoders but leaves the rest of the architecture unchanged. Alternatively, local crops of quarters of the images could be used, allowing a quasi-double resolution by answering the question of cell count with the sum over 4 quarters.
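The quarter-based workaround can be sketched as follows; `count_fn` stands in for the trained regressor, and cells straddling a quadrant boundary would still need dedicated overlap handling:

```python
def count_by_quadrants(img, count_fn):
    """Quasi-double the effective resolution by counting each quarter of
    a 2D image separately and summing the four estimates.

    Caveat: a cell lying exactly on a quadrant boundary may be counted
    twice (or missed); a real implementation would need overlapping crops
    or boundary logic."""
    h2, w2 = len(img) // 2, len(img[0]) // 2
    quads = [
        [row[:w2] for row in img[:h2]], [row[w2:] for row in img[:h2]],
        [row[:w2] for row in img[h2:]], [row[w2:] for row in img[h2:]],
    ]
    return sum(count_fn(q) for q in quads)
```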

Large and high-contrast light reflections can also be problematic for satisfactory regression. When scaling down an image such as the phase-contrast microscopy image on the left in Figure 1, the smaller reflections shrink to a single bright pixel in the working resolution, too small to impact the cell count. When these reflections are larger, as with the one on the very left, they can lead to quite high reconstruction losses and cause the architecture to replicate them, although they should be filtered out and ignored. To overcome this, a step could be added to the image preprocessing that detects this effect and dims the affected area. More elegantly, the reconstruction loss could be capped with local maxima, so that the resulting high deviations are not fully accounted for in the training of the network. Further, although the proxy image generator is merely auxiliary content for this work, new microscopy imagery currently makes it obligatory to find appropriate parameters for the generator, accounting for cell sizes, border brightness, etc. A more sophisticated generator could be able to algorithmically derive auxiliary data automatically from given natural data.
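Capping the per-pixel reconstruction error could look like this minimal sketch; the cap value is illustrative:

```python
def capped_rec_loss(x, y, cap=1.0):
    """Squared reconstruction error with each per-pixel term clamped to
    `cap`, so that large, high-contrast artifacts such as reflections do
    not dominate the loss and get faithfully replicated.

    `x` and `y` are flat sequences of pixel intensities; `cap` is an
    illustrative local maximum for a single pixel's contribution."""
    return sum(min((xi - yi) ** 2, cap) for xi, yi in zip(x, y))
```

A single bright-reflection pixel that is off by 3 intensity units then contributes at most `cap` instead of 9, leaving the gradient signal dominated by the cell structures.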

Due to the different types of network parts present in the architecture and the resulting loss of Equation (5), it can be difficult to understand the importance of optimizing the different parts of the composite loss. Forcing better reconstructions by setting the corresponding weight factors to high values may bring the disadvantage of worse regression, but not necessarily, because, to some extent, better reconstructions also help to ensure that the existence of cells is represented in the latent space, which is a major requirement for the regressor to achieve high accuracy.

The amount of hand-crafted meta-parameters could be reduced by the more extensive use of meta-learning systems, such as a modified regressor for the Gaussian process that we used, enabling the creation of a simple tool that users of a complete solution could utilize for cell counting during live cell imaging experiments. Thus far, alternating between automated meta-learning and hand-crafting, with multiple parallel runs using different meta-parameter choices, has been utilized to find good parameters quickly.

Implementations for the real-time, continuous estimation of cell counts in experiment monitoring would be a practical way to make this architecture and its learning scheme easily usable for biologists. Despite these limitations, with SC-VAE, it is possible to outperform state of the art alternatives, sometimes by a wide margin, and it can compete with its semi-supervised predecessor.

A surprising finding was that the weight factors of the KLD loss in Equation (4) can be quite low, and therefore hinder the learning process from ensuring shared representations only very little, provided the parameters of the other losses are chosen well. We are inconclusive regarding what makes them well chosen, but the parameters that we found allow a very high loss factor for regression, especially for translated pseudo-natural images, without the representations becoming separated or the loss or architecture becoming unstable, a common outcome in the literature when the loss of the main task is weighted too highly, devaluing the losses of indirect tasks or those only achievable late in sequential learning schemes.

We will now conclude the contribution and summarize our findings.

#### **6. Conclusions**

With our specialized learning scheme, we created a basis for automated cell counting in the domain of microfluidic cell cultivations, and we presented a workflow for the unsupervised image recognition of mammalian suspension cells, obtained by live cell imaging. The auxiliary data generator presented delivers arbitrary amounts of synthetic microscopy imagery and, with only minor adjustments, can also generate images for entirely different types of cells and microscopy technologies. SC-VAE demands only rough similarity between synthetic and natural data, omitting the laborious task of replicating the intricate visual details of the natural data. The presented technique operates independently of the actual cell sizes of the organism being studied, and the adaptation to, e.g., elongated bacterial cells or plant cells can be done easily.

In Section 1, we mentioned that the manual procedure of labeling such imagery by human experts is not feasible and requires automation. We overcome this issue by delivering an end-to-end solution that is usable not only by experts, requires no hand-labeled data at all, and still competes with semi-supervised state-of-the-art solutions that do require manual labels. We also present an innovative means of gaining insights into the latent spaces of this type of Siamese network by comparing cycled images, i.e., images converted back and forth, to their original counterparts, and by translating natural data to pseudo-synthetic data in order to ensure the stability of the internal representations and a meaningful latent space distribution from which we can sample freely, in a way that is understandable to the human eye.

The Siamese-Cycle-VAE architecture helps us to understand what requirements exist for the presence, quantity, and quality of natural data in the image processing domain, specifically related to an unsupervised regression task.

Moreover, we show that our specialized learning scheme grants SC-VAE the ability to abstract from the fact that data are synthetic by ensuring that all elements of the architecture that tend to discriminate between different types of data are vastly overruled by elements that do not. Only the novel learning scheme that we present makes it possible to generate a meaningful loss without any labeled original data.

We encourage future learning methods and architectures in other domains with similar research questions and obstacles, especially the lack of labeled data, to adapt the general idea of this machine learning scheme and architecture, albeit with different types of difficulties, particularly in cases where the generation of auxiliary data cannot be directly coupled to a target variable or classification, i.e., domains where the full coverage of possible natural data by synthetic data is not trivial.

**Author Contributions:** Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing, and visualization have been performed by D.S. Review, supervision, project administration, and funding acquisition have been performed by B.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministerium für Kultur und Wissenschaft NRW, grant number NW21-059A (SAIL).

**Data Availability Statement:** We have made the data sets available at https://pub.uni-bielefeld.de/record/2960030 (accessed 27 February 2023) and make the source code available at https://github.com/dstallmann/cyclic\_siamese\_learning (accessed 27 February 2023).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

### *Article* **How to Open a Black Box Classifier for Tabular Data**

**Bradley Walters, Sandra Ortega-Martorell, Ivan Olier \* and Paulo J. G. Lisboa**

School of Computer Science and Mathematics, Liverpool John Moores University, Liverpool L3 2AF, UK **\*** Correspondence: i.a.oliercaparroso@ljmu.ac.uk

**Abstract:** A lack of transparency in machine learning models can limit their application. We show that analysis of variance (ANOVA) methods extract interpretable predictive models from them. This is possible because ANOVA decompositions represent multivariate functions as sums of functions of fewer variables. Retaining the terms in the ANOVA summation involving functions of only one or two variables provides an efficient method to open black box classifiers. The proposed method builds generalised additive models (GAMs) by application of L1 regularised logistic regression to the component terms retained from the ANOVA decomposition of the logit function. The resulting GAMs are derived using two alternative measures, Dirac and Lebesgue. Both measures produce functions that are smooth and consistent. The term partial responses in structured models (PRiSM) describes the family of models that are derived from black box classifiers by application of ANOVA decompositions. We demonstrate their interpretability and performance for the multilayer perceptron, support vector machines and gradient-boosting machines applied to synthetic data and several real-world data sets, namely Pima Diabetes, German Credit Card, and Statlog Shuttle from the UCI repository. The GAMs are shown to be compliant with the basic principles of a formal framework for interpretability.

**Keywords:** ANOVA; Shapley values; self-explaining neural networks; generalised additive models; interpretability

#### **1. Introduction**

Machine learning models can be inherently interpretable, typically by fitting decision trees [1] or even by representing an existing black box model, such as a neural network, by extracting rules, whether using decompositional methods to explain the activity of individual hidden neurons, or applying pedagogical methods to fit the decision surface with axis-orthogonal hypercubes [2]. Decision trees have been successfully used to build transparent models in high-stakes applications [3].

Alternatively, statistical models, such as logistic regression, have high classification performance for the levels of noise typical for clinical prediction models [4]. Both decision trees and logistic regression have been successful and have global interpretability. However, each has a significant limitation. Rule sets can grow so complex as to become opaque to the user. Generalised linear models, while accurately modelling the nature of chance variation in the data through an appropriate choice of the output distribution function, require a priori choices of attribute factors, often resorting to categorising input variables to better capture non-linearities in the data.

A model that is arguably gold-standard should combine linear additivity with the automatic estimation of the non-linear dependence of the prediction on individual variables or pairs of variables. This can be achieved with generalised additive models (GAMs) [5]. We investigate to what extent it is possible to buck the performance-interpretability tradeoff for tabular data by deriving GAMs from existing black box models, or using standard machine learning approaches to seed a GAM, keeping only univariate and bivariate terms.

**Citation:** Walters, B.; Ortega-Martorell, S.; Olier, I.; Lisboa, P.J.G. How to Open a Black Box Classifier for Tabular Data. *Algorithms* **2023**, *16*, 181. https://doi.org/10.3390/a16040181

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 14 February 2023; Revised: 13 March 2023; Accepted: 17 March 2023; Published: 27 March 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Opening black boxes with ANOVA in this way is attractive because GAMs quantify Bayesian models in a way that is natural for human thinking. In particular, the representation of the logit as a GAM represents the prediction of the probability of class membership as a combination of independent effects, much in the way that logistic regression does, but allowing for non-linear functions of the input variables. Specifically, for input dimensionality *d*, the odds ratio of this probability has the form

$$\frac{P(\mathbb{C}|\mathbf{x})}{1 - P(\mathbb{C}|\mathbf{x})} = e^{\phi_1(x_1)} \cdot e^{\phi_2(x_2)} \cdot \ldots \cdot e^{\phi_d(x_d)} \cdot e^{\phi_{1,2}(x_1, x_2)} \cdot \ldots \cdot e^{\phi_{(d-1),d}(x_{d-1}, x_d)} \cdot e^{\phi_0} \tag{1}$$

where the terms *ϕi*(*xi*) are univariate functions, hence easily interpreted, *ϕij*(*xi*, *xj*) are bivariate functions, which can also be easily plotted, and *ϕ*0 is the *null model* for which all of the component functions *ϕi*(*xi*) and *ϕij*(*xi*, *xj*) are set to 0.

From a Bayesian perspective, each component *e^ϕ* models the contribution of an individual variable or pair of variables, which can enhance or suppress *P*(*C*|*x*) depending on whether the value taken by the argument function *ϕ* is positive or negative, acting on the baseline value *e*^*ϕ*0, which, if *ϕi*(0) is always 0, corresponds to the prior odds ratio in the absence of any of the input variables being present.

As the variables are entered into Equation (1), they modulate the prediction of P(C|x), much in the same way as a human observer can start with a prior probability, e.g., for the diagnosis of a clinical state, then modulate that diagnosis as the observations of the symptoms are made; each symptom contributing to increasing or reducing the probability of diagnosing the clinical state according to Equation (1). In the medical domain, many risk models are quantified by exactly this model, usually expressed as a risk score, namely

$$score(\mathbf{x}\_1, \mathbf{x}\_2, \dots, \mathbf{x}\_d) = \sum\_{i=1}^d \beta\_i \cdot \mathbf{x}\_i \tag{2}$$

with

$$\text{logit}(P(\mathbb{C}|\mathbf{x})) = \log\left(\frac{P(\mathbb{C}|\mathbf{x})}{1 - P(\mathbb{C}|\mathbf{x})}\right) = \beta_0 + score(x_1, x_2, \dots, x_d) \tag{3}$$

This corresponds to the GAM defined by Equation (1) with only univariate terms given by

$$\phi_i(x_i) = \beta_i \cdot x_i \tag{4}$$
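As a concrete illustration of Equations (1)–(4), a linear logistic risk score is a GAM whose component functions are *ϕi*(*xi*) = *βi* · *xi*, so the multiplicative odds form of Equation (1) and the additive logit form of Equation (3) must agree. A minimal sketch, with made-up coefficients (the *β* values below are illustrative, not taken from any published risk model):

```python
import math

# Hypothetical coefficients for a three-variable risk score (Eq. (2));
# the values are illustrative, not taken from any published model.
beta0 = -2.0
betas = [0.8, -0.5, 1.2]

def score(x):
    # Eq. (2): linear risk score, the sum of beta_i * x_i.
    return sum(b * xi for b, xi in zip(betas, x))

def prob(x):
    # Eq. (3): logit(P(C|x)) = beta_0 + score(x), inverted by the sigmoid.
    return 1.0 / (1.0 + math.exp(-(beta0 + score(x))))

# Eq. (1): the odds ratio factorises into e^{phi_i(x_i)} terms,
# with phi_i(x_i) = beta_i * x_i as in Eq. (4).
x = [1.0, 2.0, 0.5]
odds = math.exp(beta0) * math.prod(math.exp(b * xi) for b, xi in zip(betas, x))
p = prob(x)
assert abs(odds - p / (1 - p)) < 1e-9  # multiplicative and additive forms agree
```

The same identity is what makes the GAM representation interpretable: each factor *e*^*ϕi*(*xi*) can be read off as the multiplicative effect of a single variable on the odds.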

#### *1.1. Related Work on Self-Explaining Neural Networks*

Interpretable neural network models have a long history, starting with generalised additive neural networks (GANNs) [6–8], also called self-explaining neural networks (SENNs) [9], which consist of a multilayer perceptron with modular structures that are not fully connected but involve sums of sub-networks, each representing a function of a single input variable or pair of variables [6]. However, they lack efficient methods to carry out model selection to avoid modelling spurious correlations by including too many variables.

Our method relies on the analysis of variance (ANOVA) decompositions [10]. Although ANOVA is well known in mainstream statistics, its potential to derive interpretable models from pre-trained black box machine learning algorithms has not been fully exploited. In his paper introducing gradient boosting machines, Friedman notes that partial dependence functions can help "interpret models produced by any black box prediction method, such as neural networks, support vector machines, etc." [11]. However, this referred to the visualisation of the model's dependence on covariates, which applies locally only at the data median, rather than building a predictor that applies globally and so can be used to make predictions over the complete range of input values with the same additive model.

Other algorithms to derive predictive additive models have been proposed recently. They are neural additive models (NAMs), where the univariate functions are each modelled with a separate multilayer perceptron or deep learning neural network [12], and explainable boosting machines (EBMs) [13], which include both univariate and bivariate terms. A recent refinement of these methods is the GAMI-NET [14]. This model estimates main (univariate) effects and pairwise interactions in two separate stages, building bespoke neural networks to model each effect and interaction. None of these models will open an existing black box, since they build a SENN structure first rather than applying function decomposition to a given multivariate function, as achieved with ANOVA.

Moreover, all the above models have limitations either in feature selection or in the structure of the model itself. In particular, NAMs favour a model structure that includes univariate functions for all the input variables and lack a clear process for selecting bivariate component functions. In contrast, EBMs incorporate model selection with statistical tests; in fact, ANOVA significance tests applied to partial dependence functions, proposed by [11], which are similar to the marginal functions used in Section 2.1.1 to calculate partial responses from ANOVA decompositions with the Lebesgue measure. This permits the inclusion of bivariate functions in the additive model. However, the EBM component functions are jagged because they are built from hyper box cuts in input space. The GAMI-NET requires explicit sparsity and heredity constraints, along with what is called marginal clarity, which is a penalisation term to enforce orthogonality between the main effects and the pairwise interactions. This is motivated by the functional ANOVA decomposition, implicitly using the marginal distribution, although it is not clear whether this observes the constraint raised in [11] to ensure that correlations among the input variables do not bias the orthogonality calculation. Our approach uses the ANOVA decomposition directly and so keeps the training process much simpler.

All of the above methods are stand-alone algorithms rather than explanations of predictions made by pre-trained black boxes. This is also the case for sparse additive models (SAMs) [15], where the component univariate and bivariate functions in a GAM are implemented with splines, in contrast to our use of neural networks, which are semi-parametric and, hence, less restrictive. Moreover, splines can over-regularise the model and miss important details in the data, as well as being inefficient for estimating bivariate terms due to a proliferation of spline parameters. A further model, sparse additive machines [16], derives GAM structures from SVM models. It is scalable and has a provable convergence rate that is quadratic in the number of iterations, but it is not probabilistic and does not include pairwise terms.

The motivation for considering ANOVA as a method to open black box models is that each measure used in ANOVA is closely related to an intuitive approach for the decomposition of multivariate functions into predictive models with fewer variables. The Dirac measure filters from the multivariate response precisely the terms in the Taylor series, centred at the data median, that depend on just one or two variables [17], while the Lebesgue measure marginalises the response surface over the remaining variables, leaving functions of one or two variables [18]. The proposed method is computationally efficient and stable for variable selection.

#### *1.2. Contributions to the Literature*

The main hypothesis of this paper is that the low-order functions derived by ANOVA from arbitrarily complex machine learning or other probabilistic classifiers contain sufficient information to open the black box models while retaining the classification performance measured by the area under the receiver operating characteristic curve (AUC). The resulting models are interpretable by design [19].

The proposed generic framework to extract GAMs from black boxes is termed partial responses in structured models (PRiSM). An instantiation of the framework has been demonstrated by applying the simplest measure used by ANOVA. This performed well on two medical data sets concerning intensive care [20] and heart transplants [21]. The focus of the latter is a detailed clinical interpretation of the partial response network (PRN), which is an anchored model, i.e., it combines functions restricted to be zero at the median values of the data. This paper presents a comprehensive treatment of the proposed method and contributes novelty in the following respects:

• Comprehensive presentation of the generic framework for deriving PRiSM models from arbitrary black box binary classifiers, reviewing the orthogonality properties of ANOVA for two alternative measures: the Dirac measure, which is similar to partial dependence functions in visualisation algorithms [11] and produces component functions that are tied to the data median; the Lebesgue measure, which involves estimates of marginal effects and is related to the quantification of effect sizes [7]. The method is tested on nine-dimensional synthetic data to verify that it retrieves the correct generating variables and achieves close to optimal classification performance;


The univariate and bivariate component functions representing the additive contributions to the logit of the model prediction are what we call partial responses. Note that while the univariate component functions are numerically identical to partial dependence plots, the bivariate functions are not, since they are obtained by removing the univariate dependence via the orthogonality properties of ANOVA decompositions. Moreover, the univariate and bivariate component functions are not used here purely for visualisation. Their values are the nomogram of the model, i.e., the ordinate of the figures shown later is in all cases the precise contribution of the variables to the model prediction, which, for every data point, is merely the summation of the contributions from all of the variables in the logit space.

The derived models retain a direct link between the input variables and the model predictions, meeting the requirements of the three "Cs" outlined earlier, and have comparable classification performance to the original black box models. This is demonstrated by application to three real-world data sets: UCI Pima Diabetes, UCI German Credit Card, and Statlog Shuttle. The first two were used as benchmark data sets, whilst the last was chosen as it was used in related work [12].

We refer to the overall framework to open pre-trained black boxes by deriving sparse models in the form of GAMs and SENNs as the integration of partial responses into structured models (PRiSM), Figure 1.

**Figure 1.** Schematic of the PRiSM framework. Any multidimensional decision function can be represented by a spectrum of additive functions, each with only one or two inputs. The final prediction of the probability of class membership, *P*ˆ(*C*|*X*), is given by the sum of the univariate and bivariate component functions, scaled by the coefficients *βxi*, *βxij* derived by the least absolute shrinkage and selection operator (LASSO). Since only univariate *ϕ*(*xi*) and bivariate *ϕ*(*xi*, *xj*) component functions are in the model, their shapes provide a route towards interpretation by end-users.

#### **2. Materials and Methods**

*2.1. Methods*

#### 2.1.1. ANOVA Decomposition

The first novelty of the paper is to apply an ANOVA decomposition [10] to pre-trained black box probabilistic binary classifiers in order to extract from the *logit(P(C|x))*, which is a multivariate function, component functions of fewer variables.

The ANOVA decomposition is defined as follows

$$\begin{split} \text{logit}(P(\mathbb{C}|\mathbf{x})) &\equiv \log\left(\frac{P(\mathbb{C}|\mathbf{x})}{1 - P(\mathbb{C}|\mathbf{x})}\right) \\ &= \phi_0 + \sum_{i} \phi_i(x_i) + \sum_{i \neq j} \phi_{ij}(x_i, x_j) + \dots + \sum_{i_1 \neq \dots \neq i_P} \phi_{i_1 \dots i_P}\left(x_{i_1}, \dots, x_{i_P}\right) \end{split} \tag{5}$$

where the general form of the terms in (5) is a recursive function of the nested subsets of the covariate indices {*i*1, ... , *iP*}, with the property that the term involving all of the covariates *xi* : *i* = 1, ... , *P*, where *P* is the dimensionality of the input data, is given by

$$\phi_{i_1 \dots i_P}\left(x_{i_1}, \dots, x_{i_P}\right) = \text{logit}\left(P\left(\mathbb{C}\left|x_{i_1}, \dots, x_{i_P}\right.\right)\right) - \sum_{\{i_1 \neq \dots \neq i_{P-1}\}} \phi_{i_1 \dots i_{P-1}}\left(x_{i_1}, \dots, x_{i_{P-1}}\right) - \phi_0 \tag{6}$$

Note that Decomposition (5) is an identity that exactly reproduces the values of the *logit*(*P*(*C*|*x*)) originally predicted by the black box classifier. We call the component functions *ϕi*1...*in*(*xi*1, ... , *xin*) partial responses, since they involve only a subset of the input variables.

The general form of the component terms is given by the following equations, which depend only on the chosen measure *μ*(*x*)

$$\phi_0 = \int_{[\mathbf{x}]^P} \text{logit}(P(\mathbb{C}|\mathbf{x}))\, d\mu(\mathbf{x}) \tag{7}$$

$$\phi_S(\mathbf{x}_S) = \int_{[\mathbf{x}]^{P-|S|}} \text{logit}(P(\mathbb{C}|\mathbf{x}))\, d\mu(\mathbf{x}_{-S}) - \sum_{T \subsetneq S} \phi_T(\mathbf{x}_T) \tag{8}$$

where *S* represents a subset of variables with dimensionality |*S*| ≤ *P*. The terms *xS* and *x*−*S* denote, respectively, the subspace spanned by *S* : |*S*| = *n* in Equation (6) and its complement −*S* : |−*S*| = *P* − *n*.

It follows from (7) and (8) that the terms *ϕ<sup>S</sup>* are normalised with respect to the chosen measure 

$$\int \phi_S(\mathbf{x}_S)\, d\mu(x_j) = 0, \text{ if } j \in S \tag{9}$$

and also orthogonal for disjoint variable sets *S* and *T*

$$\int \phi_S(\mathbf{x}_S)\, \phi_T(\mathbf{x}_T)\, d\mu(\mathbf{x}) = 0, \text{ if } S \neq T. \tag{10}$$

There are two natural choices of measure, each of which will define the functionality of each of the component terms *ϕ<sup>S</sup>* in response to either one or two arguments:

• Dirac measure

$$d\mu(\mathbf{x}) = \delta(\mathbf{x} - \mathbf{x}\_{\mathbf{c}})d\mathbf{x} \tag{11}$$

where *xc* is an arbitrary point called the anchor point. The partial responses become cuts through the response surface for the *logit*(*P*(*C*|*x*)).

• Lebesgue measure

$$d\mu(\mathbf{x}) = \rho(\mathbf{x})d\mathbf{x} \tag{12}$$

where *ρ*(*x*) is the density function of the variables in the argument of the integral. This measure calculates the weighted mean of the integrand.

In both cases, the data matrix *X* is first centred using the overall median of the data and scaled by the marginal standard deviation:

$$X \to (X - \text{median}(X))\, /\, \text{std}(X) \tag{13}$$

The absence of a variable now corresponds to fixing it at the median value, since the median point corresponds to a vector of 0s. Therefore, the logit then takes the value *logit*(*P*(*C*|0)). Similarly, if all of the variables except *xi* are set to their median values, then the corresponding values of *logit*(*P*(*C*|(0, . . . , *xi*, . . . , 0))) represent a function of just that one variable. The same principle applies when only two variables are not 0, then three variables, etc.
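The normalisation in Equation (13) can be sketched on a toy data matrix (the numbers below are arbitrary); after centring, the median of every column is 0, so an "absent" variable sits exactly at the anchor point:

```python
import statistics

# Toy data matrix X (rows = observations); the values are arbitrary.
X = [[1.0, 10.0],
     [2.0, 20.0],
     [3.0, 40.0]]

cols = list(zip(*X))
med = [statistics.median(c) for c in cols]
sd = [statistics.stdev(c) for c in cols]

# Eq. (13): centre on the median, scale by the marginal standard deviation.
Xn = [[(x - m) / s for x, m, s in zip(row, med, sd)] for row in X]

# The median of every normalised column is now 0 (the anchor point).
assert all(statistics.median(c) == 0.0 for c in zip(*Xn))
```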

The partial responses for the Dirac measure are calculated according to

$$\phi_0 = \text{logit}(P(\mathbb{C}|0)) \tag{14}$$

$$\phi_i(x_i) = \text{logit}(P(\mathbb{C}|(0, \dots, x_i, \dots, 0))) - \phi_0 \tag{15}$$

$$\phi_{ij}(x_i, x_j) = \text{logit}\left(P\left(\mathbb{C}\left|\left(0, \dots, x_i, \dots, x_j, \dots, 0\right)\right.\right)\right) - \phi_i(x_i) - \phi_j(x_j) - \phi_0 \tag{16}$$
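Equations (14)–(16) apply directly to any probabilistic classifier. The sketch below uses a hand-made stand-in for the black box (its probability function is an assumption for illustration, containing a known *x*0 · *x*2 interaction and no *x*0–*x*1 interaction) and checks that the bivariate partial response recovers exactly the interaction term:

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Stand-in black box: any predict-probability function would do. This one
# is a made-up sigmoid of 1.5*x0 - x1^2 + 0.5*x0*x2, on data assumed to be
# already median-centred as in Eq. (13).
def black_box(x):
    z = 1.5 * x[0] - x[1] ** 2 + 0.5 * x[0] * x[2]
    return 1.0 / (1.0 + math.exp(-z))

def f(x):
    return logit(black_box(x))

phi0 = f([0.0, 0.0, 0.0])  # Eq. (14): logit at the anchor (median) point

def phi_i(i, xi):
    # Eq. (15): cut through the response surface along dimension i.
    x = [0.0, 0.0, 0.0]
    x[i] = xi
    return f(x) - phi0

def phi_ij(i, j, xi, xj):
    # Eq. (16): bivariate cut with the univariate parts removed.
    x = [0.0, 0.0, 0.0]
    x[i], x[j] = xi, xj
    return f(x) - phi_i(i, xi) - phi_i(j, xj) - phi0

assert abs(phi_ij(0, 1, 1.0, 1.0)) < 1e-9        # no x0-x1 interaction term
assert abs(phi_ij(0, 2, 1.0, 1.0) - 0.5) < 1e-9  # recovers 0.5 * x0 * x2
```

Because the stand-in logit is known in closed form, the cuts can be verified by hand: the bivariate term is zero wherever the black box is additive in the pair, and equals the interaction term otherwise.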

In the case of the Lebesgue measure, the integrals in Equations (7) and (8) are calculated empirically using the training data, with sample size *N* observations

$$F_S(\mathbf{x}_S) = \frac{1}{N} \sum_{k=1}^{N} \text{logit}\left(P\left(\mathbb{C}\left|\mathbf{x}_S, \mathbf{x}_{-S}^k\right.\right)\right) \tag{17}$$

where the variables in the dimensions *xS* take any desired values, but those in the complement set with dimensions *x*−*S* are fixed at their actual values in the training set, *k* = 1, ... , *N* [11]. This corresponds to shifting all data points onto the coordinate(s) *xS*, so that in the summation (17), every data point has the same value of this input dimension while retaining the original values for all other coordinates.

The orthogonalised partial responses *ϕS*(*xs*) follow by using Equation (8).

$$\phi_0 = \frac{1}{N} \sum_{k=1}^{N} \text{logit}\left(P\left(\mathbb{C}\left|\mathbf{x}^k\right.\right)\right) \tag{18}$$

$$\phi_i(x_i) = F_i(x_i) - \phi_0 \tag{19}$$

$$\phi_{ij}(x_i, x_j) = F_{ij}(x_i, x_j) - \phi_i(x_i) - \phi_j(x_j) - \phi_0 \tag{20}$$
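A minimal sketch of the empirical estimates in Equations (17)–(19), again with a made-up black box (an assumption for illustration; here the logit is purely additive, so the marginalised partial response should recover the additive term up to the intercept):

```python
import math
import random

def logit(p):
    return math.log(p / (1.0 - p))

# Assumed stand-in black box: P(C|x) = sigmoid(x0 + 0.5*x1), so the exact
# logit is additive and phi_0dim(x0) should equal x0 minus its data mean.
def black_box(x0, x1):
    return 1.0 / (1.0 + math.exp(-(x0 + 0.5 * x1)))

random.seed(0)
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(500)]
N = len(X)

# Eq. (18): the intercept is the mean logit over the training data.
phi0 = sum(logit(black_box(a, b)) for a, b in X) / N

def F0(x0):
    # Eq. (17): clamp dimension 0 at x0, average over the training x1 values.
    return sum(logit(black_box(x0, b)) for _, b in X) / N

def phi_0dim(x0):
    # Eq. (19): orthogonalised univariate partial response for dimension 0.
    return F0(x0) - phi0

mean_x0 = sum(a for a, _ in X) / N
assert abs(phi_0dim(1.0) - (1.0 - mean_x0)) < 1e-9
```

Unlike the Dirac cuts, these estimates weight the response surface by the empirical density of the remaining covariates, as required by Equation (12).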

#### 2.1.2. Model Selection with the LASSO

The resulting terms in the truncated ANOVA decomposition, comprising only the univariate and bivariate terms in (5), need to be re-calibrated to maximise the predictive classification performance. In addition, since the higher-order terms are treated as noise, the remaining terms also need to be filtered to remove non-informative partial responses.

This is achieved through a second step involving the application of logistic regression with the LASSO [24], treating the *P*(*P* + 1)/2 terms in the truncated ANOVA decomposition as the new input variables. The *L*1 regularisation is robust for hard model selection, shrinking to zero the linear coefficients of the least informative variables, which are now partial responses.

Since the partial responses are generally non-linear functions of one or two variables, they are readily interpretable. This is not new, having previously been a widely used approach to visualise non-linear models with partial dependence functions that are functionally equivalent to ANOVA terms with the Dirac measure. What is new is the realisation that when these functions are calculated for the *logit*(*P*(*C*|*x*)) and used for prediction with tabular data, they can achieve comparable AUCs to those of the original black box models from which the partial responses are derived.
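The selection step can be sketched without any ML library: below, an L1-penalised logistic regression is fitted by proximal gradient descent (ISTA) to toy "partial response" inputs, only the first of which carries signal. This is a minimal stand-in for the Lasso of [24], not the authors' implementation:

```python
import numpy as np

def l1_logistic(Z, y, lam=0.2, lr=0.1, iters=3000):
    """Minimal L1-regularised logistic regression via proximal gradient
    descent; Z holds the partial responses evaluated on the training data."""
    n, d = Z.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))
        w -= lr * (Z.T @ (p - y) / n)
        b -= lr * float(np.mean(p - y))
        # Soft-thresholding implements the L1 penalty (hard model selection).
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w, b

# Toy partial responses: the class depends only on the first column.
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 3))
y = (Z[:, 0] > 0).astype(float)

w, b = l1_logistic(Z, y)
assert w[0] > abs(w[1]) and w[0] > abs(w[2])  # the informative term survives
```

The surviving coefficients play the role of the *βi*, *βij* in the re-calibrated GAM; uninformative partial responses are driven exactly to zero by the soft-thresholding step.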

#### 2.1.3. Second Training Iteration

If the original black box model is an MLP, it is possible to construct a GANN/SENN to replicate the output of the logistic Lasso by a replication of the weights from the MLP multiplied by the coefficients of the Lasso. This permits an additional step of refining the partial responses themselves by initialising the SENN at the operating point of the GAM.

Given the weights {*wij*, *bj*, *vj*, *v*0} of the original pre-trained fully connected MLP (Figure 2), with the inputs indexed by *i* and hidden nodes indexed by *j*, together with the coefficients {*β*0, *βi*, *βij*} fitted by the Lasso, the PRiSM model for the anchored decomposition can be exactly mapped onto an MLP structured in the form of a GANN/SENN, as follows.

**Figure 2.** The partial response network (PRN) has the modular structure typical of a self-explaining neural network. In particular, the figure illustrates the connectivity for a univariate function of input variables x1 and xp, and a bivariate term involving variables x2 and x3. Modelling the interaction term as orthogonal to univariate terms involving the same variables requires three blocks of hidden nodes, as explained in the main text. If there are univariate additive component functions involving either x2 or x3 these are added to the structure by inserting additional modules, as shown for x1 and xp.

(1) Univariate partial response corresponding to the input *xi*

This is shown in Figure 2 for input *x*1. Zero inputs for all other inputs will not contribute to the activation of the hidden nodes. The hidden layer weights *wij* connected to node *xi* remain the same as in the original MLP but the weights and bias to the output node need to be adjusted as follows:

$$v_j \to \beta_i \cdot v_j \tag{21}$$

$$v_0 \to \beta_i \cdot (v_0 - \text{logit}(P(\mathbb{C}|0))) \tag{22}$$

(2) Bivariate partial response for the input pair {*xi*,*xj*}

This is shown in Figure 2 for inputs *x*<sup>2</sup> and *x*3. This time, to replicate the partial response multiplied by the Lasso coefficient, it is necessary to add three elements to the structure, namely, a univariate partial response for each of the inputs involved and a coupled network that both inputs feed into together. We will use the generic input indices *k* and *l* to avoid confusion with the hidden node index *j.* The hidden layer weights once again remain unchanged from the original MLP. The output layer weights and bias for the network structure representing the univariate term associated with input *k* (and similarly for input *l*) are adjusted by

$$v_j \to (\beta_k - \beta_{kl}) \cdot v_j \tag{23}$$

$$v_0 \to (\beta_k - \beta_{kl}) \cdot (v_0 - \text{logit}(P(\mathbb{C}|0))) \tag{24}$$

whereas the weights and bias for the coupled network are changed according to

$$v_j \to \beta_{kl} \cdot v_j \tag{25}$$

$$v_0 \to \beta_{kl} \cdot (v_0 - \text{logit}(P(\mathbb{C}|0))). \tag{26}$$

(3) Finally, an amount is added to the total sum of the values calculated for the bias term in the structured neural network. This amount is equal to the intercept of the logistic Lasso, *β*0.

Equations (23)–(26) have the property that when an interaction between two variables is identified after the first application of the Lasso, its mapping onto the structured neural network involves also univariate modules, with the consequence that the second backpropagation step can potentially render the bivariate term not statistically significant and replace it with one or more univariate independent effects.
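The mapping in Equations (21) and (22) can be checked numerically: a univariate module built from the re-scaled output weights reproduces *βi* · *ϕi*(*xi*) exactly. The tiny MLP below is illustrative (the weights and the Lasso coefficient are made up, not taken from the paper):

```python
import math

# Toy pre-trained MLP: 2 inputs, 2 tanh hidden nodes (illustrative weights).
W = [[1.0, -0.5],   # W[j][i]: weight from input i to hidden node j
     [0.3,  0.8]]
b = [0.1, -0.2]
v = [0.7, -0.4]     # hidden -> output weights
v0 = 0.05           # output bias

def hidden(j, x):
    return math.tanh(W[j][0] * x[0] + W[j][1] * x[1] + b[j])

def mlp_logit(x):
    return sum(v[j] * hidden(j, x) for j in range(2)) + v0

logit0 = mlp_logit([0.0, 0.0])  # logit(P(C|0)) at the anchor point

beta_i = 1.3  # assumed Lasso coefficient for the univariate term of input 0

def senn_module_0(x0):
    # Eqs. (21)-(22): hidden weights unchanged; output weights scaled by
    # beta_i, bias shifted so the module outputs beta_i * phi_i(x0).
    v_new = [beta_i * vj for vj in v]
    v0_new = beta_i * (v0 - logit0)
    return sum(v_new[j] * hidden(j, [x0, 0.0]) for j in range(2)) + v0_new

x0 = 0.8
assert abs(senn_module_0(x0) - beta_i * (mlp_logit([x0, 0.0]) - logit0)) < 1e-9
```

The bivariate mapping of Equations (23)–(26) works the same way, with the coupled module carrying *βkl* and the two univariate modules carrying *βk* − *βkl* and *βl* − *βkl*, so that the univariate contributions are counted exactly once.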

#### 2.1.4. Summary of the Method

The following is the pseudo-code for PRiSM models, taking as a starting point a data set comprising a set of P-dimensional input variables normalised according to (13) and the corresponding black box model predictions.

Algorithm Partial Response Models *PRiSM(BB, D):*

**Input:** set *D* of training examples; predictions *P(C|x)* from a pre-trained black box model *BB.*

**1. ANOVA decomposition:** apply the recursion given by Equation (6) to the *logit(P(C|x))*.

This may use either of the suggested measures, Dirac or Lebesgue. The former leads to an anchored decomposition referenced to the choice of anchor point; the component functions generated by the latter are summations weighted by the density function of the covariates, *P(x)*.

**2. Model selection with the Lasso**: input the set of univariate and bivariate partial responses *ϕi*(*xi*) and *ϕij*(*xi*, *xj*) from (6), calculated over the training data set *D*, as new inputs for variable selection with the logistic regression Lasso.

The Lasso will also output a linear coefficient for each partial response, *βi* and *βij*, as well as an intercept *β*0, generally resulting in good calibration.

**Output** *prBB(BB, D)***:** this is the output of the Lasso in Step 2, which has the form of a GAM, shown in Equation (5), truncated to the selected subset of functions *ϕi*(*xi*) and *ϕij*(*xi*, *xj*):

$$\mathrm{logit} (prBB(\mathbb{C}|\mathbf{x})) \equiv \varphi(0) + \sum\_{i} \beta\_{i} \varphi\_{i}(x\_{i}) + \sum\_{i \neq j} \beta\_{ij} \varphi\_{ij}(x\_{i}, x\_{j}) \tag{27}$$

Each partial response is a non-linear function of its arguments. The model prediction is obtained by summing all partial responses weighted by the linear coefficients from Step 2, adding the intercept, and applying the sigmoid function, which inverts the *logit(P(C|x))*.

The predictions for anchored decompositions are indexed by the prefix *pr* followed by an abbreviation of the black box algorithm, e.g., prSVM and prGBM.

**3. Predictions with** *PRiSM models***:** given a test data point, the *ϕi*(*xi*) and *ϕij*(*xi*, *xj*) are calculated using Equations (14)–(16) or (17)–(20), and the predicted output follows from (27). The input variables are, therefore, directly linked to the predictions through interpretable functions.
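Step 3 amounts to a plain GAM evaluation of Equation (27); a minimal sketch, with the container format an illustrative assumption:

```python
import numpy as np

def prism_predict(x, beta0, uni, biv):
    """PRiSM prediction (Eq. 27). `uni` maps i -> (beta_i, phi_i)
    and `biv` maps (i, j) -> (beta_ij, phi_ij), where the phi's are
    the partial responses selected by the Lasso in Step 2; these
    dictionaries are an illustrative assumption."""
    logit = beta0
    for i, (beta, phi) in uni.items():
        logit += beta * phi(x[i])
    for (i, j), (beta, phi) in biv.items():
        logit += beta * phi(x[i], x[j])
    # The sigmoid inverts the logit to recover P(C|x).
    return 1.0 / (1.0 + np.exp(-logit))
```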

Anchored decomposition applied to the MLP:

By denoting the output *prMLP (MLP, D)* by *MLP-Lasso*, it is then possible to continue training as follows:

**4. Map the MLP-Lasso into a GANN/SENN**: this has the form of a GANN/SENN, meaning that it is not fully connected, as shown in Figure 2. The adjustments to the weights are explained in Equations (21)–(26) and in Section 2.1.3.

**Output Partial Response Network [PRN]:** having initialised a structured neural network in the previous step, so that it exactly replicates the component functions and output of the MLP-Lasso, back-error propagation is applied to continue the training of this network. The PRN is a probabilistic binary classifier, so training will use the log-likelihood cost function. Note that the component functions no longer conform with the requirements of an ANOVA decomposition as they will have been adjusted without the constraint of orthogonality.

**Output PRN-Lasso:** Steps 1 and 2 are then applied to the PRN instead of the original MLP. This generates a new set of partial responses *ϕ\*i*(*xi*) and *ϕ\*ij*(*xi*, *xj*) and corresponding coefficients *β\*i* and *β\*ij*, from which the model predictions follow by inserting these coefficients and partial responses into Equation (27).

The second training iteration enables the partial responses to be refined without being adjusted for input variables that were removed from the model by the Lasso in Step 2. Therefore, the PRN–Lasso–BB models will always comprise a subset of the variables in the PRN–BB models, but not necessarily with the same functional form. It is possible that some of the partial responses in the PRN–BB model are no longer selected and even that what may have started as a pure two-way effect can now be split into two independent main effects represented by univariate partial responses.

#### 2.1.5. Exact Calculation of Shapley Values

In common with GAMs generally, the interpretation of a PRiSM model is the model itself, since the components of the logit are additive and the amount contributed by each partial response is clearly quantified.

In addition, the relevance of individual input variables can be calculated using Shapley values [22]. This can be achieved for the overall prediction of the probability of class membership, but also for the logit, which places less emphasis on the non-linearity at the class boundary but takes greater account of the value of the logit across the full range of input values, which is important to ensure the good calibration of the final model.

When the logit has the form shown in Equation (27), the Shapley value *φi* corresponding to input variable *xi*, with *P* the number of input variables, can be efficiently computed by summing over all variable subsets that exclude input *i*, *S* ⊆ {1, . . . , *P*}\{*i*}, including *S* = ∅, with the usual formula

$$\phi\_i(x\_i) = \frac{1}{P} \sum\_{S \subseteq \{1,\dots,P\} \setminus \{i\}} \binom{P-1}{|S|}^{-1} \left( \mathrm{logit}(P(\mathbb{C}|S \cup \{i\})) - \mathrm{logit}(P(\mathbb{C}|S)) \right) \tag{28}$$

The linear terms in (27) simply add for every combination of input variables excluding *i*, of which there are $\binom{P-1}{|S|}$; therefore, at each data point the contribution to *φi*(*xi*) is just *βi*(*ϕi*(*xi*) − *ϕi*(0)). In the case of the bivariate terms, the calculation is similar but involves only $\binom{P-2}{|S|-1}$ combinations, giving

$$\frac{1}{P} \sum\_{j=1}^{P-1} \left[ \frac{\binom{P-2}{j-1}}{\binom{P-1}{j}} \right] = \frac{1}{P} \sum\_{j=1}^{P-1} \left[ \frac{(P-2)!j!}{(P-1)!(j-1)!} \right] = \frac{1}{P} \sum\_{j=1}^{P-1} \left( \frac{j}{P-1} \right) = \frac{1}{2} \tag{29}$$
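The combinatorial identity in Equation (29) can be checked numerically; a minimal sketch:

```python
from math import comb

def pairwise_share(P):
    """Left-hand side of Eq. (29): the total Shapley weight that a
    pairwise term contributes to one of its variables, which equals
    1/2 for any number of inputs P >= 2."""
    return sum(comb(P - 2, j - 1) / comb(P - 1, j)
               for j in range(1, P)) / P
```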

Therefore, the pairwise terms neatly share their impact among the Shapley values for each of the variables, yielding

$$\phi\_i(\mathbf{x}\_i) = \beta\_i(\boldsymbol{\varphi}\_i(\mathbf{x}\_i) - \boldsymbol{\varphi}\_i(0)) + \frac{1}{2} \sum\_j \beta\_{ij} (\boldsymbol{\varphi}\_{ij}(\mathbf{x}\_i, \mathbf{x}\_j) - \boldsymbol{\varphi}\_{ij}(0, \mathbf{x}\_j)) \tag{30}$$
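Equation (30) reduces the exact Shapley computation to a linear pass over the selected terms. A minimal sketch, assuming the partial responses are available as callables in illustrative dictionaries, and noting that (30) is written with *i* in the first slot of the pairwise term (the second slot is handled symmetrically):

```python
def gam_shapley(x, i, uni, biv):
    """Exact Shapley value of input i on the logit scale (Eq. 30).
    `uni`: i -> (beta_i, phi_i); `biv`: (k, l) -> (beta_kl, phi_kl);
    dictionaries of (coefficient, callable) pairs, an illustrative
    container choice."""
    val = 0.0
    if i in uni:
        beta, phi = uni[i]
        val += beta * (phi(x[i]) - phi(0.0))
    for (k, l), (beta, phi) in biv.items():
        # Each pairwise term shares its impact equally between the
        # two variables involved (the factor 1/2 from Eq. 29).
        if k == i:
            val += 0.5 * beta * (phi(x[i], x[l]) - phi(0.0, x[l]))
        elif l == i:  # Eq. (30) is written for the first slot; symmetric
            val += 0.5 * beta * (phi(x[k], x[i]) - phi(x[k], 0.0))
    return val
```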

#### 2.1.6. Experimental Settings

All algorithms were implemented in Matlab [25]. The MLP was trained with automatic relevance determination [26] implemented in Netlab [27], although conjugate gradient descent leads to similar results. The other machine learning algorithms predict scores for a class assignment. In this study, the scores are calibrated for probabilistic classification using the score as the sole input to a logistic regression model, which forms the starting point for PRiSM models by taking the logit in the same way as for the MLP. The SVM was fitted using *fitcSVM* with an RBF kernel and automatic optimisation. The GBM model is implemented with *fitcensemble*, which boosts 100 decision trees using the function *LogitBoost*.
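The score-calibration step can be sketched in Python; this stands in for the Matlab implementation described above, with a plain gradient-descent fit of the one-input logistic regression:

```python
import numpy as np

def calibrate_scores(scores, y, iters=200, step=0.5):
    """Calibrate raw classifier scores (e.g., SVM margins) with a
    one-input logistic regression fitted by gradient descent on the
    log-likelihood; a Python sketch, not the Matlab implementation
    used in the paper."""
    a, b = 0.0, 0.0                            # slope and intercept
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        a -= step * np.mean((p - y) * scores)  # gradient w.r.t. a
        b -= step * np.mean(p - y)             # gradient w.r.t. b
    return a * scores + b                      # calibrated logit
```

The returned logit is then decomposed in the same way as the logit of the MLP.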

#### *2.2. Data Sets Used*

#### 2.2.1. Synthetic Data

(a) 2D circle. We generated a sample of size n = 10,000 with unbalanced classes, as is the case for all synthetic data sets in this paper. In this case, the *logit* has two separate univariate components,

$$\text{logit}(P(\mathbb{C}|(\mathbf{x}\_1, \mathbf{x}\_2))) = 10 \times \left[ \left(\mathbf{x}\_1 - \frac{1}{2}\right)^2 + \left(\mathbf{x}\_2 - \frac{1}{2}\right)^2 - 0.08 \right] \tag{31}$$

This data set is similar to that used in [15], but instead of generating clean data by allocating different classes on either side of the boundary, we use noisy data by generating binary targets with a Bernoulli distribution, as for all the synthetic data sets that we report,

$$Y \sim \operatorname{Bernoulli}\left(P(\mathbb{C}|(x\_1, x\_2))\right). \tag{32}$$

The factor of 10 in Equation (31) reduces the amount of noise and so ensures a reasonable value for the AUC. The values of (*x*1, *x*2) are generated using *xi* = 0.5 × (*ui* + *w*), where both *ui* and *w* are uniformly distributed in the range [0,1], to demonstrate the prediction accuracy when the two input variables are correlated. There are only two univariate main effects and no interaction term.
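A sketch of the described generator, with illustrative function names:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_circle_data(n=10_000):
    """2D circle task (Eqs. 31-32): correlated uniform covariates
    and noisy Bernoulli targets. Names are illustrative."""
    u = rng.uniform(size=(n, 2))
    w = rng.uniform(size=(n, 1))
    x = 0.5 * (u + w)                 # correlated inputs in [0, 1]
    logit = 10.0 * (((x - 0.5) ** 2).sum(axis=1) - 0.08)  # Eq. (31)
    p = 1.0 / (1.0 + np.exp(-logit))
    y = rng.binomial(1, p)            # stochastic class labels
    return x, y
```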

(b) XOR function. The purest bivariate interaction is the XOR, represented in the multilinear form appropriate for continuous Boolean algebra [28],

$$P(\mathbb{C}|(\mathbf{x\_3}, \mathbf{x\_4})) = \mathbf{x\_3} + \mathbf{x\_4} - 2\mathbf{x\_3}\mathbf{x\_4}, \mathbf{x\_i} \in ]0, 1[\tag{33}$$

Each variable is generated from a uniform distribution in [0,1]. The corresponding logit is

$$\log \text{it}(P(\mathbb{C}|(\mathbf{x\_3}, \mathbf{x\_4}))) = \log \left( \frac{\mathbf{x\_3} + \mathbf{x\_4} - 2\mathbf{x\_3}\mathbf{x\_4}}{1 - \mathbf{x\_3} - \mathbf{x\_4} + 2\mathbf{x\_3}\mathbf{x\_4}} \right) \tag{34}$$

Therefore, $\mathrm{logit}(P(\mathbb{C}|(x\_3, \tfrac{1}{2}))) = 0$, making it a pure interaction for the ANOVA decomposition with the *Dirac measure* anchored at (1/2,1/2). Similarly, for the *Lebesgue measure,* it is readily shown that

$$\mathrm{logit}(P(\mathbb{C}|(x\_3, x\_4))) = -\mathrm{logit}(P(\mathbb{C}|(1 - x\_3, x\_4))) \tag{35}$$

and similarly for the other dimension; therefore, the integrals corresponding to the univariate terms vanish, once again leaving the pure interaction term.
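The XOR generator can be sketched likewise; the pure-interaction property follows because fixing either input at 1/2 gives a class probability of exactly 1/2, i.e., a logit of zero. Function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_xor_data(n=10_000):
    """XOR task (Eqs. 33-34): the multilinear form of continuous
    Boolean XOR with uniform covariates and Bernoulli labels."""
    x = rng.uniform(size=(n, 2))
    p = x[:, 0] + x[:, 1] - 2.0 * x[:, 0] * x[:, 1]   # Eq. (33)
    y = rng.binomial(1, p)
    return x, y, p
```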

(c) Logical AND function. This data set combines univariate and bivariate terms by generating data according to the logical AND function, which in continuous Boolean algebra is represented by the following atomic term:

$$P(\mathbb{C}|(\mathbf{x}\_5, \mathbf{x}\_6)) = \mathbf{x}\_5 \mathbf{x}\_6, \mathbf{x}\_i \in ]0, 1[\tag{36}$$

This time, the ANOVA expansion with the *Dirac measure* anchored at (1/2,1/2) is

$$\begin{array}{l} \mathrm{logit} \left( P(\mathbb{C}|(x\_5, x\_6)) \right) \\ = -\log(3) + \sum\_{i} \left[ \log \left( \frac{x\_i}{2 - x\_i} \right) + \log(3) \right] \\ + \left[ \log \left( \frac{(2 - x\_5)(2 - x\_6)}{1 - x\_5 x\_6} \right) - \log(3) \right] \end{array} \tag{37}$$

The *Lebesgue measure* yields univariate terms given by

$$\phi\_{i}^{\text{Lebesgue}} = \log \left( x\_{i} / (1 - x\_{i}) \right) + \frac{\log(1 - x\_{i})}{x\_{i}} + \mathrm{Li}\_{2}(1) \tag{38}$$

where *Li*2(1) is the polylogarithm function of second order evaluated at 1. The bivariate term is given by the explicit ANOVA decomposition in Equation (5) and does not reduce to a simpler algebraic form.

(d) Three-way interaction. The final synthetic data set cannot be modelled with univariate and bivariate terms only. The purpose is to see how well PRiSM models capture a higher-order effect.

$$P(\mathbb{C}|(\mathbf{x}\_{7}, \mathbf{x}\_{8}, \mathbf{x}\_{9})) = \mathbf{x}\_{7} \mathbf{x}\_{8} \mathbf{x}\_{9}, \mathbf{x}\_{i} \in ]0, 1[\tag{39}$$

Note that the complete set of input variables for the synthetic data set is calculated only once. In this way, the same sample of 9-dimensional input data will be used for all classifiers. Only the target classes differ, thus generating four separate binary classification tasks. In each case, two or three variables will carry signal and the others comprise noise. A minimum requirement of all classifiers is to identify the relevant input dimensions and discard the rest. In addition, since we have the data generators, we can calculate the optimal classification performance corresponding to allocating every data point to the correct class, irrespective of the stochastic label generated by the Bernoulli distribution.

#### 2.2.2. Real-World Data

A description of the variables included in the starting pool for model selection and any standardisation that was applied to them is provided below.

#### (a) Diabetes data set:

The Diabetes dataset [29,30] comprises measurements recorded from 768 women, who were at least 21 years old, of Pima Indian heritage, and tested for diabetes using World Health Organization criteria. One of the variables, "Blood Serum" Insulin, has significant amounts of missing data. These rows were removed along with all entries with missing values of "Plasma Glucose Concentration" in a tolerance test, "Diastolic Blood Pressure" (BP), "Triceps Skin Fold Thickness" (TSF) or "Body Mass Index" (BMI), resulting in a reduced data set with n = 532. In line with common practice, a subset was randomly selected for training (n = 314), and the remaining were used for testing (n = 268). The additional variables available are "Age", "Number of Pregnancies", and "Diabetes Pedigree Function" (DPF), a measure of family history of diabetes. A binary target variable indicated diabetes status, with a positive prevalence of 35.7%.

(b) Statlog German Credit Card data set:

We used the numerical version of the Statlog German Credit Card database [31], which contains n = 1000 instances and 24 attributes. The first 700 observations were used for training, with a prevalence of bad credit risks being 29.6%. The remaining 300 observations were used for testing, with a prevalence of bad risks of 31%. The data set was used in the form created for the benchmarking study Statlog, where three categorical variables ("Other Debtors", "Housing", and "Employment") were coded in binary form with multiple columns.

(c) Statlog Shuttle data set:

The Statlog Shuttle database [31,32] from NASA comprises 9 numerical attributes and an outcome label. It is split into 43,500 cases for training and 14,500 for testing. There are 7 outcomes, of which 21% are in the category "Rad Flow". The binary classification task is to separate this category, Class 0, from the others, assigned to Class 1. Given the strong imbalance between classes, the default accuracy for a null model, i.e., predicting the predominant class for all rows, is 79%. The target accuracy is 99–99.9%.

#### **3. Results**

The following sections compare the performance and characteristics of different PRiSM models obtained by opening a range of frequently used black box algorithms.

#### *3.1. Synthetic Data*

The purpose of the benchmarking on the synthetic data is to ascertain how close each machine learning classifier and the corresponding interpretable PRiSM models get to the optimal classification accuracy, which is obtained using the known class membership probabilities given by the generating formulae for class membership, notwithstanding the presence of noise in the targets.

The classification performance of frequently used machine learning models and their interpretable versions applied to the four synthetic sets are listed in Tables 1–4. The optimal AUC values are in bold, and the values below the confidence interval (CI) are in italics. Two-dimensional plots of the relevant variables from the nine input dimensions are plotted in Figures 3–5, showing the actual training data with Bernoulli noise and the ideal class allocations used to find the best achievable AUC. The generated data set was split into three parts for training, model parameter optimisation with out-of-sample data, and performance estimation in generalisation. It is interesting to see how much the optimal AUC varies between three slices from the same noisy data. This illustrates the importance of calculating confidence intervals. The model with marginally the best point estimate of the AUC for the optimisation data may not have the highest AUC estimated on the independent sample.

**Table 1.** Classification performance for the 2D circle measured by the AUC [CI]. The input variables x1 and x2 are ideally selected solely for their univariate responses.



**Table 2.** Classification performance for the XOR function measured by the AUC [CI]. The input variables x3 and x4 are ideally selected solely for their bivariate response.


**Table 3.** Classification performance for the logical AND function measured by the AUC [CI]. The input variables x5 and x6 are ideally selected with two univariate responses and a bivariate response.




**Table 4.** Classification performance for the three-way interaction measured by the AUC [CI]. Three input variables are involved, x7, x8, and x9.


**Figure 3.** Class allocation for the 2-circle synthetic data set as a function of x1 and x2 showing: (**a**) The stochastic class labels; and (**b**) The correct classes that are used to find the optimal AUC.

**Figure 4.** Class allocation for the XOR data set as a function of x3 and x4 showing: (**a**) The stochastic class labels; (**b**) The correct classes used to find the optimal AUC; (**c**) The two-way interaction term identified by the Dirac measure; and (**d**) The interaction estimated with the Lebesgue measure, which is almost identical to the curve in (**c**). Both surfaces are the only terms in the GAM and closely correspond to the logit of the ideal XOR prediction surface. The main difference to theory is that the values at the four corners saturate at finite values, whereas in theory they extend to infinity in both vertical directions. This, however, has little impact on the crucial region for classification, which is the class boundary.

**Figure 5.** Class allocation for the synthetic data set representing the logical AND as a function of x5 and x6 showing: (**a**) The stochastic class labels; and (**b**) The correct classes used to find the optimal AUC.

The MLP-derived PRiSM models identified the correct ANOVA components for all data sets, with both the Dirac and Lebesgue measures. The three-way interaction term is a product of three inputs, not a pure interaction term. Therefore, its decision boundary can be approximated even in the absence of a third-order term, with three univariate partial responses sufficient to get close to the optimal prediction accuracy. Note that the predicted response in Figure 4c,d for the XOR task is very close to the bilinear surface corresponding to the theoretically correct response given by multilinear algebra [28].

Some machine learning models can be prolific in model selection with the ANOVA decomposition, followed by the Lasso, for either measure. The models always include the relevant variables but may also suffer from overfitting. However, it is remarkable how the interpretable models frequently achieve AUC values within 1% of the optimal value.

The models that filtered out the correct number of components to model each data set all have consistent interpretations. The partial responses correspond to the data generators in the vicinity of the class boundary, although the responses tend to level off away from the boundary, where the precise value of the logit is less important since the class membership probabilities are close to zero or one.

#### *3.2. Real-World Data*

The benchmarking results for the interpretable models against the original black box classifiers are summarised in Table 5. Values below the CI of the AUC are in italics.

**Table 5.** Classification performance for the real-world data sets. The label 'D' indicates the number of input variables for the black boxes and component functions for the PRiSM models.


\* Indicates a two-stage model selection process, explained in the text.

All methods use the same data sets, and the AUCs are quoted for test data only. Measuring statistical significance with the McNemar test shows that the performance difference between any pair of models is not significant at the 5% level.

While the accuracy of all models is comparable, the PRiSM models use fewer variables and are intuitive to interpret. It is also apparent that the two different measures lead to very similar classification performances. The coefficients of the Lasso used for re-calibration are close to unity for all models.

The number of component functions in Table 5 shows the effect of variable selection by the Lasso. The Diabetes data set generates only univariate responses. However, the Credit Card and Shuttle data sets require two-way interactions, as well as univariate effects. Note that the Credit Card data set generates 300 partial responses to choose from.

The GAMs, seeded by the SVM and GBM, are calibrated by the Lasso, resulting in the prSVM and prGBM. The univariate and bivariate structure of these models can be used to define a PRN model, which is a SENN with MLP components, initialised either with random weights or with univariate and bivariate modules trained to replicate each of the selected partial responses. This replicates the PRN and, following orthogonalisation, the PRN–Lasso.

The sparsity of the models and their potential for interpretation are illustrated by the partial responses of two models, the MLP–Lasso and the PRN, shown in Figures 6–11. These functions are derived from the training data and are always used for prediction on out-of-sample data. The corresponding component functions for the other PRiSM models have similar values, although, if derived from random forests, e.g., in the case of the prGBM, they are stepwise constant rather than smooth. This is shown in [20] for a different data set.

**Figure 6.** Contributions to the logit from partial responses to the logit (left axis) for the Diabetes data set, obtained with the Dirac measure, overlapped with the histogram of the training data (right axis). The final partial responses derived at the second application gradient descent (solid lines) are shown alongside the partial responses from the original MLP (dashed lines). Five covariates are represented, namely (**a**) Pregnancies, (**b**) Glucose, (**c**) BMI, (**d**) DPF and (**e**) Age.

**Figure 7.** As for Figure 6, with the Lebesgue measure; the component functions of the GAM are very similar for both measures. They have a similar structure and range of contributions to the logit. Despite being fitted with a generic non-linear model, the MLP, several of the partial responses are linear. Variable "DPF" shows a saturation effect, as might be expected, while the log odds of "Age" as an independent effect peak around the age of 40. Note that data sparseness for higher values will result in greater uncertainty in the estimation of the partial response. The same five covariates are represented (**a**–**e**) as in Figure 6.

**Figure 8.** Partial responses for the German Credit Card data set, using the same notation as the previous figures. Four univariate responses and a bivariate response are shown namely for the covariates (**a**) Status of checking account, (**b**) Duration of loan, (**c**) Credit history and (**d**) Credit amount, together with (**e**) the pairwise interaction between Credit amount and Duration of loan.

**Figure 9.** As for Figure 8, with the Lebesgue measure for the same covariates in (**a**–**d**), but with two pairwise interactions involving the variables listed in (**e**,**f**). Despite the different nature of the two measures, they offer entirely consistent interpretations, with the only difference being the selection by the Lasso model of a second bivariate interaction term, albeit with a range in contribution to the logit that is five times smaller than for the interaction term involving "Credit amount" and "Duration".

**Figure 10.** Nomogram of the PRN–Lasso model obtained for the Statlog Shuttle data set using the Dirac measure with a training/test split of n = 43,500 and 14,500, respectively: (**a**) Shows the raw data for the two variables selected, which corresponds well with two partial responses in the final model, namely: (**b**) shows the main effect involving x9; and (**c**) plots the two-way interaction between the two variables in the model, x1 and x9.

Among the seven covariates in the Diabetes data set, five occurred together as univariate responses in all of the random initialisations for the MLP–Lasso, PRN–Lasso, and the two measures. They are "Pregnancies", "Glucose", "BMI", "DPF", and "Age". An interaction term involving "Glucose" and "DPF" was present in three random initialisations. The set of models obtained is, therefore, remarkably stable. The partial responses for the recurrent univariate effects are shown in Figures 6 and 7.

The German Credit Card data set is more challenging. Out of 24 variables, six were present in all initialisations for both measures: "Duration", "Credit history", "Savings accounts", "Period of employment", and the two variables labelled x16 and x17. In the case of the Lebesgue measure, three more variables recurred in all 10 initialisations, namely "Status of checking account", "Other instalment plans", and "Worker status". In addition, the variable "Credit amount" featured as a univariate or a bivariate term in eight initialisations. These ten variables were selected to obtain the models for which a selection of component functions is shown in Figures 8 and 9.

Four univariate functions for multi-valued input variables are consistently monotonically decreasing and very close to linear, suggesting that these indicators are well calibrated as independent effects on credit risk, quantifying reductions in risk with rising input values. "Duration" shows saturation in its contribution to risk for large values, and "Credit amount" has a non-linear response with a minimum value. The bivariate responses suggest that to optimise the overall calibration of the model, adjustments are required in addition to the main effects. This includes a risk reduction when the "Credit amount" and "Duration" are both high, and a slight enhancement when either is small compared with the median value.

After mapping the structure derived with the Dirac measure from the prSVM onto the PRN–Lasso, good discrimination was achieved with an AUC of 0.812 [0.755,0.869] with just ten univariate effects. They comprised the six variables identified also by the PRN, together with the variables 'Personal status', 'Property', and 'Other instalment plans'.

For both the Diabetes and Credit Card data, the other benchmarked algorithms, SVM and GBM, generally select more components than the MLP and have worse generalisation performance, as evident from Table 5. If the partial response models derived from each machine learning algorithm are mapped onto a SENN and further trained, then their performance becomes similar for all of the models and they select consistent input variables, although some models will include additional ones.

The scalability and power of the method can be illustrated using the Statlog Shuttle data set. This data set is challenging because all of the variables have non-normal distributions, often with very peaked histograms, hence very small entropy. Two of the variables, x5 and x9, have a Pearson correlation of −0.875.

When applying the MLP, the weight decay parameters estimated for variables x2, x4, and x6 are noticeably larger than the others, indicating that these variables are less informative. They were, therefore, removed from the data. In the case of the Dirac measure, univariate component functions for x1 and x9 were selected by the PRN–Lasso with an AUC of 0.996 [0.994,0.998]. Selecting just these two variables as the inputs resulted in the performance listed in Table 5, involving a univariate effect for x9 together with the interaction between x1 and x9. The Lebesgue measure behaved similarly under the same Lasso selection procedure, but also included a univariate effect for x1, albeit without an appreciable performance improvement.

The prSVM model selection process also followed a two-stage process, ending with the same two variables, x1 and x9, for both measures, each time involved in two univariate effects and a bivariate term. Interestingly, the prGBM model converged straight away on the two-component solution involving a univariate effect for x9 and an interaction between x1 and x9 with the Dirac measure; with the Lebesgue measure, it converged on two univariate effects.

#### **4. Discussion**

A potential advantage of the PRN model over other PRiSM models in their current formulation is that the PRN allows for a second step of training, where the univariate and bivariate responses are re-estimated without the constraints of the ANOVA framework. The PRN–Lasso then applies the ANOVA framework to the refined univariate and bivariate models obtained by the PRN, followed by re-calibration, using logistic regression with L1 regularisation.

We now turn to three key questions for explainable machine learning methods: the accuracy of the resulting models, their stability in model selection and their interpretability. In particular, accuracy and stability are critical requirements for any interpretable model.

#### *4.1. Predictive Accuracy*

The ability to model data depends primarily on the capacity of the machine learning algorithm to fit the data structure given the observational noise. Each of the methods shown is capable of fitting the benchmark data although some of the point estimates of the AUC are close to the confidence limit boundaries of the best performing methods. Therefore, the different methods have different efficiencies for modelling specific data sets. However, all of the models have comparable performance, which is uniformly high and with no evidence of an interpretability vs. performance trade-off.

#### *4.2. Stability*

The discussion of the empirical results shows that model selection is stable for multiple random initialisations of the MLP algorithm. This suggests that the PRiSM framework resolves two major limitations of the MLP: different predictions for multiple initialisations and lack of interpretation. It turns out that re-shaping the MLP to become a SENN also stabilises the predictions for different initialisations.

Stability is also good between models, with consistency between the input variables selected by all of the methods. This is perhaps most striking for the Shuttle data, where all of the methods picked the two key variables and identified an important interaction between them from a highly non-linear but remarkably noise-free set of measurements.

#### *4.3. Interpretability*

Interpretability can be assessed within a formal framework involving the three Cs of interpretability [23]:

	- They are sufficiently correct to ensure good calibration for all data sets. This means that deviations from the theoretical curves for the synthetic data occur in regions where the model is close to saturated, i.e., making predictions close to 0 or 1;
	- The label coherence of the instances covered by the explanation is assured by the shape of the component functions so that the neighbouring instances have similar explanations.

The partial responses for the real-world data sets are plausible. In the case of the medical and credit card classifiers, some variables show remarkably linear dependence over their full range, while others are monotonic but saturate, levelling off beyond a certain point. In some cases, there is a turning point, and, interestingly, this was seen consistently for the two measures, e.g., for the added risk associated with credit amount, shown in Figures 8d and 9d. There is also a clear impact from data sparsity, which causes variability in the component functions in the less densely sampled regions of the data.

A more thorough appraisal of the plausibility of a PRiSM model using the Dirac measure applied to heart transplants is discussed in [21].

Finally, we note that PRiSM models are counterfactual because their predictions are directly connected to the input values. In the case of the PRN, the logit of the probabilistic prediction is simply the sum of the univariate and bivariate responses, whereas for the MLP–Lasso, PRN–Lasso, and the remaining PRiSM models seeded by other machine learning algorithms, the prediction is the sum of the response functions re-scaled by the linear coefficients of the Lasso.

#### **5. Conclusions**

We propose ANOVA decompositions of multivariate logit functions into sums of functions of fewer variables as a computationally efficient way to open probabilistic black box binary classifiers. Empirical results on the synthetic and real-world data show that the resulting interpretable models do not suffer from the interpretability vs. performance trade-off when applied to tabular data. Moreover, two alternative measures, Dirac and Lebesgue, lead to consistent interpretations for any given data set. The proposed method is accurate, stable, and scalable. Benchmarking it against a range of machine learning algorithms confirms this.

This paper formalizes the complete framework for the derivation of GAMs from black box classifiers, links the formalism to a commonly used attribution measure, Shapley values, and demonstrates its compliance with a user-led interpretability framework [23]. The paper extends previous results from related work focusing on clinical interpretations of a particular realization of the framework with anchored decomposition [20,21]. An unexpected finding of this study is that although the two measures are distinct, in that the Dirac measure represents a cut through a response surface at a particular point in the data and so is dependent on the choice of anchor point, while the Lebesgue measure integrates the surface over the range of the data and so is closer to the evaluation of size effects, both measures lead to similar interpretable models. This result is encouraging and suggests that the PRiSM framework may be a viable method to derive globally interpretable models from arbitrary binary classifiers.

The resulting partial responses form a nomogram, which is a broadly used method of communicating complex models to users without the need for a detailed mathematical formulation [33]. We show that the components in the nomogram of a GAM agree exactly with the Shapley values, which are increasingly used for the explanation of machine learning in some high-stakes applications [34].

The component functions derived with the two measures, while close, are not identical. Further work is required to explore with end-users the preference for either measure and by which model they are seeded. In addition, confidence intervals for the univariate and bivariate terms need to be quantified. There are also clear parallels with regression models and a possible extension of the binary classifier to survival modelling within the framework of partial logistic cost functions [35]. Note that the method applies to any classifier that predicts the probability of class membership since it does not use the internal structure of the classifier but only the overall response function.

**Author Contributions:** P.J.G.L. conceptualised the method and led the study. P.J.G.L., S.O.-M. and I.O. implemented the code and ran the experiments. B.W. contributed to the discussions, implementation, and experiments. All authors evaluated the results. P.J.G.L., S.O.-M. and I.O. drafted the early versions of the manuscript. All authors contributed to the writing, reviewing, and editing, and approved the final manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was partially funded by Liverpool John Moores University via a PhD scholarship.

**Data Availability Statement:** The real-world data analysed in the current study are available from Kaggle, the UCI Machine Learning Repository, and PhysioNet, as follows: Diabetes (Pima Indians diabetes database): https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database (accessed on 11 October 2021). German Credit Card (Statlog German credit card data set): https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data) (accessed on 11 October 2021). Statlog Shuttle data set: https://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle) (accessed on 11 October 2021). Medical Information Mart for Intensive Care (MIMIC-III): https://physionet.org/content/mimiciii/1.4/ (accessed on 11 October 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
