*Article* **A Rotational Invariant Neural Network for Electrical Impedance Tomography Imaging without Reference Voltage: RF-REIM-NET**

**Jöran Rixen 1,\*, Benedikt Eliasson 1, Benjamin Hentze 1,2, Thomas Muders 2, Christian Putensen 2, Steffen Leonhardt 1 and Chuong Ngo 1**

	- 53127 Bonn, Germany; thomas.muders@ukbonn.de (T.M.); christian.putensen@ukbonn.de (C.P.)

**Abstract:** *Background*: Electrical Impedance Tomography (EIT) is a radiation-free imaging technique. However, as the inverse problem of EIT is non-linear and ill-posed, the reconstruction of sharp conductivity images poses a major problem. With the emergence of artificial neural networks (ANN), their application in EIT has recently gained interest. *Methodology*: We propose an ANN that can solve the inverse problem without the presence of a reference voltage. At the end of the ANN, we reused the dense layers multiple times, considering that the EIT exhibits rotational symmetries in a circular domain. To avoid bias in training data, the conductivity range used in the simulations was greater than expected in measurements. We also propose a new method that creates new data samples from existing training data. *Results*: We show that our ANN is more robust with respect to noise compared with the analytical Gauss–Newton approach. The reconstruction results for EIT phantom tank measurements are also clearer, as ringing artefacts are less pronounced. To evaluate the performance of the ANN under real-world conditions, we perform reconstructions on an experimental pig study with computed tomography for comparison. *Conclusions*: Our proposed ANN can reconstruct EIT images without the need of a reference voltage.

**Keywords:** artificial intelligence; deep learning; Electrical Impedance Tomography; lung imaging; cardiopulmonary monitoring

#### **1. Introduction**

Electrical Impedance Tomography (EIT) enables the non-invasive visualization of the dielectric properties of a medium of interest. EIT has a wide range of applications, including the status monitoring of concrete [1], the monitoring of semiconductor manufacturing [2], and observing cell cultures [3]. In the medical domain, the applications are broader, and include the monitoring of lung recruitment and collapse [4], lung ventilation [5] and perfusion, the monitoring of 3D brain activity [6], size and volume estimation of the bladder [7], breast cancer imaging [8], and cardiopulmonary monitoring [9]. Here, EIT can be used to assess metrics such as regional ventilation, end-expiratory lung volume, compliance, regional respiratory system compliance, and regional pressure–volume curves [9].

The versatility of EIT stems from the fact that the measurements can be made noninvasively and inexpensively. For an image to be reconstructed, electrodes need to be placed around the domain. Small alternating currents with frequencies in the range of 100 kHz are fed through these electrodes. Then, the voltage across the electrodes is measured, and an image is reconstructed. Despite the advantages of EIT, it has one major drawback: it suffers from a relatively low spatial resolution.

**Citation:** Rixen, J.; Eliasson, B.; Hentze, B.; Muders, T.; Putensen, C.; Leonhardt, S.; Ngo, C. A Rotational Invariant Neural Network for Electrical Impedance Tomography Imaging without Reference Voltage: RF-REIM-NET. *Diagnostics* **2022**, *12*, 777. https://doi.org/10.3390/diagnostics12040777

Academic Editors: Sameer Antani and Sivaramakrishnan Rajaraman

Received: 11 February 2022 Accepted: 19 March 2022 Published: 22 March 2022


**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

This issue is due to the fact that EIT image reconstruction belongs to the class of inverse problems [10]. Large changes in the conductivity of the medium may lead to only small changes in the voltage measurements. To still be able to solve the problem, different types of algorithms have been proposed in the literature. From a mathematical perspective, three different types of algorithms can be distinguished.

The first set of algorithms is variational regularization methods. Their goal is to minimize a cost function that contains two parts. First, the physical behavior of the medium of interest is modeled. Given a set of voltage measurements, the algorithm helps to find the best fit for the conductivities that could produce these voltages. Second, the regularization strategy is applied, which plays a crucial role in finding a valid solution. Two common examples of regularization strategies are total variation [11] and the Tikhonov regularization [12].

The second type of algorithms is statistical inversion methods. Here, image reconstruction is modeled as a problem of statistical inference. The measurements and conductivities are modeled as random variables from which an a posteriori distribution can be estimated, through, e.g., Markov Chain Monte Carlo iterations. From this, the conductivity can be derived [13]. This can be accomplished, by, for example, first obtaining a starting distribution through the one-step Gauss–Newton method. Thereafter, Markov Chain Monte Carlo methods can be used to refine the starting distribution [14].

The final type is direct inversion algorithms. In these methods, the problem is analyzed through the partial differential equations governing the system behavior. From this, a solution strategy is developed. An example of these kinds of methods is the D-Bar algorithm [15].

Artificial neural networks belong to the variational regularization methods, as they solve the optimization problem once during training and then act like a complex look-up table. The regularization performed by artificial neural networks is not straightforward. First, the neural network architecture provides part of the regularization: a very deep architecture may provide sophisticated results on the training data set, but may lead to profound overfitting, such that slightly different data yield far worse results. The second part of the regularization comes from the training data. There is no reference technique to capture the conductivity distribution of body tissue; thus, in EIT, the training data are simulated with the help of, for example, finite element method (FEM) software such as EIDORS [16]. However, for simulations, a multitude of assumptions have to be made: What does the model shape look like? What are the electrode positions? Do they change? What shape do the conductivity enclosures have? What is the range of conductivity? All of these assumptions act as a kind of regularization.

Artificial neural networks are beginning to gain more relevance in the field of EIT. In 2017, Kłosowski and Rymarczyk [17] presented an ANN with fully connected layers and convolutional layers. However, the proposed ANN can only reconstruct single targets. Their outputs are the coordinates and the radius of the conductivity enclosure. Other approaches used ANNs to enhance the reconstructions of traditional EIT reconstructions [18]. In 2019, Hu et al. [19] used the spatial invariant properties of the EIT to improve upon these results. However, to aid in the reconstruction, their approach is based on calibration. Thus, their artificial neural network is not usable when the background data are missing. By contrast, Chan et al. [20] proposed a network which does not need this preprocessing. However, the structure of the artificial neural network does not account for the symmetry of EIT measurements. We settled for artificial neural networks, as they have been used in the past within the domain of EIT and show the greatest potential due to their ability to recreate non-linear functions.

In the following, we propose an artificial neural network structure which can reconstruct images without dependence on a reference voltage, while exploiting the rotational symmetry of adjacent–adjacent EIT measurements. We call this structure the **R**eference-**F**ree **R**otational **E**lectrical **I**mpedance **M**ap **Net**work (RF-REIM-NET).

The novelties of this research are:

- an ANN that reconstructs absolute EIT images without the need for a reference voltage;
- a fully connected stage whose layers are reused across all 16 injection pairs, exploiting the rotational symmetry of adjacent–adjacent measurements;
- a training data set with a conductivity range deliberately wider than that expected in measurements, to avoid bias;
- an *α*-blending augmentation that creates new training samples from existing simulated data.

#### **2. Materials and Methods**

#### *2.1. Fundamentals*

In EIT, the goal is to find an optimal conductivity distribution given a set of voltage measurements. When using variational reconstruction methods, this is expressed as

$$\sigma_{\text{rec}} = \arg\min_{\sigma} \frac{1}{2} \left\| F(\sigma) - V_{\text{meas}} \right\|^2 + \lambda \left\| L\sigma \right\|^2,\tag{1}$$

where *σ* is the conductivity, *F*(*σ*) is the forward model, *V*meas is the measured voltage, *λ* is the weighting of the regularization term and *L* is the regularization matrix. When using artificial neural networks (ANNs), the general scheme of this minimization still holds; however, it is achieved differently. While variational reconstruction methods minimize Equation (1) for each measurement, ANNs perform the minimization on a given dataset. This can be formulated as

$$\arg\min \frac{1}{2} \left\| Y'(V_{\text{meas}}) - Y \right\|^2,\tag{2}$$

where *Y* denotes the ground truth value and *Y*′(*V*meas) denotes the ANN's output for the given input. During runtime, the ANN behaves deterministically like a look-up table. Assuming that the dataset is representative of real measurements, the ANN thus still minimizes the error on actual measurements.

#### *2.2. Electrical Impedance Maps*

Hu et al. [19] pointed out the advantages of packing EIT measurements into an electrical impedance map (EIM). EIMs can be used to represent EIT data in adjacent–adjacent measurement mode. For 16 electrodes, the data are represented in a 16 × 16 matrix; see Figure 1. Along the matrix rows are the measurement electrode pairs, while the excitation electrode pairs are arranged along the columns. *EIM*[*j*, *k*] contains the measurement of the *j*th electrode pair while the *k*th electrode pair drives the current. Since four-probe measurements are used, voltages from injecting electrodes cannot be used. At those positions, the *EIM* matrix is filled with zeros, causing the superdiagonal, diagonal and subdiagonal elements to be 0.

When a conductivity distribution is rotated by an angle of 2*πk*/16, where *k* is an integer, the features of the *EIM* do not change; they are only moved diagonally across the map. Thus, a convolutional ANN can extract features from the EIM independently of rotation.
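For illustration, the assembly of an EIM can be sketched as follows. The assumed ordering of the 208 voltages (injection pair by injection pair, 13 usable measurement pairs each) is ours; actual devices and EIDORS may order channels differently.

```python
import numpy as np

def build_eim(voltages):
    """Arrange 208 adjacent-adjacent voltages into a 16 x 16 EIM.

    `voltages` is assumed to be ordered injection pair by injection pair,
    with the 13 usable measurement pairs per injection. Entries involving
    an injecting electrode (diagonal, sub- and superdiagonal, with
    wrap-around at the corners) are left at zero.
    """
    eim = np.zeros((16, 16))
    v = iter(voltages)
    for k in range(16):            # injection pair index (column)
        for j in range(16):        # measurement pair index (row)
            # skip measurement pairs sharing an electrode with the injection pair
            if min((j - k) % 16, (k - j) % 16) <= 1:
                continue
            eim[j, k] = next(v)
    return eim
```

With 3 excluded measurement pairs per injection, 16 × 13 = 208 entries are filled, matching the input size mentioned in Section 2.4.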


**Figure 1.** A 16 × 16 electrical impedance map (EIM) arrangement from an adjacent injection pattern. The zeros represent values which are not gathered from adjacent–adjacent measurement mode, as at least one electrode is used for current injection.

#### *2.3. Training Data Set*

In machine learning, the data set used for training is an important part of the algorithm's performance [21]. For EIT, there is no general high-resolution ground truth dataset. Instead, the data have to be carefully designed. It is easy and tempting to craft a data set that gives meaningful results on the available test data. If there is a relatively narrow band of possible conductivities in the test data, using this conductivity band in the simulated training data would bring a bias to the network—it might look better than it actually is.

To avoid this fallacy, we designed our data with as few assumptions as possible. We used FEM simulations to create the training data. These simulations were executed using EIDORS [16]. The first practical constraint was simulation time: in general, a higher mesh density improves the quality of the simulations, but the time taken for meshing and computation increases non-linearly. Thus, we used a mesh density of 0.075 and a model radius of 28, as this is a feasible trade-off between simulation quality and computation time. Our domain shape was cylindrical. We used 16 electrodes, each with a height of 40 mm and a width of 20 mm, placed equidistantly around the domain. This setup was chosen as it imitates the typical measurement configuration of the clinically available device for thoracic imaging from Draeger (*Draeger Pulmo Vista 500*, Draeger Medical GmbH, Lübeck, Germany).

#### 2.3.1. Basic Object Shapes

To create conductivity enclosures, we used three different basic shapes: an ellipsoid, a cube and an octahedron. The basic size of these objects is 1 in all directions, and their center of gravity is at the origin of the coordinate system. To save computation time, we did not re-mesh each conductivity enclosure from scratch. Instead, we created a mask for each conductivity enclosure and then changed the conductivities of the mesh elements inside the mask *m*. The formulas for the three basic shapes are given as:

$$\textit{mask}_{\text{sphere}} = (x^2 + y^2 + z^2) \le 1\tag{3}$$

$$\textit{mask}_{\text{cube}} = \max(|x|, |y|, |z|) \le 1\tag{4}$$

$$\textit{mask}_{\text{octahedron}} = |x| + |y| + |z| \le 1\tag{5}$$
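The three masks of Equations (3)–(5) translate directly into code; a minimal sketch, assuming the mesh coordinates are given as an (N, 3) array. The unit sphere becomes an ellipsoid once the anisotropic scaling of the next subsection is applied.

```python
import numpy as np

def shape_mask(points, shape):
    """Masks for the three basic unit shapes of Equations (3)-(5).

    `points` is an (N, 3) array of x, y, z coordinates in the frame
    of the basic shape.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    if shape == "sphere":             # becomes an ellipsoid after scaling
        return x**2 + y**2 + z**2 <= 1
    if shape == "cube":
        return np.maximum(np.abs(x), np.maximum(np.abs(y), np.abs(z))) <= 1
    if shape == "octahedron":
        return np.abs(x) + np.abs(y) + np.abs(z) <= 1
    raise ValueError(f"unknown shape: {shape}")
```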

2.3.2. Transformation of the Basic Objects

Only inserting the same shape at the same place in the FEM model would be of no use for real-world reconstructions. Thus, the basic shapes have to be transformed. Our transformation involves the translation, rotation and scaling of the enclosures. This can be mathematically described as:

$$v' = \left(R\,(v - t)\right) \oslash s,\tag{6}$$

where *v* = {*x*, *y*, *z*} is the coordinate vector, *v*′ is the transformed coordinate vector, *t* ∈ R<sup>3</sup> is the translation vector, *R* ∈ R<sup>3×3</sup> is the rotation matrix, *s* ∈ R<sup>3</sup> is the scaling vector and ⊘ denotes element-wise division.

The positioning of the enclosures is important. As the ANN should be able to detect any conductivity enclosures in the domain with the same quality, the distribution of the object's center of gravity should be uniform across the domain. Thus, we sampled the values of *t* from a uniform distribution, such that every component of *t* = {*x*, *y*, *z*} lies well inside the domain boundaries. Figure 2 gives a visual example of the transformations applied to the data.

Each entry of the scaling vector *s* is uniformly sampled between 10% and 80% of the model radius.

The angle of the rotation matrix *R* is uniformly sampled from [0, 2*π*), and thus the basic shape can be rotated in any direction. This enables the ANN to learn features that are valid for a variety of positions and orientations.

**Figure 2.** An example of the random transformation used in dataset generation, applied to a circular shape. (**a**) Circle with offset. (**b**) Circle with offset and scaling. (**c**) Circle with offset, scaling and rotation.
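The transformation of Equation (6) can be sketched as follows; the parameterization of *R* by three axis rotations and their composition order are illustrative assumptions, not taken from the text.

```python
import numpy as np

def transform_points(points, t, angles, s):
    """Map mesh coordinates into the frame of a basic unit shape (Eq. (6)).

    A mesh point v lies inside the transformed enclosure when
    (R (v - t)) / s lies inside the basic unit shape. `angles` are
    rotations about the x, y and z axes (an assumed parameterization).
    """
    ax, ay, az = angles
    rx = np.array([[1, 0, 0],
                   [0, np.cos(ax), -np.sin(ax)],
                   [0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0, np.sin(ay)],
                   [0, 1, 0],
                   [-np.sin(ay), 0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0],
                   [np.sin(az), np.cos(az), 0],
                   [0, 0, 1]])
    r = rz @ ry @ rx
    # (points - t) @ r.T applies R to each row; division by s is element-wise
    return ((points - t) @ r.T) / s
```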

#### 2.3.3. Conductivity Range

Another important degree of freedom is the conductivity range used. When using EIT tanks for testing, the conductivity range of the test data is typically known. Thus, it would be very tempting to just use this conductivity range for the training of the ANN. However, in practice, the conductivity is typically not known to this level of detail. Across a slice of the chest, conductivity values can range from 3.5 × 10<sup>−3</sup> S/m (cortical bone) up to 4.64 × 10<sup>−1</sup> S/m (deflated lung) [22]. We used a range of 1 × 10<sup>−5</sup> S/m to 1 S/m for the background conductivity, as this covers the conductivity values typically encountered in chest measurements, while at the same time providing a margin well outside them to improve generalization. The values were sampled uniformly from a logarithmic arrangement of the mentioned conductivity range, also known as a reciprocal or log-uniform distribution. This was chosen because the ANN should be able to differentiate objects that are one order of magnitude bigger than the background, regardless of the actual background conductivity.

The next step is an appropriate choice of the enclosure conductivities. As mentioned, it is important that the ANN is able to distinguish conductivity contrasts. At the same time, the ANN should distinguish those contrasts symmetrically at the lower and upper bound of the conductivity range. To achieve this, the enclosure's conductivity is chosen relative to the background conductivity. This ensures that the ANN has no bias towards conductivity contrasts higher or lower than the background. Thus, the enclosure conductivity is obtained by multiplying the background conductivity with values from a range of 1 × 10<sup>−2</sup> to 1 × 10<sup>2</sup>. Again, these are sampled uniformly from the logarithmic arrangement of those values.
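The log-uniform sampling of background and enclosure conductivities described above can be sketched as follows; the helper name `log_uniform` is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_uniform(low, high, size=None, rng=rng):
    """Sample log-uniformly (reciprocal distribution) between low and high."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size)

# background conductivity in S/m, relative contrast factor for the enclosure
background = log_uniform(1e-5, 1.0)
contrast = log_uniform(1e-2, 1e2)
enclosure = background * contrast
```

Sampling in log space gives every order of magnitude equal probability, so contrasts above and below the background are equally likely.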

In a real-world setting, conductivities are rarely perfectly homogeneous across a tissue type. Because of this, the enclosures, as well as the background, are perturbed. We scale each node of the FEM model by a different value, drawn from a Gaussian distribution with a *mean* of 1. For each training sample and chosen conductivity value, a different standard deviation (*std*) from 1 × 10<sup>−8</sup> to 1 × 10<sup>−2</sup> was chosen. Once the *std* is chosen, the values of one conductivity are perturbed by multiplication with the sampled values.

#### 2.3.4. Electrode Contact Impedance

Another effect to consider is the electrode contact impedance. Although the adjacent drive pattern used here relies solely on four-probe measurements and thus reduces the effect of electrode contact impedance, we included the effect in our training data. We multiplied EIDORS' default contact impedance by a value randomly sampled from a Gaussian distribution with a *mean* of 1. To simulate high and low differences, we sampled the values from three different distributions with an *std* of 1 × 10<sup>−5</sup>, 1 × 10<sup>−3</sup> and 1 × 10<sup>−1</sup>.

#### 2.3.5. Measurement Noise

ANNs typically struggle with generalizing learned samples to cases that the ANN has not yet seen [23]. To tackle this problem, further data augmentation strategies need to be used. While the previously mentioned steps required new simulations for each training sample, the following steps rely on already simulated data. This saves computation time.

EIT measurements can be affected by several sources of noise. Paired with the ill-conditioned nature of the EIT problem, this can cause artifacts in the reconstruction. Often, reconstruction algorithms have a hyperparameter which, in essence, balances robustness to noise against the quality of the reconstruction. For ANNs, the sensitivity to noise can be adjusted through the noise in the training data.

A major component in EIT systems is the analog-to-digital converter; its noise consists primarily of thermal, jitter, and quantization noise [24]. The first two depend on the magnitude of the signal: the greater the signal, the greater the noise. We can model this by multiplying the noise-free signal with a factor drawn from a Gaussian distribution:

$$U^{i,j}_{\text{ther,jit}} = U^{i,j} \cdot n_{\text{mult}}, \quad n_{\text{mult}} \sim \mathcal{N}(1, \sigma^2), \; \forall i, j \in \{1, \dots, 16\},\tag{7}$$

where *U*<sup>i,j</sup><sub>ther,jit</sub> is the thermal and jitter noise-affected measurement, *U*<sup>i,j</sup> is the noise-free measurement and *n*mult is the noise sampled from a normal distribution. The quantization noise does not depend on the signal level and can be modeled by adding a noise term to the voltage signals:

$$U^{i,j}_{\text{quant}} = U^{i,j} + n_{\text{add}}, \quad n_{\text{add}} \sim \mathcal{N}(0, \sigma^2), \; \forall i, j \in \{1, \dots, 16\},\tag{8}$$

where *U*<sup>i,j</sup><sub>quant</sub> is the quantization noise-affected measurement and *n*add is the noise sampled from a normal distribution. However, there is still another source of noise: different measurement channels of a given EIT system can have different gains, due to differing gains in the multiplexers [25]. This noise can be described through

$$U^{i,j}_{\text{gain}} = U^{i,j} \cdot n^{i}_{\text{gain}}, \quad n^{i}_{\text{gain}} \sim \mathcal{N}(1, \sigma^2), \; \forall j \in \{1, \dots, 16\}.\tag{9}$$

Note that this noise affects one whole row of the EIM at once, whereas in Equation (7) every entry is affected individually. For the additive noise *n*add, an *std* of 1 × 10<sup>−8</sup> was chosen. For the multiplicative noise *n*mult, an *std* of 1 × 10<sup>−6</sup> was chosen.
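The three noise models can be applied to a simulated EIM as sketched below. The default *std* values for the multiplicative and additive noise follow the text; the gain-noise *std* is not specified in the text and the default used here is an assumed placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_measurement_noise(eim, std_mult=1e-6, std_add=1e-8, std_gain=1e-2,
                          rng=rng):
    """Apply the three noise models of Equations (7)-(9) to one EIM.

    Multiplicative thermal/jitter noise perturbs every entry individually,
    additive quantization noise is independent of the signal level, and a
    single gain factor per row models channel-gain mismatch. `std_gain`
    is an assumed value, not taken from the paper.
    """
    noisy = eim * rng.normal(1.0, std_mult, eim.shape)       # Eq. (7)
    noisy = noisy + rng.normal(0.0, std_add, eim.shape)      # Eq. (8)
    gains = rng.normal(1.0, std_gain, (eim.shape[0], 1))
    return noisy * gains                                     # Eq. (9), per row
```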

#### 2.3.6. Rotation of the Data

To specifically incorporate rotational invariance into the ANN, the voltage data were augmented at minimal computational cost, as follows. A shift of *n* columns (and, correspondingly, rows) along the EIM results in a rotation of the reconstructed image by 2*πn*/16. Thus, the target image must be rotated by that angle. To remove the rotational variance, we produced 15 additional shifted voltage sets for each training sample described previously.
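Since rotating the domain relabels both the injection and the measurement pairs, the corresponding EIM shift can be sketched as a diagonal roll (features move diagonally, as noted in Section 2.2); rotating the 64 × 64 target image by the matching angle is left to the caller.

```python
import numpy as np

def roll_eim(eim, n):
    """Shift an EIM by n electrode positions (rotation by 2*pi*n/16).

    Rotating the domain relabels injection and measurement pairs alike,
    so the EIM is rolled along rows and columns together. The target
    image must be rotated by 360 * n / 16 degrees separately.
    """
    return np.roll(eim, shift=(n, n), axis=(0, 1))
```

Rolling by 16 positions is a full rotation and returns the original EIM, which makes the invariance easy to sanity-check.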

#### 2.3.7. Alpha-Blending

With the given data set, there is still potential for obtaining entirely new training samples. In the field of image classification, there is a technique called *α*-blending [26–28]. It produces a new image from a linear combination of two other images, and an *α* ∈ [0, 1] factor weights these images. An *α* of 0.3 would mean that the resulting image is a combination of 30% of the first image and 70% of the second image. For EIT images, we can describe the technique as

$$\sigma_{\text{comb}} = \alpha \sigma_1 + (1 - \alpha) \sigma_2 \tag{10}$$

Since the injection current *I* is constant, Ohm's law implies that the admittances between the measurement electrodes combine linearly, and hence the voltages combine as

$$Y_{\text{comb}} = \alpha Y_1 + (1 - \alpha) Y_2 \tag{11}$$

$$\Leftrightarrow \frac{I}{U_{\text{comb}}} = \alpha \frac{I}{U_1} + (1 - \alpha) \frac{I}{U_2} \tag{12}$$

$$\Leftrightarrow \frac{1}{U_{\text{comb}}} = \alpha \frac{1}{U_1} + (1 - \alpha) \frac{1}{U_2} \tag{13}$$

$$\Leftrightarrow U_{\text{comb}} = \left(\frac{\alpha}{U_1} + \frac{1-\alpha}{U_2}\right)^{-1} \tag{14}$$

where *Y* denotes the admittance between the voltage measurement electrodes and *U* the corresponding voltage.
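The blending of Equations (10) and (14) can be sketched as follows; note that the harmonic combination applies to the measured (non-zero) voltages, not to the structural zeros of the EIM.

```python
import numpy as np

def alpha_blend(sigma1, sigma2, u1, u2, alpha):
    """Combine two training samples via alpha-blending (Eqs. (10)-(14)).

    Conductivities (and thus admittances) blend linearly; with constant
    injection current the voltages therefore combine harmonically.
    `u1` and `u2` must not contain the zero entries of the EIM.
    """
    sigma_comb = alpha * sigma1 + (1 - alpha) * sigma2          # Eq. (10)
    u_comb = 1.0 / (alpha / u1 + (1 - alpha) / u2)              # Eq. (14)
    return sigma_comb, u_comb
```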

#### 2.3.8. Conclusion on Training Dataset

All in all, the simulated training data were designed to be as realistic as possible, while avoiding major assumptions about the structure and content of the data that would bias the dataset (e.g., restricting conductivity values to the range expected in the testing data). Furthermore, the described augmentation techniques impose no bias on the simulated data.

#### *2.4. On the ANN Structure*

In the domain of classification, ANNs can often be separated into two parts. The first part, consisting of convolutional layers, extracts features, while the second part, realized through fully connected layers, processes these features into educated guesses about the class label. Two famous examples are AlexNet and VGG19 [29,30]. In this work, we modified this basic approach and tailored it specifically for use in EIT. The structure can be seen in Figure 3. When calculating the receptive field of the convolutional layers, it turns out to have a shape of 21 × 21, although the EIM only has a shape of 16 × 16. However, Luo et al. showed that the receptive field exhibits a Gaussian distribution [31], which means that features in the center are strongly recognized by the network, while features closer to the boundary are recognized less. To mitigate this problem, we made the receptive field of RF-REIM-NET larger than the EIM.

From Figure 3, it can also be seen that the shape of the input after the convolutions does not change. To achieve this, we used circular padding rather than the standard zero padding. This choice can be understood by the nature of adjacent–adjacent measurements. On the boundaries of the EIM, values from the other side are inserted, as the neighbor of the 16th electrode pair would also be the 1st electrode pair.

**Figure 3.** Illustration of the RF-REIM-NET structure. At the beginning, the 16 × 13 = 208 voltages are transformed into an EIM. From these, features are extracted with the help of convolutional layers. At the end, these features are processed by fully connected layers, which reconstruct the image for each injection electrode.

Our proposed RF-REIM-NET structure also comes without any pooling layers. In general, pooling layers tend to increase the efficiency of ANNs; however, this comes at the cost of breaking the location-invariance properties of the convolutional layers [32]. Another problem is that downsampling, when carried out by pooling, causes aliasing [33]. Thus, we did not use pooling and enlarged the receptive field of RF-REIM-NET instead.

Instead of batch normalization, we used layer normalization, which computes the normalization along the features of the layer's output rather than along the batch. We found that this works best for training.

The second part of RF-REIM-NET consists of fully connected layers, adapted such that the rotational invariance is considered. The input to the first fully connected layer has the shape 16 × 2048. This is purposeful, as the 16 represents the 16 different current injection pairs. Thus, instead of passing one vector of 16 × 2048 = 32,768 values, 16 passes of a 2048-value vector through the same layer are used. This saves time during training and considerably decreases the size of RF-REIM-NET. The second fully connected layer works the same way, while doubling the output size to obtain a 64 × 64 reconstruction image.
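The weight sharing across the 16 injection pairs can be sketched framework-free; in Keras, a `Dense` layer applied to a (16, 2048) tensor behaves the same way, since it operates on the last axis only. The output width of 4 below is a placeholder for illustration.

```python
import numpy as np

def shared_dense(features, w, b):
    """Apply one fully connected layer to each of the 16 injection rows.

    `features` has shape (16, 2048); the same weights (w, b) are reused
    for every row, so the layer needs 2048 x out parameters instead of
    (16 * 2048) x out. Broadcasting applies w to each row independently.
    """
    return features @ w + b
```

Reusing the weights across the 16 rows is exactly what makes the dense stage consistent with the rotational symmetry: a rolled EIM produces correspondingly rolled row outputs.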

#### *2.5. Training of the Neural Network*

Our training procedure was as follows. *α*-blending was used during training: two batches were randomly combined as described in Section 2.3.7. As regularization strategies, L2 weight regularization, dropout with a dropout rate of 0.1, and total variation (TV) regularization were used. The L2 weight regularization term was added to the loss of RF-REIM-NET and can be described as

$$L_{w^2} = \sum_{i=1}^{N} w_i^2,\tag{15}$$

where *N* is the total number of weights and *w*<sub>i</sub> is the *i*th weight. For details about dropout, see [34]. TV regularization is commonly used for image de-noising and de-blurring [35]. The TV regularization loss is computed with

$$L_{TV} = \sum_{i,j} |Y'_{i+1,j} - Y'_{i,j}| + |Y'_{i,j+1} - Y'_{i,j}|,\tag{16}$$

where *Y*′ is the output of RF-REIM-NET.

For the loss function we used the mean squared logarithmic error (MSLE). This error was chosen as our training data vary in orders of magnitude, and RF-REIM-NET should be able to predict the conductivities in the same way across the whole range. MSLE can be described as

$$L_{MSLE} = \frac{1}{N} \sum (\log(Y + 1) - \log(Y' + 1))^2,\tag{17}$$

where *Y* is the ground truth conductivity distribution. The total loss of RF-REIM-NET is described by

$$L_{total} = L_{w^2} + \lambda_1 \cdot L_{TV} + \lambda_2 \cdot L_{MSLE},\tag{18}$$

with *λ*<sub>1</sub> = 0.1 and *λ*<sub>2</sub> = 1 × 10<sup>−6</sup>.
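The total loss of Equation (18) can be sketched in NumPy for illustration; the weighting structure follows the equation as printed, and the argument names are ours.

```python
import numpy as np

def total_loss(y_true, y_pred, weights, lam_tv=0.1, lam_msle=1e-6):
    """Total training loss of Equation (18).

    Combines L2 weight regularization (Eq. (15)), total variation of the
    predicted image (Eq. (16)) and the MSLE data term (Eq. (17)), with
    the weightings lambda_1 and lambda_2 given in the text.
    """
    l_w2 = sum(np.sum(w**2) for w in weights)
    l_tv = (np.sum(np.abs(np.diff(y_pred, axis=0)))
            + np.sum(np.abs(np.diff(y_pred, axis=1))))
    l_msle = np.mean((np.log(y_true + 1) - np.log(y_pred + 1))**2)
    return l_w2 + lam_tv * l_tv + lam_msle * l_msle
```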

RF-REIM-NET was then trained with the help of TensorFlow [36] and the Adam optimizer [37]. Additionally, we used a learning rate decay: whenever the loss reached a plateau, the learning rate was reduced by 70%.

#### *2.6. Evaluating RF-REIM-NET*

We compared RF-REIM-NET to the standard Gauss–Newton (GN) reconstruction for absolute EIT. To compare the two algorithms quantitatively, we used a modified version of the GREIT figures of merit [38]. Some modifications were necessary, as the original GREIT figures of merit only allow the evaluation of difference images.

First, the calculation of the evaluation mask needs to be modified. The median value of the reconstructed image is subtracted from it,

$$\sigma_{\text{eval}} = \sigma - \tilde{\sigma},\tag{19}$$

where *σ̃* denotes the median value of *σ*. The median was chosen, as it is more robust to outliers in the data. When inserting only one target for evaluation, an ideal reconstruction would have just two values: the background and the target. If we subtracted the *mean* instead, the background would become slightly negative; subtracting the median prevents this. The mask is then composed of all pixels whose value exceeds 50% of the extremal value; whether the minimum or the maximum is used depends on the value of the target with respect to the background. In our evaluation case, the target is less conductive than the background. Thus, our mask is defined as

$$m = \begin{cases} 1 & \text{if } \sigma_{\text{eval}} < \frac{1}{2} \cdot \min(\sigma_{\text{eval}})\\ 0 & \text{else} \end{cases},\tag{20}$$

where *m* is the evaluation mask. We denote all pixels inside the mask as *σ̂*eval, while all pixels outside the mask are denoted as ∼*σ̂*eval.

2.6.1. Amplitude Response (AR)

The AR is now defined as

$$AR = \sum \hat{\sigma}_{\text{eval}}.\tag{21}$$

The *std* of the AR should be low.

2.6.2. Position Error (PE)

The PE is defined as

$$PE = \sqrt{(O_x(\hat{\sigma}_{\text{eval}}) - t_x)^2 + (O_y(\hat{\sigma}_{\text{eval}}) - t_y)^2},\tag{22}$$

where *O*<sub>x</sub> denotes the x-component of the center of gravity of *σ̂*eval, *O*<sub>y</sub> the y-component accordingly, *t*<sub>x</sub> is the ground truth x-position and *t*<sub>y</sub> the ground truth y-position. The *mean* and the *std* of the PE should be low.

2.6.3. Ringing (RNG)

The RNG was defined as the *std* of all pixels outside the mask *m*. Formally, this is written as

$$RNG = \text{std}(\sim\hat{\sigma}_{\text{eval}}).\tag{23}$$

The *mean* and the *std* of the RNG should be low.
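The modified figures of merit can be sketched end-to-end as follows, for a target less conductive than the background. Weighting the center of gravity by the target amplitude is an assumption, as the text does not specify the weighting.

```python
import numpy as np

def figures_of_merit(sigma, t_xy):
    """Modified GREIT figures of merit for one absolute reconstruction.

    Median-subtract the image (Eq. (19)), build the mask for a
    non-conductive target (Eq. (20)), then compute amplitude response,
    position error and ringing (Eqs. (21)-(23)).
    """
    sigma_eval = sigma - np.median(sigma)
    mask = sigma_eval < 0.5 * sigma_eval.min()       # Eq. (20)
    ar = np.sum(sigma_eval[mask])                    # Eq. (21)
    ys, xs = np.nonzero(mask)
    w = -sigma_eval[mask]                            # assumed amplitude weighting
    cx, cy = np.average(xs, weights=w), np.average(ys, weights=w)
    pe = np.hypot(cx - t_xy[0], cy - t_xy[1])        # Eq. (22)
    ringing = np.std(sigma_eval[~mask])              # Eq. (23)
    return ar, pe, ringing
```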

#### *2.7. Evaluation Data*

To validate our RF-REIM-NET, we used three different types of input and analyzed the output according to the three introduced figures of merit (AR, PE and RNG).

#### 2.7.1. FEM Data

First, we used FEM data that the network had not yet seen. Multiple enclosures were simulated, and the enclosure was positioned such that it moved from the domain center to the outside. RF-REIM-NET is compared with GN. For the hyperparameter selection of GN, we first used the L-curve criterion, but did not obtain usable results. We are convinced that this is due to the reference-free reconstruction, which destabilizes the EIT problem compared with differential EIT. Thus, we performed multiple sweeps to narrow down the optimal hyperparameter iteratively.

#### 2.7.2. Noise Performance on FEM Data

Second, we compared the noise performance on an FEM data sample. For that, an enclosure near the boundary was simulated and the signal-to-noise ratio was decreased from 200 dB to 5 dB. For evaluation, GN reconstructions are given for comparison.

#### 2.7.3. Tank Data

Third, we used data from a circular EIT tank. The tank had a diameter of 28 cm and had 16 electrodes attached equidistantly around the surface. The tank was filled with 0.9% saline solution and the target was a pickle with a circumference of 4.5 cm. The pickle was moved from the center in the direction of one electrode in nine steps, where the last position was 9 cm in front of an electrode. The measurements were performed with the EIT evaluation kit 2 (*Draeger EEK2*, Draeger Medical GmbH).

#### 2.7.4. Experimental Data

Finally, we give an impression of the performance of RF-REIM-NET on real-world data. The data were taken from an experimental pig trial using a clinical EIT device (*Draeger Pulmo Vista 500*, Draeger Medical GmbH). For the trial, eight pigs were anesthetized and tracheotomized in supine position [39]. During the trial, CT measurements from the pigs were taken. In our data sample, we used two measurement points from a single pig, which was healthy in the time span we chose. The length of the data sample was around 30 s.

As there is no ground truth regarding the conductivity, we show two pictures. The first picture shows the *mean* conductivity over an entire breathing cycle, while the second shows the *std* over an entire breathing cycle. A CT image is shown as the background to give a sense of the reconstruction quality. This is provided for transparency, as tank data are measured under more ideal conditions, which are closer to the training data. The absolute Gauss–Newton algorithm did not yield any meaningful results after a thorough hyperparameter sweep and was thus not given as a reference.
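The two pictures can be obtained directly from the frame sequence; a minimal sketch, assuming the reconstructions are stacked along the time axis:

```python
import numpy as np

def cycle_statistics(frames):
    """frames: array of shape (T, H, W), one reconstruction per time step.
    Returns the per-pixel mean and std over the whole breathing cycle."""
    frames = np.asarray(frames)
    return frames.mean(axis=0), frames.std(axis=0)
```

The mean image highlights structures that are constantly present (such as a fixed artifact), while the std image highlights regions whose conductivity changes over the cycle, such as the ventilated lungs.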

#### **3. Results**

At first, we give the results of simulated FEM data, outside the distribution of the training data used to train RF-REIM-NET. The results for the figures of merit are given in Table 1. The *std* of the AR is larger for the GN algorithm; however, this *std* is one order of magnitude lower than the *mean*. Visually, this is confirmed by the reconstructions in Figure 4. The AR for both reconstructions stays roughly the same. For the PE, GN has a lower *mean* and *std*: the PE *mean* is half that of RF-REIM-NET, while the *std* is a quarter. The RNG for GN is also lower compared with RF-REIM-NET. This, again, can be seen in the images: the RNG for RF-REIM-NET is larger due to the higher differences in magnitude outside the mask *m*.

**Figure 4.** Illustration of RF-REIM-NET (**middle**) and GN (**bottom**) reconstructions of FEM data that mimic the positions of the tank pickle data. The ground truth target positions are given at the **top**.

**Table 1.** Figures of merit for the simulated FEM data. Given is the *mean* ± *std*.


To better see the difference between the original ground truth image and the reconstruction, we present in Figure 5 the ground truth, the reconstruction of RF-REIM-NET and the MSLE error. It can be seen that the error is for the most part on the edges of the enclosures. In the second column, we can see that the middle target is barely visible in the reconstruction.

**Figure 5.** Illustration of RF-REIM-NET reconstructions on the validation dataset. At the top, the original targets are given. In the middle, the reconstructions of RF-REIM-NET are presented. At the bottom, the MSLE error between the original image and the reconstruction is presented.

#### *3.1. Noise Comparison on Simulated Data*

Here, we compare the noise performance of RF-REIM-NET with that of GN. We positioned a target near the boundary of the FEM domain and simulated the voltages. The results can be seen in Figure 6. The reconstructions of RF-REIM-NET are more robust to noise than those of GN. At 100 dB, the reconstruction of GN is barely visible, while the RF-REIM-NET reconstruction is still clearly visible. At 15 dB, the reconstruction of RF-REIM-NET begins to degrade and becomes less visible as well.

**Figure 6.** Evaluation of the noise performance of RF-REIM-NET (**top**) compared with GN (**bottom**). The columns represent different noise levels added with the EIDORS function add\_noise. The target position is equal to T6 in Figure 4.

#### *3.2. Tank Results*

Next, we provide the results from the tank measurement. As shown in Figure 7, the pickle was moved from the center to an electrode. The reconstruction from the Gauss–Newton algorithm shows a more diffuse boundary, while RF-REIM-NET has a clearer boundary. The background of the Gauss–Newton reconstruction shows many small disturbances, while the background of RF-REIM-NET has fewer disturbances: one at the top and another surrounding the conductivity enclosure.

**Figure 7.** Illustration of RF-REIM-NET and GN reconstructions on the pickle measurements of the tank dataset. The empty measurement was not used in both reconstructions, and is given only for a better view of the reconstruction artifacts. Pickle 1 is the center pickle, while Pickle 7 is the outer pickle.

These observations are also reflected in the figures of merit given in Table 2. The *mean* and *std* of the AR from RF-REIM-NET are larger than those of Gauss–Newton. This can also be visually confirmed in Figure 7. The PE and its *std*, however, are lower for RF-REIM-NET: the *mean* of the PE is ∼35% lower than that of Gauss–Newton, while its *std* is ∼50% lower. The *mean* RNG of RF-REIM-NET is 20% lower compared with Gauss–Newton. However, the inverse is true for the *std*: the RNG *std* is 20% higher compared with Gauss–Newton. Note, however, that this *std* is only about 1/14th of the *mean*.


**Table 2.** Figures of merit for the tank experiment. The left number in each cell is the *mean* of the metric, while the right number is its *std*.

#### *3.3. Experimental Data*

For the experimental data, only a qualitative analysis is given. The results are shown in Figure 8. On the left, a CT measurement is given to better judge the results. In the middle, the mean reconstruction over 20 s of mechanical ventilation is given. At the top, there is an artifact in the reconstruction. The two lungs are visible, but they are smaller than their original size. The heart, on the other hand, located between the lungs and the artifact, matches the position given in the CT image. The standard deviation picture on the right confirms these findings. The artifact stays fixed in place, as the standard deviation is near zero at its position. At the position of the heart, a high standard deviation is visible, and the same holds true for the lungs. The shape of the standard deviation picture for the lungs better resembles the general shape of the lung.

**Figure 8.** Comparison of the RF-REIM-NET reconstruction with CT scans. On the very left, the CT scan of the pig thorax is given. In the middle, the mean over 20 s of mechanical ventilation is given. On the very right the *std* is given.

#### *3.4. Discussion*

In the FEM setting, GN outperformed RF-REIM-NET in the metrics of PE and RNG, both in the mean and the *std*. The mean and *std* of the AR are a little higher using GN; however, the values only differ slightly. Thus, we would argue that GN outperformed RF-REIM-NET in the FEM setting. This is probably because the FEM setting is barely disturbed by factors such as hardware imperfections or the imperfect conductivity of the target.

In the tank setting, RF-REIM-NET has a lower mean PE and mean RNG. The PE also has a *std* that is roughly half that of the GN PE *std*. This can be visually observed in Figure 7. However, at the same time, the mean AR and its *std* are higher. This, again, can be seen in Figure 7: in the samples "pickle 2", "pickle 3" and "pickle 4", the reconstruction is clearly larger than in the other samples. In contrast, GN has a less clear object boundary. Thus, we argue that RF-REIM-NET performs better on the tank dataset. We also showed that RF-REIM-NET is able to produce reconstructions from experimental data, even though the ANN does not need any reference voltage, as can be seen in Figure 8. To the best of our knowledge, this is the first work to evaluate the performance of an ANN trained solely on simulated data for EIT reconstruction on real-world experimental data.

While the heart was reconstructed accurately, the lungs were too small, which at this point limits the usefulness for clinical diagnostics. Another shortcoming is the artifact at the top, which constantly stays in that position. We assume that the artifact is due to electrode position errors: at the top of the picture, the EIT belt is closed, so the electrodes are spaced further apart at that place.

#### **4. Conclusions and Outlook**

We presented an ANN (RF-REIM-NET) that is able to reconstruct conductivity enclosures without using a reference voltage. RF-REIM-NET is inspired by ANNs that are commonly used for classification: the first part of these ANNs extracts features, while the second part is responsible for the evaluation. Compared with GN on FEM and tank data, our approach tends to give clearer reconstructions. However, the reconstructed objects tend to be slightly larger than in reality. We also showed the performance on real-world subject data: whereas GN did not produce any meaningful reconstructions from the experimental dataset, RF-REIM-NET was able to do so. For future work, the network needs to be made more robust against electrode position errors and domain shape influences, which may be the biggest factors affecting the performance on experimental data. Thus, altering the electrode positions in the training data might improve the overall reconstructions. Second, the boundary shape should be varied more drastically, as this might further increase the performance of RF-REIM-NET.

**Author Contributions:** Conceptualization, J.R., B.E. and C.N.; methodology, J.R. and B.E.; software, J.R. and B.E.; validation, J.R., B.H. and C.N.; formal analysis, J.R. and B.E.; investigation, J.R. and B.E.; resources, B.H., T.M., C.P. and S.L.; data curation, B.H. and T.M.; writing—original draft preparation, J.R.; writing—review and editing, J.R., B.E. and C.N.; visualization, B.E.; supervision, C.N. and S.L.; project administration, S.L.; funding acquisition, C.P. and S.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** The authors gratefully acknowledge financial support provided by the project InDiThera (13GW0361E) through the Federal Ministry of Education and Research. This research was supported by a grant (PU 219/2-1) from the German Research Foundation (DFG).

**Institutional Review Board Statement:** The study from which the real-world data were taken was approved by the regional ethical committee (No. 5.8.18-15771/2017, Uppsala, Sweden).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Generalization Challenges in Drug-Resistant Tuberculosis Detection from Chest X-rays**

**Manohar Karki 1,\*, Karthik Kantipudi 2,\*, Feng Yang 1, Hang Yu 1, Yi Xiang J. Wang 1,3, Ziv Yaniv <sup>2</sup> and Stefan Jaeger 1,\***


**Abstract:** Classification of drug-resistant tuberculosis (DR-TB) and drug-sensitive tuberculosis (DS-TB) from chest radiographs remains an open problem. In our previous work, cross-validation on publicly available chest X-ray (CXR) data, combined with image augmentation and the addition of synthetically generated and publicly available images, achieved a performance of 85% AUC with a deep convolutional neural network (CNN). However, when we evaluated the CNN model trained to classify DR-TB and DS-TB on unseen data, significant performance degradation was observed (65% AUC). Hence, in this paper, we investigate the generalizability of our models on images from a held-out country's dataset. We explore the extent of the problem and the possible reasons behind the lack of good generalization. A comparison of radiologist-annotated lesion locations in the lung and the trained model's localization of areas of interest, using GradCAM, did not show much overlap. Using the same network architecture, a multi-country classifier was able to identify the country of origin of the X-ray with high accuracy (86%), suggesting that image acquisition differences and the distribution of non-pathological and non-anatomical aspects of the images are affecting the generalization and localization of the drug resistance classification model as well. Even when CXR images were severely corrupted, the performance on the validation set was still better than 60% AUC: the model overfitted to the data from countries in the cross-validation set but did not generalize to the held-out country. Finally, we applied a multi-task-based approach that uses prior TB lesion location information to guide the attention of the classifier network, improving the generalization performance on the held-out set from another country to 68% AUC.

**Keywords:** Tuberculosis (TB); drug resistance; deep learning; chest X-rays; generalization; localization

#### **1. Introduction**

According to the 2020 World Health Organization (WHO) report [1], it is estimated that in 2019 about 10 million people fell ill with Tuberculosis (TB) and about 1.4 million died from the disease. Based on the same report, it is estimated that in 2019 about 0.5 million individuals were infected with rifampicin-resistant TB out of which about 400,000 were multidrug-resistant.

**Citation:** Karki, M.; Kantipudi, K.; Yang, F.; Yu, H.; Wang, Y.X.J.; Yaniv, Z.; Jaeger, S. Generalization Challenges in Drug-Resistant Tuberculosis Detection from Chest X-rays. *Diagnostics* **2022**, *12*, 188. https://doi.org/10.3390/diagnostics12010188

Academic Editors: Philippe A. Grenier, Henk A. Marquering, Sameer Antani and Sivaramakrishnan Rajaraman

Received: 1 December 2021; Accepted: 5 January 2022; Published: 13 January 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Drug-resistant TB (DR-TB) is a growing public health concern, requiring longer and more complex treatment than drug-sensitive TB (DS-TB), in addition to incurring higher financial costs. Treatment for DR-TB requires a course of second-line drugs for at least 9 months and up to 20 months, supported by counselling and monitoring for adverse events. In comparison, treatment of DS-TB only lasts between 6 and 9 months. Early diagnosis of DR-TB is crucial for selecting appropriate, patient-specific treatment regimens. Thus, improving early decision making has the potential to increase favorable patient outcomes, combat the spread of infection and reduce the overall financial costs associated with the disease.

Currently, the diagnostic methods for identifying DR-TB infections require culture and drug susceptibility testing. These procedures are not feasible globally, especially for countries unable to scale up their testing capacities. An automated, low cost, computational approach that utilizes readily available resources such as medical images and other clinical information is thus desirable.

In the context of TB diagnosis, automated deep learning based systems which only utilize Chest X-rays (CXRs) have seen significant success, with multiple commercial offerings available [2,3]. In one evaluation study, these systems classified CXRs as TB/not-TB with an Area Under the Curve (AUC) of above 0.9 [2]. In another study, they outperformed radiologists, with two of the systems meeting the WHO's target product profile for triage tests [3].

Currently, discrimination between DR-TB and DS-TB using readily available clinical images and possibly additional clinical side information is still an open problem. In this work, we used both deep and classical machine learning algorithms to classify drug-resistant (DR) and drug-sensitive (DS) tuberculosis using chest X-ray images, radiological features, and clinical patient information. Specifically, we have:


#### **2. Previous Work**

Attempts to utilize images and clinical data to distinguish between DR-TB and DS-TB have been previously described in multiple publications. These works are either based on the utilization of radiological findings identified in the image by a clinician or via fully automated methods which receive as input the image and potentially other available clinical data and output the likelihood for each of the two classes.

Several studies have shown that radiological findings based on a radiologist reading of a CT or CXR have the potential to differentiate between the two classes. A literature review from 2018 [4] concluded that the presence of thick-walled multiple cavities in the images is a useful predictor for DR-TB, with good specificity but low sensitivity. Another study [5] compared 183 DR-TB cases and 183 DS-TB cases from a single hospital. This study concluded that there were substantial differences in findings between the two classes in terms of lesion size and morphology. A slightly larger study [6], which compared 468 DR-TB cases and 223 DS-TB cases, concluded that a combination of the number and size of consolidated nodules is a good predictor for DR-TB. Another, small study [7], utilized data from 144 patients and found that the presence of multiple cavities is a good predictor for DR-TB. A much larger study [8], compared 516 DR-TB and 1030 DS-TB cases, obtaining an AUC of 0.83 using a regression model. This study observed that the co-existence of multiple findings (multiple cavities, thick-walled cavities, disseminated lesions along the bronchi, whole-lung involvement) was indicative of DR-TB. Finally, more recent work [9] compared 1455 DR-TB and 782 DS-TB cases, using two clinical features and 23 types of radiological findings. A support vector machine was used to distinguish between DR-TB and DS-TB with an AUC of 0.78. It should be noted that reliance on a radiologist reading is a significant limitation. The lack of consensus on radiological findings for drug resistance further hinders the clinical usefulness of these approaches. Because of these reasons, fully automated solutions, described next, are more desirable.

Several fully automated solutions were presented as part of the ImageCLEF 2017 and 2018 evaluation challenge forums [10]. These challenges included a subtask, differentiating between DR-TB and DS-TB using thoracic Computed Tomography (CT) images. This classification task included 259 training images and 236 test images with about half of the cases DR-TB and half DS-TB. Proposed solutions included Gentili et al. [11] who reformatted the CT images to the coronal plane and used a pre-trained ResNet50 Convolutional Neural Network (CNN). For the same challenge, Ishay et al. [12] used an ensemble of 3D CNNs and Cid et al. [13] used a 3D texture-based graph model and support vector machines (SVM). Allaouzi et al. [14] replaced the softmax function of a 3D CNN architecture with an SVM to tackle this classification task. All entries had limited success, resulting in AUCs of about 0.6. After two editions, the organizers removed the subtask from the competition with the conclusion that "the MDR subtask was not possible to solve based only on the image". While these challenges did not yield the desired results, the results obtained using radiologist readings are more favorable, suggesting that the sub-optimal performance may be due to the small number of images available for training. It should be noted that increasing the number of CTs for this task is not trivial as the use of CT imaging in DS-TB cases is uncommon, with the standard imaging modality being CXR. The rare use of CT imaging in standard practice, and the consequential lack of data to analyze, limits the usage of CT images to train a model to distinguish between DR and DS-TB.

On CXR images, [15] utilized a customized CNN architecture to classify DR-TB and DS-TB from 2973 images from the TB portals dataset. They achieved a classification performance of 66%, which improved to 67% when follow-up images were also included. Our group has previously proposed fully automated methods utilizing CXRs, as described in [16,17]. In [16], we utilized 135 CXRs from a single source. Using a shallow neural network, we obtained an AUC of 0.66. In [17], we utilized a much larger dataset, 3642 images from multiple sources. Using a deep neural network, InceptionV3 pre-trained on ImageNet, we obtained an AUC of 0.85. This result is the current state-of-the-art performance achieved on the TB portals data and a significant improvement over other approaches. However, even though a 10-fold cross-validation was performed, the capability of the trained network to classify chest X-rays from unseen domains was not evaluated. In fact, the common weakness of all of these automated methods is that they have not been evaluated for generalization by separating the data by source. As different medical imaging technologies and devices produce images of different standards and quality, it is important for our models to be robust to these variations.

An underlying assumption of most machine learning algorithms is that the population, test, and training data are independent and identically distributed. If the two distributions are different, then the learned parameters will not yield a good performance. That is, the model will not generalize well to unseen data. While CXR imaging is a low-cost modality that is in widespread use, the variations in the standards of the acquired images are significant [18,19], bringing into question the utility of any proposed method which is not evaluated on its generalization capability. More specifically, Harris et al. [20] found that 80% of published works on using CXR for TB diagnosis either used the same databases to train and test their software, or did not comment on the databases they used for testing their models. Sathitratanacheewin et al. [21] also observed that a model for CXR-based TB diagnosis performed well, with 0.85 AUC, when tested on images within their intramural dataset, with significant performance deterioration when tested on extramural images, yielding an AUC of 0.7. When domain shift, the change in image distribution between the training and testing sets, is inevitable, it has been shown that these effects can be ameliorated if training is formulated using a multi-task approach [22].

In addition to the generalization issues due to domain shift, the generalization of deep learning algorithms can also deteriorate if they learn irrelevant features. This is a specific shortcoming of deep learning algorithms as they do not preclude the algorithm from learning features present in the training set that are arbitrarily correlated with the disease, yet are completely irrelevant. These can stem from characteristics of the imaging devices or clinical practices such as patient positioning [23–25] used at the specific locations. If a model implicitly learns such features it will not generalize well when presented with data obtained on different imaging devices or using different clinical workflows, both of which are irrelevant to disease diagnosis.

In this work, we explore various strategies to improve the generalization of models for classification of CXRs as DR-TB or DS-TB using various normalization and attention mechanisms, both explicit (segmentation based) and implicit (multi-task based).

#### **3. Data**

#### *3.1. TB Portals Data*

The primary data source used in this work is from the NIAID TB Portals program (https://tbportals.niaid.nih.gov (accessed on 10 January 2022)), with a public data release date of October 2020. The dataset contains clinical data and CXR images that are anonymized and made available for public use [26]. Each patient record is manually annotated with clinical information and radiological findings based on the associated CXR image. For this work, data from 1756 patients from ten countries were used. Table 1 shows the data distribution based on country of origin and gender. It should be noted that the TB portals data were collected with a primary focus on acquisition of drug-resistant cases and cases that reflect the specific research interest at the country of origin. As a result, the data are imbalanced in terms of the ratio between drug-resistant and drug-sensitive cases, which does not necessarily reflect the prevalence of TB from either class in the contributing country. Interestingly, we also see that the data are not balanced in terms of gender with about double the number of males to females. This does reflect known differences in TB prevalence in females versus males and has been linked to both societal and biological differences between the sexes [27–29].

**Table 1.** Patient distribution from different countries and genders for the chest X-ray data used in this work.

| Country | Drug-Sensitive | Drug-Resistant | Male | Female |
|---|---|---|---|---|
| Belarus | 118 | 344 | 294 | 168 |
| Georgia | 399 | 236 | 472 | 163 |
| Romania | 15 | 114 | 91 | 38 |
| Azerbaijan | 0 | 32 | 24 | 8 |
| India | 197 | 21 | 165 | 53 |
| Moldova | 12 | 32 | 37 | 7 |
| Kyrgyzstan | 0 | 18 | 11 | 7 |
| Ukraine | 8 | 25 | 25 | 8 |
| Kazakhstan | 15 | 53 | 36 | 32 |
| South Africa | 114 | 3 | 72 | 45 |
| Total | 878 | 878 | 1227 | 529 |

#### *3.2. Clinical Data*

The clinical data contain an extensive set of features associated with each patient. This includes demographic data, radiologists' findings for each CXR, different diagnostic tests and treatment information. Additionally, it includes demographic features such as age of onset, gender, patient type (New, Relapse or Failure), body mass index, country of origin, education, employment, number of daily contacts, number of children, prescription drug usage, laboratory tests, treatment period, treatment status and outcome. The radiologists' findings include chest radiography patterns such as nodules, cavities, collapses and infiltrates, and their location in the lungs. Due to financial constraints and the size of the TB portals CXR dataset, radiological findings are obtained from a single experienced radiologist reading per image. The whole dataset was annotated by multiple radiologists from the countries contributing data to the program; consequently, the radiological findings are not biased towards a single radiologist. Table 2 lists all finding types used by the radiologists to annotate the images. These are abnormalities commonly associated with TB. In addition to the type of abnormality, the findings are further differentiated based on their size (small, medium, large) and number of occurrences (single, multiple).

**Table 2.** Twenty features derived from the presence of abnormalities that are localized to different sextants.


#### *3.3. Chest X-ray Images*

All TB Portals CXRs used in this work are from a frontal, AP or PA, view and have varied resolutions (from 206 × 115 to 4453 × 3719 pixels). The intensity range found in the images also varies: 1177 images have a low dynamic range, with intensities in 0–255, and 579 images have a high dynamic range, with intensities in 0–65,536.
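Before feeding such heterogeneous images to a network, a common first step is to map both 8-bit and 16-bit images onto one scale; a simple min-max sketch (one possible choice, not necessarily the authors' exact preprocessing):

```python
import numpy as np

def normalize_intensity(img):
    """Map a CXR with arbitrary dynamic range (0-255 or 0-65,536)
    to [0, 1] by min-max scaling."""
    img = img.astype(np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:                     # constant image: avoid division by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)
```

After this step, images from both dynamic ranges share a common [0, 1] scale regardless of their original bit depth.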

It should be noted that the drug susceptibility label associated with each image is obtained via drug susceptibility testing and is not derived from the image. Additionally, the usage of radiological findings for predicting drug susceptibility has shown moderate success. Thus, the question of whether good performance for predicting drug susceptibility from CXRs from unseen sources is possible remains open.

In addition to the CXRs from the TB Portals program, we use a publicly available TB CXR dataset collected from a hospital in China [30] (download from http://openi.nlm.nih.gov/imgs/collections/ChinaSet_AllFiles.zip (accessed on 10 January 2022)). This dataset contains 662 frontal chest X-rays, of which 326 are labeled as non-TB cases and 336 are labeled as TB. There are two sets of annotations where each abnormal TB image has been manually annotated by two radiologists. Figure 1 shows one such segmentation.

#### Sextant Division

To further differentiate between radiological findings, we associate them with their spatial location in the lungs. To do so, we define lung sextants by dividing each lung into three equal sections from apex to base, as shown in Figure 2. The division of the sextants can be subjective for findings close to sextant borders, when the division boundaries may not be strictly adhered to by the annotating radiologist. In this work, we say a sextant is affected by TB if at least one of the abnormalities listed in Table 2 is present in the sextant.
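The sextant definition above can be sketched as follows, assuming binary masks for the left and right lungs (the function name and the per-lung row extent are our assumptions):

```python
import numpy as np

def sextant_masks(lung_mask_left, lung_mask_right):
    """Split each lung mask into three vertical thirds (apex to base),
    yielding six sextant masks."""
    masks = []
    for lung in (lung_mask_left, lung_mask_right):
        rows = np.nonzero(lung.any(axis=1))[0]      # rows containing lung pixels
        top, bottom = rows[0], rows[-1] + 1
        bounds = np.linspace(top, bottom, 4).astype(int)  # 3 equal sections
        for a, b in zip(bounds[:-1], bounds[1:]):
            sub = np.zeros_like(lung)
            sub[a:b] = lung[a:b]
            masks.append(sub)
    return masks
```

A sextant would then be marked as affected by TB if at least one annotated abnormality falls inside its mask.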

**Figure 1.** Example of a lung segmentation for a nodule and a cavity.

**Figure 2.** Definition of six lung sextants. Abnormal annotations are assigned to one or more of these sextant divisions.

#### *3.4. Dataset Definitions*

For our experiments, we only select the first image taken in the clinical process for a patient; hence, the number of images is equal to the number of patients. All of our drug resistance classification experiments feature an equal number of DR and DS patients in the training set. For data balancing, we use a conservative approach, excluding images from the majority class. The subsets used for training and evaluation are listed below:


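The conservative balancing described above, which randomly drops images from the majority class, can be sketched as (function name is our own):

```python
import numpy as np

def balance_classes(labels, rng=None):
    """Return indices of a class-balanced subset by randomly dropping
    images from the majority class (conservative undersampling)."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    idx_a = np.nonzero(labels == 0)[0]          # e.g. DS patients
    idx_b = np.nonzero(labels == 1)[0]          # e.g. DR patients
    n = min(len(idx_a), len(idx_b))             # size of the minority class
    keep = np.concatenate([rng.choice(idx_a, n, replace=False),
                           rng.choice(idx_b, n, replace=False)])
    return np.sort(keep)
```

Applied to the training set, this guarantees an equal number of DR and DS patients at the cost of discarding part of the majority class.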
#### *3.5. Data Standardization*

Lung segmentation is used to explicitly address the challenges associated with generalization due to domain shift and the possible existence of confounding factors due to class-correlated yet irrelevant features. Segmentation enables us to limit the input images for the binary DR/DS classifier so that they only contain regions relevant for classification of pulmonary tuberculosis, meaning the lungs. Additionally, the lungs are scaled to a uniform size and position within the image, removing potential confounding factors such as lung size and patient placement that are often correlated with the clinical sites and thus with the local prevalence of TB types. Once the lung regions are segmented, the image is cropped to the lung bounding box, Figure 3c, and all information outside the lung is removed, Figure 3d.
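The cropping and masking steps (Figure 3c,d) can be sketched as follows, assuming a binary lung mask produced by the segmentation network:

```python
import numpy as np

def crop_and_mask(image, lung_mask):
    """Crop the CXR to the lung bounding box and zero out everything
    outside the segmented lungs."""
    ys, xs = np.nonzero(lung_mask)
    y0, y1 = ys.min(), ys.max() + 1             # bounding box rows
    x0, x1 = xs.min(), xs.max() + 1             # bounding box columns
    cropped = image[y0:y1, x0:x1].copy()
    cropped[~lung_mask[y0:y1, x0:x1]] = 0       # remove non-lung information
    return cropped
```

The result contains only lung pixels at a uniform position within the image, removing potential confounders such as patient placement.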

For lung segmentation we initially utilized a publicly available U-Net model which was trained on two datasets with a total of 385 images and corresponding manual lung segmentations [30,31] (https://github.com/imlab-uiip/lung-segmentation-2d (accessed on 10 January 2022)). Unfortunately, this model failed frequently when applied to the TB portals images. Often, one or both sides of the lung were not segmented.

Furthermore, segmentation using this model failed on pathological lung regions in a significant number of images, which is detrimental for disease analysis.

To address these performance limitations a U-Net based [32] segmentation model with a ResNet50 backbone [33] was trained using the publicly available v7 COVID-19 X-ray dataset, which contains 6500 images and corresponding manual lung segmentations (https://github.com/v7labs/covid-19-xray-dataset (accessed on 10 January 2022)).

As the TB portals dataset does not provide ground truth lung segmentations, results were visually evaluated as either failure or success. The segmentation failure rates of the new model and the previous model were 0.06% and 3%, respectively. In addition, the old model segmented one of the lungs with less than 10% of the corresponding ground truth pixels in 0.8% of the cases; no such cases were observed for the new model. Figure 4 illustrates the difference between the two models applied to the same set of 72 images.

**Figure 3.** Original CXR (**a**) is fed to the U-Net, which outputs a binary lung mask (**b**), with which the original CXR is cropped (**c**) and the lungs are segmented (**d**) in the cropped bounding box.

**Figure 4.** Cropped images based on the lung segmentation results obtained using a publicly available UNet model trained on the combined JSRT and Montgomery datasets (**a**), and using a customized UNet model trained on the v7 COVID-19 X-ray dataset (**b**). The performance of the customized model is clearly better. Note that if the segmentation fails, the entire image is used.

#### **4. Drug Resistance Classification**

For classifying between drug-resistant and drug-sensitive TB, we primarily use the chest X-rays, but also utilize text data to assist with the classification and to compare against the performance when using just the images. Classic machine learning algorithms and CNNs with pretrained weights were used on the clinical text data and chest X-ray images, respectively. Figure 5 shows the setup of our classification network, where the preprocessed chest X-ray image is the input and the prediction is either drug-resistant (DR) or drug-sensitive (DS).

As the focus of this work is to evaluate, understand and propose solutions to the issue of generalization to unseen data, we describe in the following subsections: (a) the need for domain adaptation for a network to generalize to unseen data, (b) the use of radiomics features derived from chest X-rays, (c) multi-task learning as a means to provide implicit attention to the main task of DR/DS classification, (d) classifying per-sextant abnormality, and (e) segmenting abnormal regions.

**Figure 5.** Standard CNN architecture for drug resistance classification. The Input (X) is a preprocessed X-ray image with segmented lungs. The Output (Y) is one of two classes, DR-TB or DS-TB. For this work, we use the ResNet18 [33] architecture as the backbone for all of our experiments.

#### *4.1. Domain Adaptation*

The distribution on which a trained model is tested can often differ significantly from the distribution it was trained on, and there is no guarantee that a trained model will be robust to data it has not seen before. In medical imaging, different acquisition standards [34], equipment, and even personnel can produce vastly different-looking images. Even after acquisition, differences in processing and storage can further alter the images. A human may find these variations easy to overcome, but a machine learning model needs to be trained to understand them: either smart features and algorithms need to be employed, or a large and diverse set of data is required to train such a model.

Evaluating models on unseen data from different domains is the logical way to assess this robustness. In addition, interpreting the model's decisions can be valuable for understanding a prediction. For drug-resistant TB classification, localizing the prediction decisions is worthwhile, as tuberculosis itself is frequently observed in certain regions of the lung.

Furthermore, it is worth exploring how easily images from different domains can be discriminated: easily distinguishable domains in the input, coupled with an imbalanced dataset, can readily result in failure to generalize.

Transfer learning enables a model to adapt to a new domain, but is less desirable than a fixed model that does not require additional training per domain. Starting from pretrained weights allows the high-level features to be consistent and not overly dependent on the domain of the training images. This explains why networks initialized with pretrained weights consistently outperform randomly initialized networks [17].

#### *4.2. Radiomics Features*

Explicit, engineered features can serve as a countermeasure to prevent a network from learning correlated, yet irrelevant, features within a dataset. Radiomic features have been used to extract patterns that may have been missed by radiologists in order to identify abnormalities present in medical images [35].

The features used in this paper include 2D shape-based features (e.g., axis lengths), first-order statistics (e.g., skewness), gray-level co-occurrence matrix (GLCM), gray-level dependence matrix (GLDM), gray-level run length matrix (GLRLM), gray-level size zone matrix (GLSZM), and neighbouring gray-tone difference matrix (NGTDM) features.
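As an illustration of the texture features listed above, the following is a minimal NumPy sketch of a gray-level co-occurrence matrix and its contrast feature; it is a simplified stand-in for the pyradiomics implementation, with the level count and pixel offset chosen for illustration only.

```python
import numpy as np

def glcm(img, levels=8, offset=(0, 1)):
    """Gray-level co-occurrence matrix for one pixel offset.

    `img` is a 2D integer array already quantized to `levels` gray levels.
    Counts pairs (img[r, c], img[r+dr, c+dc]) and normalizes to sum to 1.
    """
    dr, dc = offset
    m = np.zeros((levels, levels))
    rows, cols = img.shape
    for r in range(rows - dr):
        for c in range(cols - dc):
            m[img[r, c], img[r + dr, c + dc]] += 1
    return m / m.sum()

def glcm_contrast(m):
    """Contrast = sum_{i,j} (i - j)^2 * p(i, j), one of the GLCM features."""
    i, j = np.indices(m.shape)
    return float(((i - j) ** 2 * m).sum())

# A constant image has zero contrast; a checkerboard of extreme levels is high.
flat = np.zeros((8, 8), dtype=int)
checker = np.indices((8, 8)).sum(axis=0) % 2 * 7  # alternating levels 0 and 7
```

pyradiomics computes this matrix over several offsets and aggregates many such scalar statistics per matrix type.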

#### *4.3. Multi-Task Learning*

To encourage a network to focus on desirable localities, such that it generalizes based on the actual abnormalities within the provided anatomical regions, adding a secondary task that shares some of the features is a promising approach. When networks are trained to predict different but related tasks in tandem, performance on each of the tasks has been shown to benefit [36]. The auxiliary information from a secondary task can aid the main task and also serves to regularize the network.

A big motivation behind using multi-task learning for this work is the availability of the radiologists' annotations for different abnormalities in the lung, localized to the six divisions (sextants) of the lungs. While this information will not be available during testing, it can be utilized to regularize the main model and to focus the attention of the network to supposedly relevant areas of the image. For our drug resistance classification, the network consists of a pretrained CNN network that is trained with binary cross entropy loss. For multi-tasked networks, the combined loss [37] for the two tasks is as follows:

$$\mathcal{L} = \frac{1}{\sigma_1^2} \times \mathcal{L}_1(\mathcal{W}) + \frac{1}{\sigma_2^2} \times \mathcal{L}_2(\mathcal{W}) + \log \sigma_1 + \log \sigma_2, \tag{1}$$

where *W* represents the weights of the network, and *σ*<sub>1</sub> and *σ*<sub>2</sub> are the noise parameters for the respective tasks, which determine the relative weights given to each of the losses.
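For illustration, Equation (1) can be transcribed directly. This is a minimal sketch: optimizing log *σ* instead of *σ* (to keep it positive) is an assumed parameterization, and the formulation in [37] additionally includes a factor of 1/2 in the weighting terms.

```python
import math

def combined_loss(l1, l2, log_sigma1, log_sigma2):
    """Uncertainty-weighted two-task loss of Equation (1).

    l1, l2 are the per-task losses; the log-sigmas are learned scalars
    (an assumed parameterization that keeps sigma positive).
    """
    s1, s2 = math.exp(log_sigma1), math.exp(log_sigma2)
    return l1 / s1 ** 2 + l2 / s2 ** 2 + log_sigma1 + log_sigma2

# With sigma_1 = sigma_2 = 1, the loss reduces to the plain sum l1 + l2;
# a larger sigma down-weights its task at the cost of the log-penalty term.
```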

#### *4.4. Abnormal Sextant Classification*

The abnormal sextant information provides the locations of TB-related abnormalities to the network. These are expert-annotated features that provide additional context and information during DR/DS classification. Figure 6 shows the architecture for this type of multi-task learning. The classification model is modified such that the output of the last convolution layer branches into two stacks of fully-connected layers. The first branch is the same as in the standard architecture, where the network decides whether the X-ray image shows manifestations of drug resistance or drug sensitivity. The second branch outputs a vector of length six, one value per sextant, indicating whether any of the 20 abnormality features described in the Data section above is present in that sextant. Hence, a sextant is considered 'abnormal' if at least one abnormality is present, and 'normal' otherwise. In this case, *L*<sub>1</sub> and *L*<sub>2</sub> in Equation (1) are both binary cross-entropy losses; for the secondary task, the loss is the average loss over all six outputs.
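The derivation of the six 'abnormal'/'normal' labels from the per-sextant abnormality flags can be sketched as follows; the nested-list encoding of the annotations is hypothetical and for illustration only.

```python
def sextant_labels(annotations):
    """Collapse per-sextant abnormality flags into six binary labels.

    `annotations` is a hypothetical 6 x 20 nested list of 0/1 flags
    (one row per sextant, one column per abnormality feature).
    A sextant is labeled 1 ('abnormal') if any of its 20 flags is set.
    """
    return [int(any(flags)) for flags in annotations]

# Example: a single abnormality in the first sextant.
ann = [[0] * 20 for _ in range(6)]
ann[0][3] = 1
```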

#### *4.5. Abnormality Segmentation*

Segmenting abnormalities provides location information, as the locations of each of the sextants are also available. For this task, the losses from Equation (1) are modified such that *L*<sub>1</sub> is the binary cross-entropy loss and *L*<sub>2</sub> is the combination of Jaccard loss and binary cross-entropy loss for the segmentation of abnormal regions. Figure 7 shows the modification of the base model into an encoder-decoder U-Net style architecture for the additional task of abnormality segmentation. Two approaches are considered for determining the abnormal ground-truth regions.

#### 4.5.1. Sextant Segmentation from Radiologist Annotation

The sextant annotations from the radiologist are converted into masks such that the location information of each sextant is also available to the network. Each pixel in a sextant with any abnormality present is set to 1, and to 0 otherwise. This is similar to the sextant classification, with the added information of the location of each sextant.
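A minimal sketch of this mask construction, assuming the sextants are obtained by splitting a lung bounding box into a 3×2 grid (three vertical zones by left/right half); the exact sextant geometry used in the paper may differ, e.g., per-lung thirds.

```python
import numpy as np

def sextant_mask(lung_bbox, abnormal, shape):
    """Build a binary ground-truth mask from six per-sextant flags.

    lung_bbox = (r0, r1, c0, c1) is split into a 3x2 grid (assumed
    geometry); `abnormal` lists the six 0/1 flags in row-major order.
    Pixels belonging to an abnormal sextant are set to 1.
    """
    r0, r1, c0, c1 = lung_bbox
    mask = np.zeros(shape, dtype=np.uint8)
    row_edges = np.linspace(r0, r1, 4).astype(int)  # three vertical zones
    col_edges = np.linspace(c0, c1, 3).astype(int)  # left / right halves
    for zone in range(3):
        for side in range(2):
            if abnormal[zone * 2 + side]:
                mask[row_edges[zone]:row_edges[zone + 1],
                     col_edges[side]:col_edges[side + 1]] = 1
    return mask
```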

#### 4.5.2. TB Abnormality Segmentation

Instead of using the abnormalities from clinical text data, we alternatively use the TB abnormality segmentation network to derive the ground truth. The Shenzhen dataset with annotations has a finer segmentation of abnormalities. In this approach, the chest X-ray images are segmented for lesions using the network trained on the Shenzhen data [30]. The advantage of this approach is that even images without annotations can be used for training.

**Figure 7.** Multi-output network for the additional task of abnormality segmentation. The pretrained backbone (ResNet18) is modified to be a U-Net with encoder and decoder. The inputs for this network are chest X-ray images whereas the segmentation output masks are derived from the clinical text data.

#### **5. Experimental Results**

We perform our experiments with a 5-fold cross-validation stratification. We also separate 7.5% of the training data to check inter-epoch performance and stop the model early once the performance on this set degrades for a long period. The pretrained network backbone, as described in Figure 5, is the ResNet18 [33] architecture. The number of parameters of ResNet18 (11 million) is roughly half that of InceptionV3 (22.3 million), which we previously used [17]. Even with the smaller network and smaller dataset (since samples are held out), the performance on the validation set was 79% AUC. As we convert these networks to a U-Net style segmentation network for secondary tasks, the difference in parameter counts increases even further. With the choice of ResNet18, we are able to transfer the pretrained weights from ImageNet [38] and keep the number of total trainable parameters small while having a consistent network for comparing different approaches.
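The splitting scheme described above can be sketched as follows. This is a simplified stand-in for standard stratified cross-validation utilities; the exact shuffling and holdout procedure used in the paper is not specified.

```python
import random

def stratified_folds(labels, k=5, holdout_frac=0.075, seed=0):
    """Split sample indices into k stratified folds plus a small
    early-stopping holdout set.

    From each class, ~7.5% of the samples go to the holdout; the rest
    are dealt round-robin into k folds so class proportions stay balanced.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    holdout = []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_hold = round(len(idx) * holdout_frac)
        holdout += idx[:n_hold]
        for j, i in enumerate(idx[n_hold:]):
            folds[j % k].append(i)
    return folds, holdout
```

In practice, a library routine such as scikit-learn's `StratifiedKFold` would typically be used for the fold assignment.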

#### *5.1. CNN-Based Drug Resistance Classification*

To examine whether our network is robust against domain changes and generalizes well to unseen data, we exclude the data from one country before cross-validation stratification and use it as a held-out set. As we see from Table 1, the only two countries that have more than 100 samples in each class are Belarus and Georgia; every other country has fewer than 25 samples in the minority class. Choosing other countries would lead to a highly imbalanced testing set or a testing set with very few samples per class. When we excluded Georgia to use it as a held-out set for evaluating generalization, the total number of samples decreased by almost half, leaving only 479 samples per class for the balanced training set. The AUC performance on the validation set was 78%, but the performance on the held-out Georgia data was 52%. Moreover, fewer than 25% of the Georgia patients had sextant information available, so the evaluation of multi-task learning was not feasible. The data from Belarus has been used previously [16] to both train and evaluate the classification of DR/DS TB from chest X-rays. For these reasons, we used only the data from Belarus as our held-out set and included Georgia's data in the training sets.

For our classification training, we experimented with two different sets of initialization weights. The first set of weights is from the ImageNet classification task and the second set is from the network trained for TB-abnormality segmentation described in Section 5.6. When we trained the model with the cropped images (Figure 3c), similar to the previous approach [17], the performance on the validation set was 73% AUC and on the Belarus dataset it was 55% AUC.

In an effort to improve generalization, explicit attention on the lung regions was provided by setting the areas outside the segmented lung to 0 (Figure 3d). This approach improved the classification performance on each of the datasets. Table 3 shows that the best AUC performance was observed on the validation set, using both the ImageNet classification and TB abnormality segmentation weights, with 79% AUC. On the Belarus dataset, the best AUC of 65% was observed with the dataset with sextant information and with ImageNet weights. Achieving a much better performance with this approach, we use the segmented lungs as an input to the following experiments.


**Table 3.** DR-TB/DS-TB Classification Performance on the Validation Set and the Belarus Set.

#### *5.2. Classification with Radiomic Features*

With the usage of non-learnable features, some acquisition-specific details can be hidden which may otherwise be easily identified by a sufficiently large deep network. For this purpose, 104 radiomic features are extracted with the aid of the pyradiomics library [35] (https://pyradiomics.readthedocs.io/en/latest/features.html (accessed on 10 January 2022)). The library calculates the features based on the X-ray image and the mask of the object of interest. We evaluate these features on both the lungs and the rest of the image by providing the lung masks and the complement of the lung masks, respectively. For our classifiers, we use standard machine learning algorithms such as support vector classifiers (SVC), k-nearest neighbors (k-NN), Random Forest (RF), and multi-layer perceptron (MLP).

The support vector classifier achieved the best performance on the validation set and the Belarus dataset, as seen in Table 4. Surprisingly, the best validation performance (74.5%) was obtained when the lung region was excluded, that is, when only non-lung parts of the image were used to derive the features. The corresponding performance on the Belarus dataset was 62.8%.

**Table 4.** AUC Performance with radiomic features.


#### *5.3. Classification with Sextant Divisions*

The location of abnormalities is useful for classifying tuberculosis [39,40]. The sextant-based annotations are localized features that indicate different abnormalities within the lung. We also divide the chest X-rays into six divisions, similar to how they were annotated by radiologists.

We classify DR-TB and DS-TB from the annotations acquired and our divided chest X-ray images. As described in Table 2, there are 20 such features for each of the sextants. Hence, there are 120 features in total. Figure 8 shows how abnormalities are more frequent in the apex of the lung.

Figure 9 shows the classification performance when individual sextants, the entire lung, and the top, middle, and bottom regions were evaluated for DR-TB vs. DS-TB classification. On the validation set, the CNN classifier trained on chest X-rays performed similarly regardless of the lung location used for training. With the annotated data and classical machine learning classifiers, the top sextants were more discriminative than the bottom sextants. On the Belarus dataset, however, classical machine learning classifiers were not able to discriminate between the two classes with much success; an MLP (multi-layer perceptron) classifier was able to achieve 60% AUC. When a single sextant was used, all sextants performed similarly at around 60%. Training on the entire image yielded the best results (65%) on the CXR images.

When we reduce the number of features to use just the location information or the type of abnormality, providing the location yielded better AUC performance (62.9%) on the Belarus dataset. However, on the validation set, providing the type of abnormality performed better (AUC of 70.0%) as shown in Table 5. 'Location' refers to the presence of any abnormality in sextants whereas 'Type' refers to the presence of one of the 20 abnormalities listed in Table 2 in any area of the lung.
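The reduction of the 6×20 sextant annotations into the 'Location' and 'Type' feature sets can be expressed compactly; the 6×20 nested-list layout is again a hypothetical encoding for illustration.

```python
def location_and_type(features):
    """Collapse 6 x 20 sextant annotation flags into the two reduced views.

    'Location' (6 values): whether any abnormality is present per sextant.
    'Type' (20 values): whether each abnormality is present in any sextant.
    """
    location = [int(any(row)) for row in features]
    n_types = len(features[0])
    type_ = [int(any(row[j] for row in features)) for j in range(n_types)]
    return location, type_

# Example: two different abnormalities, both in the third sextant.
feats = [[0] * 20 for _ in range(6)]
feats[2][5] = 1
feats[2][7] = 1
```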

**Figure 8.** Abnormality occurrence heatmap in different regions of the lung derived from radiologists' annotated sextant data.

**Figure 9.** Drug Resistance Classification—AUC performance on sextants.

**Table 5.** Performance when location and type are separated for annotated radiologic features.


#### *5.4. Classification with Data and Network Capacity Limitations*

The classification performance with radiomic features derived from the non-lung region prompted us to further examine the performance of our CNN classifier with limited information (Table 6). The limitations we introduced concern both the input data and the training networks.

To further investigate biases in the data that are supposedly unrelated to the underlying disease manifestations, we limited the information received by the network or limited the capacity of the network itself. This was achieved by modifying the data as well as the training network. Figure 10 shows examples of different ways in which the X-ray images were manipulated to reduce the information input to the classifier network. The information from the chest X-rays was limited or diminished by randomizing pixel locations. For example, in one experiment, the intensities of the entire image were randomized; this conceals the spatial relationships between pixels but preserves the histograms and first-order statistics of the image intensity values. In further experiments, only certain regions of the image were randomized and the rest were set to 0. Lung masks were also used as input, where the pixel intensity values are lost but the shape of the lungs remains intact. Another approach was to subtract the mean of the background (non-lung region) and re-normalize each image.
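A few of these input-limiting transforms can be sketched as follows; the mode names are invented for illustration and correspond roughly to panels (e)-(h) of Figure 10.

```python
import numpy as np

def limit_input(img, lung_mask, mode, seed=0):
    """Information-limiting input variants (a sketch; mode names invented).

    'randomize_all'  : shuffle every pixel (keeps the intensity histogram),
    'lung_only'      : keep lung pixels, zero the background,
    'randomize_lung' : shuffle pixels inside the lung, zero the background,
    'mask_only'      : binary lung shape without intensities.
    """
    rng = np.random.default_rng(seed)
    out = img.copy().astype(float)
    if mode == "randomize_all":
        flat = out.ravel()
        rng.shuffle(flat)          # in-place shuffle through the view
        out = flat.reshape(img.shape)
    elif mode == "lung_only":
        out[~lung_mask] = 0
    elif mode == "randomize_lung":
        vals = out[lung_mask]      # copy of the lung intensities
        rng.shuffle(vals)
        out = np.zeros_like(out)
        out[lung_mask] = vals
    elif mode == "mask_only":
        out = lung_mask.astype(float)
    return out
```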

**Figure 10.** Different ways in which a chest X-ray image is modified to limit the information the classifier relies on. Top (left to right): (**a**) Cropped X-ray image, (**b**) lung mask, (**c**) segmented lung, (**d**) segmented lung with mean of background subtracted. Bottom (left to right): (**e**) Entire image randomized, (**f**) lung pixels randomized with non-lung area set to 0, (**g**) non-lung pixels randomized with lung pixels set to 0, and (**h**) lung pixels set to 0.

**Table 6.** Performance with input and model limitations.


To limit the network and its capacity, all the convolutional layers of the CNN were frozen (set to be non-trainable); these layers were used as a feature extractor, and the fully-connected layers acted as a trainable classifier. When the lung pixel values were set to 0 (lung excluded), the performance on the verification test set (Belarus data) with the CNN network was 59%; histogram normalization did not yield better results here. As expected, when we completely randomized the pixel locations of the entire image or of the lung regions, the performance was random (50%) on the balanced Belarus dataset. Using the shape of the lungs (lung masks) without the intensity values was enough to improve that performance to 55%. When intensity values of the non-lung regions (background) were provided, the performance further improved to 59%; on the same images, if the background pixels were randomized again, the performance dropped back to 55%. This series of results hints that the shape of the lung itself carries useful information. Interestingly, however, the non-lung regions also made a small contribution to the identification of drug resistance: the performance on both datasets improved when the local information of the non-lung regions was retained.

Freezing the convolutional layer weights (ImageNet weights) while allowing the fully-connected layer weights to be trainable achieved a comparable performance of 62%. Here, the frozen convolutional layers act as fixed feature extractors, and the dense layers learn to interpret them. The approach and the performance were comparable to using the radiomic features. Normalizing by subtracting the mean of the non-lung regions from the lungs also had a similar performance of 61%.

#### *5.5. Country Classification*

Being able to identify the origin of the chest X-rays can be insightful in understanding the extent of bias in the data based on the data origin and the acquisition standards within a country. As seen in Table 1, the distribution of samples from different countries is not uniform and neither is the distribution of samples of each class. The chest X-ray images originate from a variety of imaging devices from hospitals in several different countries with their own imaging protocols and other variances that affect image content. These types of artifacts can often be identified by a deep neural network but may not be visible to a human observer. The entire dataset was used for the country classification experiments. Because the number of samples was imbalanced, we used weights based on the number of samples for the categorical cross entropy loss function during our training.
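One common way to derive such loss weights from per-class sample counts is inverse-frequency weighting; the paper does not state its exact weighting scheme, so this sketch is an assumption.

```python
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights for a weighted categorical cross entropy.

    Each class gets total / (n_classes * count), so a perfectly balanced
    dataset yields weight 1.0 for every class. (Assumed scheme; the paper
    only states that weights were based on the number of samples.)
    """
    counts = Counter(labels)
    total, k = len(labels), len(counts)
    return {c: total / (k * n) for c, n in counts.items()}
```

Minority-country samples then contribute proportionally more to the loss, counteracting the imbalance in Table 1.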

The mean intensities of the input images to the training network were plotted to observe the distribution differences across countries in Figure 11. The width of each violin plot represents the frequency of the mean intensity: generally, the wider the violin plot, the higher the probability that the images of the respective country have the corresponding mean intensity. Histogram equalization centers the mean of the image intensities to zero for each of the countries and helps to reduce some of the bias present in the dataset due to the images' country of origin, though the variance in the intensity distribution remains.
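Histogram equalization itself can be sketched in a few lines; this is the standard CDF-based remapping on integer-valued images, not necessarily the exact implementation used here.

```python
import numpy as np

def equalize(img, levels=256):
    """Histogram equalization via the cumulative distribution of intensities.

    `img` must be an integer image with values in [0, levels). Each gray
    level is mapped to its CDF value so the output histogram is roughly
    uniform, reducing per-device intensity bias.
    """
    hist = np.bincount(img.ravel(), minlength=levels)
    cdf = hist.cumsum() / img.size
    return (cdf[img] * (levels - 1)).astype(np.uint8)
```

The mapping is monotonic, so the ordering of gray levels (and thus anatomical contrast) is preserved while absolute intensities are normalized.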

The multi-class country-of-origin classification from the X-ray images achieves an accuracy of 85.7%. When histogram equalization is applied to the same images to remove some of the intensity-based biases, the performance decreases to 82.6%. These results show that the models are highly effective at deriving the country of origin from the chest X-ray images. Even with histogram equalization, the country classification performance did not decrease sharply, which points to other biases in the data (potentially other acquisition biases) that, unlike the country of origin, are not directly accessible.

We also performed a multi-class country-of-origin classification based on clinical text data. Demographic features such as gender, age, and education were included along with radiological findings from chest X-rays (such as nodules, cavities, infiltrates, collapses, etc.). The multi-class country-of-origin classification with seven demographic features and 20 radiological features resulted in an accuracy of 59.3%, and the classification with just the 20 radiological features showed an accuracy of 35.2%. The confusion matrices in Figure 12 show that the image-derived features produce many fewer false predictions when classifying the country of origin compared to the model trained on clinical features. The clinical classifier relies on disease-related findings, whereas the deep learning classifier uses the image content, which potentially includes information that is not related to the disease and not visible to a human observer but detectable by the network, allowing it to obtain better performance. Consequently, the DR/DS classifier also has access to this non-disease-specific information, which may introduce confounding features into that model, improving its performance on the training domain but harming its generalizability.

#### *5.6. Tuberculosis Abnormality Segmentation*

Transfer learning with pretrained weights has been used effectively not only to reduce training time compared to random initialization but also to obtain better performance. Intuitively, it also makes sense to use TB abnormalities as priors for the drug resistance classification. Hence, we utilize the TB abnormality segmentation network to aid the classification of drug resistance, using its weights and output for the multi-task classification. We use the Shenzhen dataset with TB lung lesions [30], which has two sets of annotations for the same images.

The average segmentation overlap between the two sets of annotations was 0.538 (Dice score). For the TB abnormality segmentation network, the cross validation Dice score was 0.636. The weights for this network are used to initialize the classification and multi-task models. For the classification tasks, only the encoder weights were used, whereas for the segmentation tasks all the convolutional layer weights were used.
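The Dice score used for both numbers above is the standard overlap measure between two binary masks, 2|A∩B| / (|A| + |B|); a minimal sketch:

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks: 2|A n B| / (|A| + |B|).

    Returns 1.0 for two empty masks (a common convention; other choices
    exist for that degenerate case).
    """
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * (a & b).sum() / denom if denom else 1.0
```

An inter-annotator Dice of 0.538 therefore indicates that even two experts agree on only about half of the overlapping lesion area, which contextualizes the model's 0.636.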

**Figure 11.** Mean intensity distributions per country for the normalized cropped X-ray images (**top**) and the same images after histogram equalization is applied (**bottom**).

**Figure 12.** Summation of confusion matrices on test sets from 5 folds on the country-of-origin classification task. (**top-left**) Lung-segmented images, (**top-right**) histogram-equalized lung-segmented images, (**bottom-left**) radiological and demographic features and (**bottom-right**) radiological features. Note that the image-derived features have a significantly better performance than the disease-specific radiological findings.

Table 7 shows how well the annotations by radiologists matched the abnormalities identified by the TB abnormality segmentation model, in each of the sextants and overall. The average GradCAM [41] map values for each of the sextants were poorly correlated with the sextant annotations. Abnormality predictions were also derived from the mean of the GradCAM heatmaps (with a threshold of 0.7) and compared with the abnormalities from the TB segmentation model. In both comparisons, the top sextants had very little overlap compared with the middle and bottom sextants. This result is contrary to expectation, as TB abnormalities are often seen at the top of the lung [42].
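The per-sextant GradCAM comparison can be sketched as follows, assuming the heatmap is split into a 3×2 grid as a stand-in for the anatomical sextants (the paper's sextants follow the lung anatomy, so this grid split is a simplification).

```python
import numpy as np

def sextant_scores(heatmap, threshold=0.7):
    """Mean GradCAM value per sextant plus the thresholded binary prediction.

    `heatmap` holds values in [0, 1]; it is split into a 3x2 grid
    (row-major: top-left, top-right, ..., bottom-right) as a simplified
    stand-in for the anatomical sextants. A sextant is flagged abnormal
    when its mean activation exceeds the threshold.
    """
    rows = np.array_split(heatmap, 3, axis=0)
    means = [blk.mean()
             for row in rows
             for blk in np.array_split(row, 2, axis=1)]
    return means, [int(m > threshold) for m in means]

# Example heatmap: strong activation confined to the top-left sextant.
h = np.zeros((6, 4))
h[:2, :2] = 1.0
```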

Figure 13 shows some examples of DR-TB and DS-TB. GradCAM heatmaps show the average of activations from the last layer of a CNN. In these examples, it can be seen that activations are more frequent for the DR-TB class. This is consistent with previous results [9], where DR-TB patients were shown to have more abnormalities in their lungs.


**Table 7.** Comparison (AUC) of localized abnormalities from different sources tested on Belarus dataset.

**Figure 13.** GradCAM visualizations on X-ray images. **Top row** shows examples of drug-sensitive TB and **bottom row** shows examples of drug-resistant TB.

#### *5.7. Multi-Task Based Classification*

As an approach to assist the drug resistance classifier, a secondary task was added to further incentivize the network to focus on relevant areas. The secondary task acts as a regularizer, constraining the neural network to focus on the regions of interest. ResNet18 was used as the primary backbone for the networks, and layers were added to generate the output for the secondary tasks. Instead of experimentally determining the best loss weights for the model, we allow them to be learned by the model itself.

Initially, both tasks were given equal weights to allow the network to determine which task to prioritize. To constrain the scope of the work, while we monitored the performance on the secondary task, we used only the performance on the main task to determine our experimental setup, and hence we report the performance on the main task only.

While adding a related secondary task did not improve the drug resistance classification on the validation set, the AUC performance on the Belarus dataset improved by about 2%–3% and accuracy improved by 1%. An AUC of 68% (±1%) was the best performance achieved on the Belarus dataset as shown in Table 8.


**Table 8.** Classification performance with an additional task.

#### **6. Conclusions**

This paper explores the cross-validation and generalization performance achieved for drug-sensitive versus drug-resistant TB classification on chest X-rays from different countries. By excluding data from one country of origin from training and using it for testing, we evaluate classifier performance on unseen data. The generalization performance was much lower (65% AUC) than the cross-validation performance (79% AUC). The same CNN architecture was able to classify the country of origin from a chest X-ray image. Evaluations with radiomic features from X-ray images, and experimental limitations to the data and classifier, indicated that the model based its decisions on other artifacts present in the images. TB lesions annotated by radiologists were utilized to see if the location information was useful for discriminating between drug-resistant and drug-sensitive cases. While GradCAM heatmaps from the X-ray image-based CNN model did not overlap significantly with the TB lesions and the annotations from radiologists, adding a secondary task related to the localization of lesions did improve the classification performance to 68% AUC. Because of an imbalanced dataset, an insufficient number of samples in one of the two classes, and the lack of clinical text data describing the radiological findings for all patients, we excluded only a single country for our generalization evaluation. A solution that does not require annotations by radiologists to improve the generalization performance would be more valuable; procedures and methods that allow the model to pick up only the manifestations of disease are a direction for future research. In general, we believe that experiments addressing generalization to new datasets should be standard practice in medical image analysis with deep learning.

**Author Contributions:** Conceptualization, Z.Y. and S.J.; methodology, M.K., K.K., F.Y. and H.Y.; software, M.K., K.K., F.Y. and H.Y.; validation, M.K., K.K. and Y.X.J.W.; formal analysis, M.K., K.K., F.Y. and H.Y.; investigation, M.K., K.K. and F.Y.; resources, Y.X.J.W. and Z.Y.; data curation, Y.X.J.W. and Z.Y.; writing—original draft preparation, M.K. and K.K.; writing—review and editing, F.Y., Z.Y. and S.J.; visualization, M.K. and H.Y.; supervision, Z.Y. and S.J.; project administration, S.J.; funding acquisition, S.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Data usage is exempt from local institutional review board review as it is publicly available from the TB portals program. The TB portals program participants are responsible for ensuring compliance with their countries' laws, regulations, and ethical considerations.

**Data Availability Statement:** Links to datasets used in this study are provided in Section 3.

**Acknowledgments:** This work was supported by the Office of the Secretary Patient-Centered Outcomes Research Trust Fund (OS-PCORTF), under Interagency Agreement #750119PE080057, and by the Intramural Research Program of the National Library of Medicine, National Institutes of Health. This project has also been funded in part with federal funds from the National Institute of Allergy and Infectious Diseases under BCBB Support Services Contract HHSN316201300006W/HHSN27200002.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

