**3. Experiments**

This section first describes the metrics used to measure the performance of our method, and then presents statistical analyses of the various classes and configurations to obtain a broad set of experimental results.

#### *3.1. Metrics*

The performance of the model is reported with two metrics: stability and accuracy. The stability of the network is measured as the average number of times the predicted label switches per video. Thus, if at time *t<sub>n</sub>* the model predicts label *a*, and at *t<sub>n+1</sub>* the model predicts label *b*, the predicted label switches from *a* to *b*. This change is counted as one label switch. Since our dataset contains 5 labels, and each label should be passed only once, a perfect score would be 4 label switches. The accuracy is measured as the label accuracy averaged per patient, in order to normalize for variable video length, as specified by Equation (1):

$$\text{Mean label Accuracy} = \frac{1}{N_p} \sum_{i=1}^{N_p} \text{Acc}(L_i), \tag{1}$$

where *L<sub>i</sub>* denotes the label of event *i* and Acc(·) is the accuracy function. In Equations (1) and (2), *T<sub>P</sub>* is true positive, *N<sub>v</sub>* is the number of frames, and *N<sub>p</sub>* is the number of patient cases. True positives are defined as in Table 1, and the label accuracy is specified by

$$\text{Label Acc}(L) = \frac{1}{N_v} \sum_{j=1}^{N_v} T_{Pj}. \tag{2}$$
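As an illustration, the two metrics can be sketched in a few lines of Python. This is a minimal sketch and not the implementation used in our experiments: the function names are hypothetical, predictions are assumed to be given as one label per frame, and the true-positive terms of Equation (2) are assumed to be available as 0/1 flags per frame.

```python
from typing import Hashable, Sequence


def count_label_switches(predictions: Sequence[Hashable]) -> int:
    """Count how often the predicted label changes between consecutive frames."""
    return sum(1 for prev, cur in zip(predictions, predictions[1:]) if prev != cur)


def mean_label_switches(videos: Sequence[Sequence[Hashable]]) -> float:
    """Stability: average number of label switches per video."""
    if not videos:
        raise ValueError("at least one video is required")
    return sum(count_label_switches(v) for v in videos) / len(videos)


def label_accuracy(true_positive_flags: Sequence[int]) -> float:
    """Equation (2): fraction of frames flagged as true positive (1) in one video."""
    if not true_positive_flags:
        raise ValueError("a video must contain at least one frame")
    return sum(true_positive_flags) / len(true_positive_flags)


def mean_label_accuracy(per_patient_flags: Sequence[Sequence[int]]) -> float:
    """Equation (1): label accuracy averaged over patient cases."""
    if not per_patient_flags:
        raise ValueError("at least one patient case is required")
    return sum(label_accuracy(f) for f in per_patient_flags) / len(per_patient_flags)
```

For example, the prediction sequence `["a", "a", "b", "b", "a"]` contains two label switches, and two patients with flags `[1, 1, 0, 1]` and `[1, 0]` have label accuracies of 0.75 and 0.5, giving a mean label accuracy of 0.625 regardless of the difference in video length.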

Table 1 is read as follows. For rows that list two or three labels, a prediction is considered correct if the ground-truth label is any one of the listed labels. For example, if Barrett's tissue is predicted, the prediction is deemed correct when the ground-truth label from the clinician annotation is 'Transition Z-line', 'Barrett' or 'Transition Squamous'.

This mapping of the annotation labels from the ground truth is chosen to compensate for the ambiguity of the transition zones, since the transition zones adjacent to a distinct class are now mapped onto that class. Consequently, this metric will better separate non-metaplastic and metaplastic tissue. This is important because a metaplastic area is the region of interest to examine for Barrett's neoplasia.
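The tolerant matching described above can be sketched as a lookup from each predicted label to its set of accepted ground-truth labels. The sketch below is illustrative only: the names `ACCEPTED_GROUND_TRUTH` and `is_true_positive` are hypothetical, only the Barrett row stated in the text is filled in, and the fallback to an exact match for labels without an entry is an assumption; the remaining rows should be taken from Table 1.

```python
from typing import Dict, Optional, Set

# Accepted ground-truth labels per predicted label.
# Only the Barrett example from the text is listed here; the other rows
# of Table 1 are intentionally left out rather than guessed.
ACCEPTED_GROUND_TRUTH: Dict[str, Set[str]] = {
    "Barrett": {"Transition Z-line", "Barrett", "Transition Squamous"},
}


def is_true_positive(
    predicted: str,
    ground_truth: str,
    accepted: Optional[Dict[str, Set[str]]] = None,
) -> bool:
    """Return True if the ground truth is among the labels accepted for the prediction."""
    table = ACCEPTED_GROUND_TRUTH if accepted is None else accepted
    # Assumption: a predicted label without an entry must match exactly.
    return ground_truth in table.get(predicted, {predicted})


def true_positive_flags(predictions, ground_truths, accepted=None):
    """Per-frame 0/1 flags, usable as the T_Pj terms of Equation (2)."""
    return [
        int(is_true_positive(p, g, accepted))
        for p, g in zip(predictions, ground_truths)
    ]
```

With this mapping, a frame predicted as 'Barrett' counts as a true positive when the clinician annotated it as 'Transition Z-line', which is the behaviour the metric relies on to absorb the ambiguity of the transition zones.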

**Table 1.** Correspondence between predicted labels and ground-truth annotations of the clinicians.

