*3.2. Statistical Analysis*

The network performance is evaluated using fivefold cross-validation. The dataset is divided into five equally sized folds at the patient level, such that each individual patient is represented in exactly one fold. The performance reported in this paper is the average accuracy over all folds and patients. To substantiate the statistical significance, we also perform a Wilcoxon signed-rank test [22]. This is a nonparametric test for matched-pair data, which assesses whether the distribution of the paired differences is centered around zero.
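As a minimal sketch of this evaluation protocol, the comparison between two classifiers could be carried out as follows, assuming per-patient accuracies are available as arrays (the array names and values below are hypothetical illustrations, not the paper's data; `scipy.stats.wilcoxon` implements the signed-rank test on matched pairs):

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-patient accuracies for two classifiers
# (matched pairs: one value per patient, same patients for both models).
acc_fc = np.array([0.81, 0.79, 0.85, 0.80, 0.83])
acc_lstm = np.array([0.86, 0.84, 0.88, 0.85, 0.87])

# Report the mean accuracy over all folds/patients, as in the paper.
print(f"FC mean accuracy:   {acc_fc.mean():.3f}")
print(f"LSTM mean accuracy: {acc_lstm.mean():.3f}")

# Test whether the paired differences are centered around zero.
statistic, p_value = wilcoxon(acc_fc, acc_lstm)
print(f"Wilcoxon signed-rank test: W = {statistic:.1f}, p = {p_value:.4f}")
```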

## **4. Results**

The output stability is measured for four models: FC, FC Avg(n = 5), LSTM, and GRU. On average, the networks switch labels the following number of times per video: FC 43.27 (±23.83), FC Avg(n = 5) 18.48 (±9.76), LSTM 10.81 (±5.68), and GRU 11.91 (±6.33). These results demonstrate that the models implemented without RNNs switch labels 2-4 times more often within a single video. This is especially apparent in the videos in which the model performs poorly, see Figure 3.
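The label-switch count used above can be computed as the number of positions at which consecutive frame predictions differ. A minimal sketch, where the function name and example sequence are our own illustration:

```python
import numpy as np

def count_label_switches(predictions: np.ndarray) -> int:
    """Count how often the predicted label changes between consecutive frames."""
    return int(np.count_nonzero(predictions[1:] != predictions[:-1]))

# Hypothetical per-frame predictions for one video (integer class labels).
preds = np.array([0, 0, 0, 1, 1, 0, 2, 2, 2, 2])
print(count_label_switches(preds))  # -> 3 (switches at frames 3, 5, and 6)
```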

Table 2 displays the mean accuracy of the four models for the five tissue classes. The results show that averaging over five consecutive frames yields a performance improvement of 0.8%. The introduction of RNNs into the classification model yields an increase of 3.7% in overall accuracy, as seen from the accuracies of 85.9% and 85.6% for LSTM and GRU, respectively, compared to 82.2% for FC. Detailed per-class performances for each model are provided in the confusion matrices in Figure 4. The Wilcoxon signed-rank test yields *p* < 0.001 for all pairwise comparisons of the accuracies of the FC, LSTM, and GRU classifiers, which confirms the statistical significance of the results.
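The FC Avg(n = 5) baseline can be understood as a moving average over the per-frame class probabilities. A minimal sketch, assuming softmax outputs of shape (frames, classes); the implementation details are our own illustration, not the authors' exact code:

```python
import numpy as np

def averaged_predictions(probs: np.ndarray, n: int = 5) -> np.ndarray:
    """Average softmax outputs over a sliding window of n consecutive frames
    and return the argmax label per frame."""
    kernel = np.ones(n) / n
    # Convolve each class channel along the time axis ('same' keeps length).
    smoothed = np.stack(
        [np.convolve(probs[:, c], kernel, mode="same") for c in range(probs.shape[1])],
        axis=1,
    )
    return smoothed.argmax(axis=1)

# Hypothetical softmax outputs for 6 frames and 3 classes.
probs = np.random.default_rng(0).dirichlet(np.ones(3), size=6)
print(averaged_predictions(probs, n=5))
```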

**Table 2.** Mean accuracy per tissue class for the various architecture configurations. Scores are averaged over all patient cases. The mean label accuracy corresponds to the label classification from Table 1. The reported classifiers are the Fully Connected (FC) classifier, the Fully Connected Averaged classifier (FC Avg, n = 5), the Long Short-Term Memory (LSTM) classifier, and the Gated Recurrent Unit (GRU) classifier.


An important observation is the accuracy of 98.3% on the Barrett's segment in the esophagus. A good performance on this label is crucial, since this model will be used as an a priori tissue-classification step to increase the robustness of lesion detection. High sensitivities are generally preferred in this field, since a false positive only leads to an extra biopsy, while a false negative can cause severe harm to the patient. The demonstrated accuracy implies that roughly 1.7% of the Barrett's images are falsely rejected and thereby excluded from lesion detection. We consider this acceptable because, during inference, we are able to process up to 180 frames per second for real-time video analysis. Even if some frames are rejected, the time gap between the analyzed frames remains small, so that the Barrett's area is still effectively analyzed in full.
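To make the time-gap argument concrete, a back-of-the-envelope calculation follows; the 180 fps throughput and 1.7% rejection rate come from the text, while the assumption that rejections are spread uniformly over the video is our simplification:

```python
fps = 180.0          # inference throughput reported in the text
reject_rate = 0.017  # fraction of Barrett's frames falsely rejected

frame_gap_ms = 1000.0 / fps                # gap between consecutive frames
effective_fps = fps * (1.0 - reject_rate)  # frames surviving rejection
print(f"Nominal gap between frames: {frame_gap_ms:.2f} ms")   # ~5.56 ms
print(f"Effective throughput:       {effective_fps:.1f} fps") # ~176.9 fps
```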

**Figure 3.** Each bar illustrates the predicted organ labels per frame over a time axis. The figure illustrates the instability of the compared network architectures at three different performance levels: best, median, and worst performance for each model, respectively. The average number of label switches is 43.27 for the Fully Connected classifier, 18.49 for the Fully Connected Averaged (n = 5) classifier, 10.81 for the Long Short-Term Memory classifier, and 11.91 for the Gated Recurrent Unit classifier.

**Figure 4.** Confusion matrices displaying, for each true label, the average percentage of frames predicted as each class. These values are normalized over the patient cases. (**a**) Fully Connected classifier, (**b**) Fully Connected Averaged (n = 5) classifier, (**c**) Long Short-Term Memory classifier, and (**d**) Gated Recurrent Unit classifier.

## **5. Discussion and Conclusions**

In this work, we have explored the use of Recurrent Neural Networks (RNNs) for true temporal analysis of endoscopic video. In particular, we have evaluated two popular RNN architectures, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU), for tissue classification in endoscopic videos. This is a particularly interesting application, since current CAD systems show a relatively high number of false classifications for video frames captured outside the organ of interest. Reliably detecting the organ that is currently in view can therefore lead to an increased CAD performance. We demonstrate that exploiting temporal information in latent space yields much more stable classification behavior than simple frame averaging. Hence, the results confirm our hypothesis that leveraging RNNs stabilizes the classification output of the model. Moreover, by learning the temporal flow, we also observe an increase in accuracy for all tissue classes. For the application of Barrett's cancer detection, the proposed system reliably detected the tissue of interest, i.e., the Barrett label, with an accuracy of 98.3%. These results serve as a proof of concept, so the presented models do not yet yield optimal results. In future work, we will address this limitation by conducting an ablation study to find the optimal parameters.
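A minimal PyTorch sketch of the kind of recurrent classification head evaluated here: an LSTM (interchangeable with `nn.GRU` for the GRU variant) operating on a sequence of latent CNN feature vectors. The dimensions, layer choices, and class count are illustrative assumptions, not the authors' exact architecture:

```python
import torch
import torch.nn as nn

class RecurrentTissueClassifier(nn.Module):
    """Classify each video frame from its latent CNN features, using an RNN
    to exploit temporal context (swap nn.LSTM for nn.GRU for the GRU variant)."""

    def __init__(self, feat_dim: int = 512, hidden_dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, frames, feat_dim) latent vectors from a CNN encoder.
        hidden, _ = self.rnn(features)  # (batch, frames, hidden_dim)
        return self.head(hidden)        # per-frame class logits

# Hypothetical usage: 2 videos, 30 frames each, 512-D latent features.
model = RecurrentTissueClassifier()
logits = model(torch.randn(2, 30, 512))
print(logits.shape)  # torch.Size([2, 30, 5])
```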

The classification performance on the stomach and squamous tissue remains relatively poor. This discrepancy can be observed in Table 2 and is mostly caused by the definition of the label-correspondence mapping in Table 1. Although the algorithm is able to approximate the tissue type, it often predicts the neighboring tissue type instead. This error can be readily understood, as there is no hard-defined transition at the visible border between tissue types, i.e., each view gradually transitions over time into the next one, so that adjacent tissue areas (and labels) visually exhibit similar features (see Appendix A).

To address this transition ambiguity, a score based on the agreement between observers could be introduced. However, in our current training protocol, only one annotated label is available per frame, originating from one of the three observers. By introducing multiple observers per frame, an agreement score can be calculated (e.g., by simple majority voting), which can be used to train future algorithms. Such an approach would take the ambiguity into account and could potentially also yield an additional ambiguity score, as sketched below.
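A minimal sketch of this agreement score, assuming three observer labels per frame: simple majority voting, with the fraction of agreeing observers as a soft per-frame score. The function name and labels are hypothetical:

```python
from collections import Counter

def majority_label_with_agreement(labels: list[str]) -> tuple[str, float]:
    """Return the majority-voted label and the fraction of observers that
    agree with it, usable as a per-frame agreement (ambiguity) score."""
    (label, count), = Counter(labels).most_common(1)
    return label, count / len(labels)

# Hypothetical annotations from three observers for one frame.
print(majority_label_with_agreement(["barrett", "barrett", "stomach"]))
# -> ('barrett', 0.666...)
```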

Another limitation is that the employed data is imbalanced at present. As can be seen in Table 2, the labels Stomach and Squamous are under-represented in the dataset. This imbalance partly explains the poor performance on these classes. To overcome the limited availability of data, future efforts will focus on collecting data from sources other than videos alone.

In conclusion, our work has demonstrated that incorporating temporal information in endoscopic video analysis can lead to improved classification performance. Exploiting the sequential bias present in endoscopic video (e.g., the order in which the tissue types are captured) yields not only a higher accuracy but also more stable classification behavior over time. Although directly applicable to EAC detection in BE patients, where it is likely to enhance CAD performance by reliably detecting Barrett's tissue, our approach can be generalized and easily translated to similar endoscopic video analysis tasks. Future experiments should explore such novel applications and focus on combining the proposed pre-processing system with several established downstream classification tasks.

**Author Contributions:** Conceptualization, T.B., J.v.d.P.; methodology, T.B.; validation, T.B.; resources, M.S. and K.F., J.J., E.S.; writing—original draft preparation, T.B.; writing—review and editing, P.d.W.; supervision, F.v.d.S.; funding acquisition, P.d.W. and J.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

