**1. Introduction**

Artificial intelligence (AI) systems exceeding expert performance have shortcomings when they are applied to data outside their training domains. At present, such AI systems lack a form of context awareness that would allow the model to reject data outside its learned feature space. Since medical examinations often include an extensive range of anatomical checks, there is a risk that AI-based automated lesion detectors will be applied outside the target domain. Potentially, when inexperienced clinicians rely on the algorithm, this might lead to higher false-positive and false-negative rates and thereby to errors in the diagnosis of malignancies, to the detriment of patients. Assistive tools for automatic lesion detection should therefore be designed for robustness and accuracy with standard clinical practice in mind.

In the field of gastroenterology, a Computer-Aided Detection (CAD) system has been developed for Barrett's neoplasia detection in white-light endoscopic still images [1], achieving expert-level performance. Yet, this algorithm is trained and validated only on the visual features of Barrett's esophagus (BE). In clinical practice, it is common to assess the complete esophagus, from the stomach up to the healthy squamous region. Therefore, the application of the current model should be restricted to the analysis of the Barrett's region of the esophagus.

In order to facilitate continuous analysis of the video signal during the full clinical protocol, a vast pool of new relevant and irrelevant features needs to be taken into account. For example, optical tissue deformation, which can be estimated from consecutive frames, is an inherent cell marker for detecting malignant morphological changes, according to Guck et al. [2]. In contrast, when ambiguous frames are introduced, the model could become unstable, according to Van der Putten et al. [3]. In order to deal with such ambiguity, the model should consider the context of prior frames for robust and reliable decision making. The consecutive frames in an endoscopy procedure do not differ substantially, and therefore information prior to an ambiguous frame can be exploited to make an accurate prediction. Such sequential models could also be used to improve position tracking during an endoscopy procedure. Accordingly, since Esophageal Adenocarcinoma (EAC) only occurs in a particular segment of the esophagus (i.e., in BE), frames that are captured outside this segment could be disregarded by an EAC detection algorithm, leading to a reduction in false alarms and increased user confidence in the CAD system.

Practically, different approaches and algorithms have been applied to time-series data, including independent frame analysis, averaging over the temporal domain, and hidden Markov models [4–8]. However, the absence of long-term memory in these models hampers the exploitation of long-distance interactions and correlations, which makes the corresponding algorithms unsuitable for learning the long-distance dependencies typically found in clinical data. Moreover, since existing image-based classification networks are trained on still overview images, their response to unseen non-informative frames is unknown. This implies that algorithms trained only on still images do not perform well on video signals without modification [9].

Recurrent Neural Networks (RNNs) can be used to provide a temporal flow of information. These networks have been widely used to learn the processing of sequential video data and are capable of dealing with long-term dependencies. In this type of artificial neural network, the connections between units form a directed cycle. This cycle creates an internal state of the network, which allows it to model dynamic temporal behavior without computation-intensive 3D convolutional layers. Recently, Yao et al. [10] demonstrated a state-of-the-art method for action recognition, which imposes Gated Recurrent Units (GRUs) on the deep spatiotemporal information extracted by a convolutional network. Furthermore, Yue et al. [11] and Donahue et al. [12] have successfully demonstrated the ability of RNNs to recognize activity based on a stack of input frames. A similar approach could be followed for the classification of tissue in videos, potentially leading to a more temporally stable algorithm, since it can exploit information in the temporal domain.
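For concreteness, one common formulation of the GRU state update is given below; here $x_t$ denotes the per-frame feature vector and $h_t$ the internal state, and the notation is ours rather than taken from the cited works:

$$
\begin{aligned}
z_t &= \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right), \\
r_t &= \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right), \\
\tilde{h}_t &= \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right), \\
h_t &= \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t.
\end{aligned}
$$

The update gate $z_t$ controls how much of the previous internal state is carried over to the current frame, while the reset gate $r_t$ controls how strongly that state influences the candidate update $\tilde{h}_t$; this gating is what equips the network with a learnable long-term memory.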

The literature describes a variety of methods to analyze video for classification tasks in endoscopy. The most basic form is a single-frame analysis for classification [13–15]. Other recent work on video analysis in endoscopy focuses on a frame-based analysis approach with additional post-processing to yield some form of temporal cohesion. Byrne et al. [16] describe a frame-based feature extractor, which interpolates a confidence score between consecutive frames, in order to make a more confident prediction for colorectal polyp detection. De Groof et al. [9] implement a voting system for multiple frames on multiple levels. Yu et al. [17] describe a 3D convolutional model, in order to capture inter-frame correlations. Yet, 3D convolutions fail to capture long-term information. Harada et al. [18] propose an unsupervised learning method, which clusters frame-based predictions, in order to improve temporal stability in tissue classification. Yet, a clustering approach is not able to capture the consecutive, inter-frame correlation between frames. Frameworks that do actively learn spatiotemporal information through the implementation of RNNs are described by Owais et al. [19] and Ghatwary et al. [20]. They demonstrate that the implementation of RNNs yields superior classification accuracies in endoscopic videos, but no quantitative results are reported on the stability of the employed models.

In this paper, we address the ambiguity in the classification of tissue in the upper gastrointestinal tract by introducing RNNs, as a first exploratory study towards a more robust system for endoscopic lesion detection. Our approach is generally applicable to CAD systems in the gastrointestinal tract and can potentially serve as a pre-processing step that reduces the number of false alarms for a wide range of endoscopic CAD systems. We hypothesize that by extending ResNet18 with RNNs, or more specifically, by employing Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers, the model is able to actively learn and memorize information seen in earlier frames, so that it makes a more accurate prediction of the tissue class than networks without temporal processing.
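A minimal sketch of such a hybrid architecture is shown below, assuming a torchvision ResNet18 backbone; the hidden size and sequence length are illustrative choices, not the exact configuration used in our experiments (the five output classes correspond to the tissue labels of Section 2.2):

```python
# Sketch of a CNN + RNN classifier: a ResNet18 backbone extracts per-frame
# features, which an LSTM (or GRU) aggregates over time before classification.
# Hidden size and sequence length are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class RecurrentTissueClassifier(nn.Module):
    def __init__(self, num_classes=5, hidden_size=256, rnn_type="lstm"):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features           # 512 for ResNet18
        backbone.fc = nn.Identity()                  # keep pooled features only
        self.backbone = backbone
        rnn_cls = nn.LSTM if rnn_type == "lstm" else nn.GRU
        self.rnn = rnn_cls(feat_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                        # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))   # (B*T, 512)
        feats = feats.view(b, t, -1)                 # (B, T, 512)
        out, _ = self.rnn(feats)                     # (B, T, hidden_size)
        return self.head(out[:, -1])                 # classify the last frame

# Example: a batch of 2 clips of 10 frames at 320 x 256 resolution.
model = RecurrentTissueClassifier()
logits = model(torch.randn(2, 10, 3, 256, 320))      # -> shape (2, 5)
```

Classifying only the final frame of each sequence matches the training setup described in Section 2.2, where sequences are selected such that their last frame is informative.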

Our contributions are therefore as follows. First, our work demonstrates that including temporal information in endoscopic video analysis leads to an improved classification performance. Second, we show that models exploiting the concepts of LSTM and GRU outperform conventional Fully Connected (FC) networks. Third, the proposed approach offers higher stability and robustness in classification performance, thereby paving the way for automated detection during the complete clinical endoscopic procedure.

#### **2. Materials and Methods**

#### *2.1. Ethics and Information Governance*

This work, and the involved local collection of data based on implied consent, received national Research Ethics (IRB) Committee approval from the Amsterdam UMC (No. NTR7072). De-identification was performed in line with the General Data Protection Regulation (EU) 2016/679.

#### *2.2. Datasets and Clinical Taxonomy*

To train and evaluate the classification performance of our networks, we collected a dataset consisting of 82 endoscopic pullback videos from 75 patients, which were recorded prospectively at the Amsterdam UMC. Written informed consent was obtained from all participants. In total, 46 out of the 82 videos were derived from patients who were diagnosed with high-grade dysplasia. These videos were captured using White Light Endoscopy in full high-definition format (1280 × 1024 pixels) with the ELUXEO 7000 endoscopy system (FUJIFILM, Tokyo, Japan). During the recording of a pullback video, the endoscope is slowly pulled from the stomach up to the healthy squamous esophagus tissue in one smooth sequential movement.

In our processing, we have sampled the videos at 5 frames per second at a resolution of 320 × 256 pixels. Each resulting frame was manually labeled by one out of a team of three experienced clinical research fellows with respect to tissue class and informativeness. The five tissue classes are 'stomach', 'transition-zone Z-line', 'Barrett', 'transition-zone squamous' and 'squamous', see Figure 1. Frames are labeled as non-informative if they exhibit out-of-focus degradation, visible esophagus contractions, video motion blur, broad visibility of bubbles, or excessive contrast from lighting. Of the total dataset of 20,663 frames, 19,931 are labeled as informative. For training, we have only selected sequences in which the last frame is labeled informative.
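As an illustration, this sequence-selection rule could be implemented as sketched below; the sequence length and the frame-record fields are hypothetical, since the exact data layout is not specified here:

```python
# Hypothetical sketch of the training-sequence selection rule: frames are
# sampled at 5 fps, and a sequence is kept only if its final frame (the one
# the model must classify) is labeled as informative.
# SEQ_LEN and the 'informative' field name are illustrative assumptions.
SEQ_LEN = 10

def select_training_sequences(frames, seq_len=SEQ_LEN):
    """frames: chronologically ordered records with an 'informative' flag."""
    sequences = []
    for end in range(seq_len - 1, len(frames)):
        if frames[end]["informative"]:  # only the last frame must be informative
            sequences.append(frames[end - seq_len + 1 : end + 1])
    return sequences
```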

**Figure 1.** Visual examples of each tissue label to be classified by the model, sourced from pullback videos. Recording of these pullback videos starts at the (**a**) Stomach and, while the endoscope is pulled at a constant speed, stops at the (**e**) Squamous area. (**a**) Stomach; (**b**) Transition-zone Z-line; (**c**) Barrett; (**d**) Transition-zone Squamous; (**e**) Squamous.
