*2.3. Network Architectures and Training Protocol*

Our network architecture is split into two principal stages: a feature extractor followed by a classification network. Feature extraction is based on a modified ResNet18 network containing four fully convolutional blocks. This model is widely used for image classification and should therefore allow for an easier comparison with other literature. Different classifiers are evaluated in our experiments. Since we only want to measure the effect of applying RNNs on model robustness, we keep the classifiers as simple as possible. The following classifier configurations are tested: (1) two Fully Connected (FC) layers with a Rectified Linear Unit (ReLU) activation in between, (2) a two-layer Long Short-Term Memory (LSTM) network directly followed by a fully connected layer, and (3) a two-layer Gated Recurrent Unit (GRU) network directly followed by a fully connected layer. Both the LSTM and GRU classifiers use two recurrent layers, since, according to Graves et al. [21], depth is needed to increase the receptive field of an RNN. Both the LSTM and GRU are trained with a hidden state size of 128. Since relevant literature also describes methods that exploit temporal information based on single-image features, we additionally apply a simple averaging filter over five frames to the FC classifier output to smooth it over the temporal domain. This classifier is denoted as FC Avg(n = 5). A sketch of the classifier heads is given below.
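
To make the classifier configurations concrete, the following is a minimal PyTorch sketch of the three heads (the paper does not name its framework, so PyTorch is an assumption). Only the two recurrent layers and the hidden state size of 128 are taken from the text above; the 512-dimensional ResNet18 feature size, the number of classes, the FC hidden width, and classifying from the final time step are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fc_head(feat_dim=512, num_classes=2, hidden=128):
    """(1) Two fully connected layers with a ReLU in between.
    feat_dim, num_classes, and hidden width are assumptions."""
    return nn.Sequential(
        nn.Linear(feat_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, num_classes),
    )

class RecurrentHead(nn.Module):
    """(2)/(3) Two-layer LSTM or GRU followed by a fully connected layer."""
    def __init__(self, feat_dim=512, num_classes=2, cell="lstm", hidden=128):
        super().__init__()
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(feat_dim, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, feats):           # feats: (batch, time, feat_dim)
        out, _ = self.rnn(feats)
        # Classifying from the final time step is an assumption; the paper
        # does not state how the sequence output is reduced.
        return self.fc(out[:, -1])
```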

We introduce random under-sampling to stratify the training set, taking the class imbalance of the dataset into account. Each video frame is assigned a weight such that the probability of sampling each class and each case is equal. Per iteration, we randomly sample 512 sequences consisting of either a single frame or 10 consecutive frames, to facilitate individual or temporal classification, respectively. The flow of data through our proposed model for temporal classification is illustrated in Figure 2.
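
One possible realization of this weighted sampling, using PyTorch's `WeightedRandomSampler`, is sketched below. The helper name and the (class, case) bookkeeping are hypothetical, and for 10-frame sequences the weights would be attached to sequence start indices rather than to individual frames.

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def make_stratified_sampler(frame_classes, frame_cases, num_samples=512):
    # Weight each frame by the inverse frequency of its (class, case) pair,
    # so that every class and every case is drawn with equal probability.
    pairs = list(zip(frame_classes, frame_cases))
    counts = Counter(pairs)
    weights = [1.0 / counts[p] for p in pairs]
    # Draw 512 samples per iteration, with replacement, which under-samples
    # the over-represented classes in expectation.
    return WeightedRandomSampler(weights, num_samples=num_samples,
                                 replacement=True)
```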

**Figure 2.** Data flow through the proposed model during training. Recurrent Neural Network (RNN) represents either Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) nodes.

The network is trained using Adam optimization with an initial learning rate of 10⁻⁴. A cyclic cosine learning-rate scheduler further controls the learning rate during training. A cross-entropy loss is used to optimize the network. For data augmentation, random affine transformations are applied during training: rotations of up to 5 degrees, translation and cropping of up to 2.5% of the image length, and shearing of up to 5 degrees. These parameters are kept constant within each sequence.
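This optimization setup can be reproduced with standard PyTorch components; a minimal sketch follows. `CosineAnnealingWarmRestarts` is one common realization of a cyclic cosine schedule, and the restart period `T_0`, the epoch count, and the `model` and `train_loader` names are assumptions.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

optimizer = Adam(model.parameters(), lr=1e-4)               # initial LR 10^-4
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)  # T_0 is assumed
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):                 # num_epochs is assumed
    for frames, labels in train_loader:         # hypothetical DataLoader
        optimizer.zero_grad()
        loss = criterion(model(frames), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```

For the per-sequence augmentation, one way to keep the affine parameters constant across a sequence is to sample them once with torchvision and apply the same transform to every frame; the cropping step is omitted here for brevity.

```python
import torch
import torchvision.transforms as T
import torchvision.transforms.functional as TF

affine = T.RandomAffine(degrees=5, translate=(0.025, 0.025), shear=5)

def augment_sequence(frames):  # frames: (time, C, H, W) tensor
    # Sample the affine parameters once, then reuse them for every frame
    # so they stay constant within the sequence.
    params = affine.get_params(affine.degrees, affine.translate,
                               affine.scale, affine.shear,
                               [frames.shape[-1], frames.shape[-2]])  # (W, H)
    return torch.stack([TF.affine(f, *params) for f in frames])
```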
