2.3.3. CNN

The structure of CNNs is inspired by the biological structure of the brain: both consist of repeating layers of simple and complex cells and can solve segmentation, detection, and localization tasks [36]. The first CNNs were presented in the late 1980s, e.g., by [45] for the recognition of handwritten digits. Nowadays, they are the leading model for image classification, detection, and recognition tasks [36]. Each convolutional layer of a CNN extracts features and local conjunctions from the output of the previous layer using weighted neurons. For this, kernels of a fixed size pass over the input feature map, and the resulting filter responses are forwarded to a nonlinear activation function, e.g., rectified linear units (ReLU) [46]. There are two commonly applied techniques to simplify and aggregate the outputs of a convolutional layer. The first is to insert pooling layers, which merge features within a pooling kernel (e.g., by taking the maximum or average value) to reduce the spatial resolution and decorrelate the features [47]. The second is the use of strides instead of pooling. Strides describe the step size of the kernel, and increasing them reduces the spatial resolution. They are useful when input sizes are small [48] and are also utilized in more complex architectures such as ResNet to achieve higher accuracy and increase the training and classification speed [49]. Several convolutional layers in series can derive abstract features of the input. Fully connected layers of neurons and weights, as in standard neural networks, are attached to these layers to interpret the abstract features. For classification problems, in general, a softmax function is used as the activation function in the last fully connected layer [46].
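
The two downsampling techniques can be contrasted in a minimal *Keras* sketch. The snippet is illustrative only; the input shape, filter counts, and kernel sizes are arbitrary placeholders and not those of the model used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Illustrative shapes and layer sizes only (e.g., 28x28 grayscale input).
inputs = tf.keras.Input(shape=(28, 28, 1))

# A 3x3 kernel passes over the input; the filter responses are forwarded
# to a nonlinear activation function (ReLU). A pooling layer then merges
# features (here by maximum) to reduce the spatial resolution.
x = layers.Conv2D(16, kernel_size=3, activation="relu")(inputs)
x = layers.MaxPooling2D(pool_size=2)(x)

# Alternatively, a stride of 2 reduces the spatial resolution within the
# convolution itself, without a separate pooling layer.
x = layers.Conv2D(32, kernel_size=3, strides=2, activation="relu")(x)

# Fully connected layers interpret the abstract features; a softmax in
# the last layer yields class probabilities.
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = tf.keras.Model(inputs, outputs)
```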

The CNN applied in this study was created with *TensorFlow's Keras* Python API (version 2.3.1). Its structure is shown in Table 2. Two convolutional layers (the first with 32 filters, the second with 128 filters) and two fully connected layers (the first of size 64, the second of size *n*, the number of output classes) were implemented. A softmax activation function in combination with a cross-entropy loss function (also known as categorical cross-entropy loss [50]) was used in the last layer to output a probability for each class. The model utilizes *Adam* as the optimizer because it has shown good results for CNNs [51]. Strides are applied within the convolutional layers to aggregate the features. A ReLU activation function is used for the two convolutional layers and the first dense layer. The performance of the CNN is improved via batch normalization [52]. To reduce overfitting and improve generalization, L2 kernel regularization and dropout are applied as regularization methods [22,53].
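
The described architecture can be summarized in the following *Keras* sketch. It is a minimal reconstruction from the description above, not the exact implementation: the input shape, kernel sizes, stride values, dropout rate, L2 factor, and the placement of batch normalization before the ReLU are assumptions; the precise layer configuration is listed in Table 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

n_classes = 10              # n, the number of output classes (placeholder)
l2 = regularizers.l2(0.01)  # assumed L2 regularization factor

model = tf.keras.Sequential([
    # Two convolutional layers (32 and 128 filters); strides aggregate
    # the features instead of pooling. Kernel size, stride, and input
    # shape are assumptions.
    layers.Conv2D(32, kernel_size=3, strides=2, kernel_regularizer=l2,
                  input_shape=(64, 64, 1)),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Conv2D(128, kernel_size=3, strides=2, kernel_regularizer=l2),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Flatten(),
    # First fully connected layer of size 64 with ReLU; the dropout rate
    # is an assumption.
    layers.Dense(64, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),
    # Output layer of size n with softmax to obtain class probabilities.
    layers.Dense(n_classes, activation="softmax"),
])

# Adam optimizer with (categorical) cross-entropy loss, as described.
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```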
