2.2.5. Deep Learning

After outperforming conventional methods in various other fields such as speech recognition, visual object recognition, and object detection [34], deep learning became increasingly popular for IMU data processing. Hannink et al. [16] introduced a deep convolutional regression network for calculating stride length from raw IMU data in geriatric patients. The network learned a model for stride length regression directly from raw IMU data, without requiring any domain knowledge. In this work, we used an adapted architecture for stride length computation in running gait, which is depicted in Figure 6. It consisted of two convolutional layers, two max pooling layers, one flattening layer, and two fully-connected layers. For the implementation of the architecture, we used Keras [35] with a TensorFlow backend [36].

**Figure 6.** Architecture of the convolutional neural network for stride length regression based on the raw 6D-IMU signal. For the first convolutional layer, we used *N*<sub>1</sub> = 32 filter kernels of kernel length *K*<sub>1</sub> = 30. The second convolutional layer consisted of *N*<sub>2</sub> = 16 filter kernels of kernel length *K*<sub>2</sub> = 15. The first fully-connected layer had *M*<sub>1</sub> = 128 outputs that served as input to the second fully-connected layer, which had only *M*<sub>2</sub> = 1 output. This output represented the computed stride length.

Before feeding data into the network, the segmented 6D-IMU data of a single stride were zero-padded to 200 samples to ensure a constant number of input samples to the network. A convolutional layer consisted of *N* convolution filters. The *N* outputs *O*<sup>(*j*)</sup> of a convolutional layer with *j* = 1 ... *N* are called feature maps. They were computed by convolving the six IMU input channels *x*<sub>*c*</sub> with *c* = 1 ... 6 with the filter kernels *φ*<sub>*c*</sub><sup>(*j*)</sup> of length *K*, adding biases *b*<sub>*c*</sub><sup>(*j*)</sup>, and finally applying a ReLU activation function:

$$O^{(j)} = \text{ReLU}\left(\sum_{c=1}^{6} \left(\varphi_c^{(j)} * x_c + b_c^{(j)}\right)\right) \tag{9}$$

This formula was applied for all *j* = 1 ... *N* filters to produce *N* feature maps *O*<sup>(*j*)</sup> after each convolutional layer. Thus, the two tunable parameters of a convolutional layer are the kernel length *K* and the number of filters *N*. In the first convolutional layer, the kernel length was *K*<sub>1</sub> = 30 and the number of filters *N*<sub>1</sub> = 32. In the second convolutional layer, the kernel length was *K*<sub>2</sub> = 15 and the number of filters *N*<sub>2</sub> = 16. After each convolutional layer, the resulting feature maps were fed into a max pooling layer, which downsampled them by a factor of two by taking the maximum in non-overlapping windows of size two.
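The convolution and pooling steps above can be sketched as follows. This is a minimal NumPy illustration of Equation (9) and the subsequent max pooling for a single feature map; the `same` padding mode, the random toy input, and all variable names are assumptions for illustration only, not the trained network.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv_feature_map(x, kernels, biases):
    """One feature map O^(j) per Eq. (9): sum over the six input
    channels of (kernel_c * x_c + b_c), then ReLU.
    x: (6, T) multi-channel signal; kernels: (6, K); biases: (6,)."""
    acc = np.zeros(x.shape[1])
    for c in range(6):
        # convolve channel c with its kernel ('same' padding assumed)
        acc += np.convolve(x[c], kernels[c], mode="same") + biases[c]
    return relu(acc)


def max_pool(feature_map, factor=2):
    """Downsample by taking the maximum in non-overlapping windows."""
    n = len(feature_map) // factor * factor
    return feature_map[:n].reshape(-1, factor).max(axis=1)


# hypothetical toy input: 6 channels, 200 samples (one zero-padded stride)
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 200))
kernels = rng.normal(size=(6, 30))  # K1 = 30
biases = np.zeros(6)

fmap = conv_feature_map(x, kernels, biases)
pooled = max_pool(fmap)
print(fmap.shape, pooled.shape)  # (200,) (100,)
```

Repeating this for *j* = 1 ... *N*<sub>1</sub> = 32 kernels yields the 32 feature maps of the first layer.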

After the second max pooling layer, the feature maps were flattened to produce a one-dimensional feature list that can be fed into the fully-connected layers. Thus, the flattening layer concatenated the *N*<sub>2</sub> feature maps output by the second max pooling layer into one feature vector. The two fully-connected layers at the end of the architecture computed a weighted sum of all *k* = 1 ... *N*<sub>*f*</sub> input features *φ*<sub>*k*</sub> of this one-dimensional feature vector with weights *w*<sub>*k*,*j*</sub> and added biases *b*<sub>*k*,*j*</sub>. A ReLU function again activated the positive features:

$$F_j = \text{ReLU}\left(\sum_{k=1}^{N_f} \left(w_{k,j} \cdot \varphi_k + b_{k,j}\right)\right) \tag{10}$$

The outputs of the fully-connected layers were feature lists *F*<sub>*j*</sub> with *j* = 1 ... *M*, where *M* describes the number of output features. In our architecture, the first fully-connected layer had *M*<sub>1</sub> = 128 output features. The second fully-connected layer had only *M*<sub>2</sub> = 1 output feature, which was the resulting target value. In our implementation, the regressed target value was the stride length.
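The fully-connected stage of Equation (10) can be sketched in the same way. The sketch below chains two dense layers with the stated output sizes *M*<sub>1</sub> = 128 and *M*<sub>2</sub> = 1; the flattened feature count (800), the random weights, and the variable names are placeholders, not values from the trained network.

```python
import numpy as np


def dense_relu(features, weights, biases):
    """Fully-connected layer per Eq. (10): weighted sum of the N_f
    input features plus bias, followed by ReLU.
    features: (N_f,); weights: (N_f, M); biases: (M,)."""
    return np.maximum(features @ weights + biases, 0.0)


# hypothetical sizes: N_f = 800 flattened features -> M1 = 128 -> M2 = 1
rng = np.random.default_rng(1)
phi = rng.normal(size=(800,))                  # flattened feature vector
W1, b1 = 0.01 * rng.normal(size=(800, 128)), np.zeros(128)
W2, b2 = 0.01 * rng.normal(size=(128, 1)), np.zeros(1)

hidden = dense_relu(phi, W1, b1)               # M1 = 128 outputs
stride_length = dense_relu(hidden, W2, b2)     # M2 = 1 output
print(hidden.shape, stride_length.shape)       # (128,) (1,)
```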

To prevent overfitting, we also added a dropout layer to our network [37]. The dropout layer was stacked between the two fully-connected layers and dropped 30% of the neurons. During training, we fed the data into the network for five epochs with a batch size of 16. We trained the network both for the stride length and for the velocity. The network with the stride length as the output outperformed the velocity approach and was therefore used for the evaluation in this publication. Thus, the velocity *v*<sub>stride</sub> for the *Deep Learning* approach was computed by dividing the stride length *d*<sub>stride</sub> obtained from the neural network by the stride time *t*<sub>stride</sub> obtained from the stride segmentation.
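Putting the layers of Figure 6 together, the architecture can be sketched in Keras as below. This is only an illustrative reconstruction under assumptions the text does not specify (valid padding, the Adam optimizer, and a mean-squared-error loss); the actual training setup of this work may differ.

```python
# Sketch of the Figure 6 architecture in Keras (assumptions noted above).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(200, 6)),              # zero-padded 6D-IMU stride
    layers.Conv1D(32, 30, activation="relu"),  # N1 = 32 kernels, K1 = 30
    layers.MaxPooling1D(2),                    # downsampling factor 2
    layers.Conv1D(16, 15, activation="relu"),  # N2 = 16 kernels, K2 = 15
    layers.MaxPooling1D(2),
    layers.Flatten(),                          # one-dimensional feature list
    layers.Dense(128, activation="relu"),      # M1 = 128 outputs
    layers.Dropout(0.3),                       # drop 30% between dense layers
    layers.Dense(1, activation="relu"),        # M2 = 1: stride length
])
model.compile(optimizer="adam", loss="mse")    # optimizer and loss assumed
# training as described in the text:
# model.fit(X_train, y_train, epochs=5, batch_size=16)
```

The final velocity then follows as `v_stride = model.predict(x)[0, 0] / t_stride` for a segmented stride of duration `t_stride`.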
