*2.4. Architecture of the Deep Network*

The deep learning model developed to classify the XC-skiing techniques is motivated by DeepSense, proposed by Yao et al. [22], which in the authors' words "provides a general signal estimation and classification framework [for regression and classification problems] that accommodate a wide range of applications." The training, validation, and testing data of our problem are arranged into 3D tensors, where each matrix corresponds to one training (or testing) example and has 333 rows and 51 columns. Each column contains the data recorded along one axis of one sensor and represents a time series with 333 time steps. Thus, the classification problem is posed as a multivariate time-series classification task. To capture the interactions among these time series, they are passed through convolutional layers. DeepSense first applies the Fourier transform to the raw data of each sensor, passes the frequency-domain data of each sensor through convolutional layers individually, and then combines the data of all the sensors and passes them through another convolutional layer. This approach requires a larger number of convolutional layers and requires selecting the number of frequencies to pass to the network, which introduces an element of human decision-making. We avoid these issues with two simple modifications: (i) we pass the raw sensor data to the convolutional layers instead of the frequency data, and (ii) we convolve the raw data of all the sensors in a single step, which reduces the total number of convolutional layers required for training.
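The tensor layout described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' preprocessing code; the number of examples `N` is hypothetical, and only the shapes (333 time steps × 51 sensor-axis columns per example) come from the text.

```python
import numpy as np

N = 5  # hypothetical number of training examples (for illustration only)

# 3D tensor: one 333 x 51 matrix per example.
# Rows are time steps; each column is the time series of one sensor axis.
X = np.random.randn(N, 333, 51)

example = X[0]                 # one training example: a 333 x 51 matrix
assert example.shape == (333, 51)

single_axis = example[:, 0]    # one column: a 333-step univariate series
assert single_axis.shape == (333,)
```

Because all 51 columns of an example are fed to the convolutional layers together, a 1D convolution over the time axis mixes information across sensors in a single step, which is modification (ii) above.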

As the skiers may perform the same skiing technique at varying speeds and intensities, the deep network layers must be robust to the scale of the data and able to capture features that may occur at different time steps in the time series. CNNs are very effective at extracting local coherence and dependencies in the data, and the invariance introduced by the max-pooling layers allows them to learn hidden features regardless of the position or scale of the feature. We pass the raw data through two 1D convolutional layers with 64 filters each and kernel sizes of 20 and 10, respectively, followed by a max-pooling layer with a pool size of 4. The convolved data are then passed through two long short-term memory (LSTM) layers, with 300 and 200 units, respectively, to capture long-term temporal dependencies in the time series. As LSTMs are highly prone to overfitting, a dropout layer with a dropout probability of 0.2 is added after each LSTM layer. The network is trained for 12 epochs with a batch size of 40. The data-preprocessing steps and the architecture of the deep network are summarized in Figure 3.
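The layer stack described above can be sketched with `tf.keras`. This is a sketch under stated assumptions, not the authors' implementation: the text does not specify activations, padding, the optimizer, or the number of output classes, so ReLU activations, `valid` padding, Adam, and a hypothetical `NUM_CLASSES = 8` are assumed here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8  # hypothetical number of skiing techniques (not given in the text)

def build_model():
    model = models.Sequential([
        layers.Input(shape=(333, 51)),            # 333 time steps x 51 sensor axes
        layers.Conv1D(64, kernel_size=20, activation="relu"),
        layers.Conv1D(64, kernel_size=10, activation="relu"),
        layers.MaxPooling1D(pool_size=4),         # pooling for positional robustness
        layers.LSTM(300, return_sequences=True),  # long-term temporal dependencies
        layers.Dropout(0.2),                      # dropout after each LSTM layer
        layers.LSTM(200),
        layers.Dropout(0.2),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam",               # assumed; not stated in the text
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Training would then follow the schedule given in the text, e.g. `model.fit(X_train, y_train, epochs=12, batch_size=40)`.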

**Figure 3.** (**Left**) A schematic view of the steps involved in preprocessing of the raw data, and (**Right**) the architecture of the deep neural network.
