**Algorithm 1: Voting-Based DTW (VB-DTW) valid segment extraction algorithm**

**Input:** Segments of original data S_i, i = 1, 2, ..., N, where N = 40; each time sequence S_i = S_i^1, S_i^2, ..., S_i^M, where M = 50.
**Initialization:** D_v = {}, S_v = {}, voting set = {D_{j−2}, D_{j−1}, D_j, D_{j+1}, D_{j+2}}

```
 1: for i ← 1 to N − 1 do
 2:     D_i ← DTW(S_i, S_{i+1})
 3: end for
 4: threshold ← (1 / (N − 1)) · Σ_{i=1}^{N−1} D_i
 5: for j ← 3 to N − 2 do
 6:     count ← 0
 7:     for each D_value in voting set do
 8:         if D_value > threshold then
 9:             count ← count + 1
10:     if count ≥ 3 then
11:         D_v ← D_v ∪ {j}
12: end for
13: for each j in {1, 2, N − 1, N} do
14:     if D_j > threshold then
15:         D_v ← D_v ∪ {j}
16: end for
17: S_v ← S_v ∪ {S_{D_v^k}, S_{D_v^k + 1}}, k = 1, 2, ..., length(D_v)
```

**Output:** DTW values of the time series D_j, j = 1, 2, ..., N − 1; valid segment slices S_v, S_v ⊆ {S_i}.
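The paper gives only pseudocode; the following is a minimal Python sketch of the same procedure, using 0-based indices and a textbook dynamic-time-warping distance. The function names `dtw` and `vb_dtw` are ours, not the paper's.

```python
def dtw(a, b):
    """Classic dynamic-time-warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def vb_dtw(segments):
    """Voting-based valid-segment extraction (0-based sketch of Algorithm 1)."""
    # Steps 1-3: DTW distance between every pair of adjacent segments.
    D = [dtw(segments[i], segments[i + 1]) for i in range(len(segments) - 1)]
    # Step 4: the threshold is the mean adjacent-pair distance.
    threshold = sum(D) / len(D)
    flagged = set()
    # Steps 5-12: interior indices vote over the 5-value window around D[i].
    for i in range(2, len(D) - 2):
        if sum(1 for d in D[i - 2:i + 3] if d > threshold) >= 3:
            flagged.add(i)
    # Steps 13-16: boundary indices lack a full window, so compare directly.
    for i in (0, 1, len(D) - 2, len(D) - 1):
        if D[i] > threshold:
            flagged.add(i)
    # Step 17: collect the segment pairs around each flagged index.
    Sv = [(segments[k], segments[k + 1]) for k in sorted(flagged)]
    return D, sorted(flagged), Sv
```

On a toy input where only the first adjacent pair differs, `vb_dtw` flags index 0 via the boundary check and returns that segment pair as the valid slice.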

#### 2.4.2. Data Augmentation

Data augmentation helps resolve the overfitting caused by insufficient data sets during model training. In contrast to data augmentation for image data, time series data augmentation faces several formidable obstacles: (1) the fundamental features of time series sequences are underutilized; (2) different tasks require distinct augmentation techniques; and (3) sample categories may be imbalanced.

Traditional time series data augmentation methods can be subdivided into time domain-based enhancement, which transforms the original data or injects noise; frequency domain-based enhancement, which converts data from the time domain to the frequency domain before applying enhancement algorithms; and combined time and frequency domain analysis. To prevent model overfitting caused by insufficient data, to strengthen the model's robustness, and to generate a large number of data samples, we use window slicing as the data enhancement technique. Window slicing separates the original data of length n into n − s + 1 slices with the same label as the raw segment, where s is the new slice length. During training, each slice is fed to the network independently as a training instance for prediction. During testing, the separated slices are likewise submitted to the network, and a majority vote determines the original segment's label. In this model, we select a slice length of 256, which corresponds to approximately 90% of the original length of 285. Figure 6 depicts the data augmentation method, which divides the down-sampled valid motion sequence into 30 new slices.
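As a concrete illustration, window slicing and the majority vote over slice predictions can be sketched in a few lines of Python (the function names are ours, not the paper's):

```python
from collections import Counter

def window_slices(seq, s=256):
    """Split a length-n sequence into n - s + 1 overlapping slices of length s."""
    return [seq[i:i + s] for i in range(len(seq) - s + 1)]

def majority_vote(slice_predictions):
    """Label the original segment with the most common slice prediction."""
    return Counter(slice_predictions).most_common(1)[0][0]

# A 285-sample motion sequence yields 285 - 256 + 1 = 30 slices.
slices = window_slices(list(range(285)), s=256)
```

Each of the 30 slices inherits the label of the original 285-sample segment, matching the augmentation factor reported in the text.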

**Figure 6.** The detailed data augmentation procedure. The green/orange/blue lines represent the sensor data for a motion. The sample window size is 256, and the stride of the window is 1. After augmentation, we obtain 30 identically labeled slices for each 285-length motion sequence.

#### 2.4.3. Deep Learning-Based Classification Algorithm

We explored two categories of methods: Recurrent Neural Network (RNN)-based and Convolutional Neural Network (CNN)-based. The RNN-based methods represent data attributes based on temporal properties; the specific algorithms chosen are the Long Short-Term Memory network (LSTM) [36] and the Bidirectional Long Short-Term Memory network (BiLSTM) [37]. The CNN-based methods extract features by performing convolution on the data, focusing on its spatial characteristics; the chosen representative is the 1DCNN [38]. A basic deep neural network (DNN) serves as a simple benchmark for evaluating the algorithms from these two categories. We compare these four models because this paper aims to identify which classical, advanced, and effective temporal-data models perform best at perceiving and identifying students' classroom behavior, and these four classical models let us present our model's results against the alternatives.

#### (1) LSTM and BiLSTM

Recurrent neural networks (RNNs) are uniquely valuable compared to other neural networks for processing interdependent sequential data, as in text analysis, speech recognition, and machine translation. They are also widely used in sensor-based motion recognition because they recurse in the direction of sequence evolution, with all recurrent units linked in a chain [39].

However, the conventional RNN has a short-term memory problem: it cannot memorize and process longer sequence information, because the layers in the early recursive stages stop learning due to the vanishing or exploding gradient problems caused by backpropagation. To address the problem that later inputs influence the RNN more than earlier inputs, Hochreiter and Schmidhuber proposed the Long Short-Term Memory network (LSTM) in 1997, which overcame the RNN's limitations in processing long sequence data and can learn long-term dependencies in sequence data features. The LSTM introduced an internal mechanism of 'gates' to regulate the flow of feature information: input gates control the reading of data into the unit, output gates control the unit's output entries, and forget gates reset the contents of the unit. The specific LSTM structure is shown in Figure 7; a new vector C, representing the cell state, is added in the LSTM.
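For reference, the standard LSTM update equations corresponding to this gate mechanism are given below, following the common Hochreiter and Schmidhuber formulation (the candidate-state weights W_c and the bias terms b are not labeled in the figure):

$$
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where σ is the sigmoid function and ⊙ denotes element-wise multiplication.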

**Figure 7.** LSTM structure. W<sub>f</sub> is the forget gate, W<sub>i</sub> is the input gate, W<sub>o</sub> is the output gate, x<sub>t</sub> is the input data, h<sub>t−1</sub> is the hidden-state neural node, and W<sub>f</sub> is used to calculate the features in c<sub>t−1</sub> to obtain c<sub>t</sub>.

Both the traditional RNN and the LSTM can only predict the output of the next moment from the information of previous moments, while in practical applications the information of subsequent moments may also significantly influence the output at the current moment. Bi-directional LSTM (Bi-LSTM) combines two traditional LSTM models, using one for forward input and the other for reverse input, to fuse information from the previous and subsequent moments during inference. Its structure is shown in Figure 8.

**Figure 8.** Bi-LSTM structure, which combines forward LSTM and backward LSTM.
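The forward/backward fusion can be illustrated with a toy recurrent pass in Python. The scalar update below stands in for a full LSTM cell; all names and coefficients are illustrative, not from the paper.

```python
def recurrent_pass(seq, decay=0.9, w=0.5):
    """Toy recurrent layer: each hidden state mixes the previous state and input."""
    h, states = 0.0, []
    for x in seq:
        h = decay * h + w * x
        states.append(h)
    return states

def bidirectional(seq):
    """Run the sequence forward and backward, then fuse per time step."""
    fwd = recurrent_pass(seq)
    bwd = recurrent_pass(seq[::-1])[::-1]  # re-align to the original order
    return list(zip(fwd, bwd))             # fused (forward, backward) states
```

Each output time step now carries information from both earlier and later inputs, which is exactly what the unidirectional pass cannot provide.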

#### (2) 1DCNN

One-dimensional convolutional neural networks (1DCNN) have strong advantages for sequence data because of their powerful ability to extract features from fixed-length segments of 1-dimensional signals. Moreover, the adaptive 1DCNN performs only linear 1D convolutions (scalar multiplications and additions), opening the possibility of real-time, low-cost intelligent control in hardware [40]. The basic structure of the 1DCNN is shown in Figure 9. The kernel moves along the time axis of the sequence data to complete the feature extraction of the original data.

**Figure 9.** 1DCNN structure. The structure mainly comprises the input, hidden layers, and output, which together accomplish feature extraction.
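In its simplest form, the sliding-kernel operation described above reduces to the following Python sketch (a single channel, no padding; the function name is ours):

```python
def conv1d(signal, kernel, stride=1):
    """Slide the kernel along the time axis, taking a dot product at each step."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(0, len(signal) - k + 1, stride)]

# A [1, 0, -1] kernel responds to local differences (an edge detector in time).
result = conv1d([1, 2, 3, 4], [1, 0, -1])  # -> [-2, -2]
```

In a trained 1DCNN the kernel weights are learned rather than fixed, and many kernels run in parallel to produce multiple feature channels.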

In conclusion, the algorithm uses VB-DTW to extract valid segments, and window slicing then augments the data, yielding a 30-fold increase in the dataset. For classification, we employ two categories of networks: the LSTM and Bi-LSTM networks for the RNN-based method, and the 1DCNN for the CNN-based method. We assess the abilities and contributions of these two types of networks to perceiving and identifying students' classroom behavior.

#### 2.4.4. Evaluation Metrics


#### (1) Valid Segments Extraction

To demonstrate the accuracy of the valid segments obtained by the VB-DTW algorithm, we hand-labeled the indices of all valid motion segments as the benchmark. We measure the similarity between the indices of the extracted data slices (denoted A) and the benchmark (denoted B) using the Jaccard index, which quantifies the similarity between finite sample sets and is defined as the size of the sample intersection divided by the size of the sample union. The equation is:

$$J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \tag{2}$$
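Equation (2) translates directly into Python over sets of segment indices (the function name is ours):

```python
def jaccard(a, b):
    """Jaccard index between two collections of segment indices."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Extracted indices vs. a hand-labeled benchmark:
score = jaccard([1, 2, 3], [2, 3, 4])  # |{2, 3}| / |{1, 2, 3, 4}| = 0.5
```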

#### (2) Motion Identification

To verify the classification performance of the model, we use the accuracy rate, that is, the proportion of correctly classified samples (denoted a) among the total number of samples of that type (denoted m), expressed by the following formula:

$$accuracy = \frac{a}{m} \tag{3}$$
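Equation (3) in code, applied to predicted versus true labels (names are ours):

```python
def accuracy(predictions, labels):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

score = accuracy(["sit", "stand", "sit"], ["sit", "stand", "stand"])  # 2 of 3
```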

#### **3. Results**

In summary, based on the need to understand the classroom behaviors of school children in educational scenarios, sensor-based devices provide an effective way to identify classroom behaviors intelligently. Therefore, this paper proposes the VB-DTW algorithm based on wearable sensors combined with artificial intelligence technology to achieve intelligent recognition of school children's classroom behaviors. Based on the recognition results, it is possible to provide immediate feedback on students' classroom performance and help them improve their learning performance while providing an essential reference basis and data support for constructing an intelligent digital education platform.
