#### *2.3. Experiment Data Introduction*

SCB-13: The self-built dataset SCB-13 of this paper comprises the classroom behavior sensor data of the 13 participants described above. The dataset is used for the subsequent data analysis and model accuracy testing. In addition, we give a brief, intuitive description of the experiment data collected by the back sensor.


#### 2.3.1. Multiple Channel Data Display

After separating the gathered data by the 14 given motion patterns, the 65 sets of data for the same motion were averaged to eliminate individual motion differences. Figure 2 displays the processed 6-channel data for motion 6 (raising hand while standing up), selected as a sample motion. The data demonstrate that motion occurrence and the stable state can be observed during the valid motion duration and the sitting-still time, respectively. Notably, when a student stands up and raises a hand, the *Z*-axis data of the accelerometer change the most, which is consistent with the actual situation and confirms the viability of using sensors for behavior identification.

**Figure 2.** Data of each channel of the back sensor, taking action 6 as an example. The lines from bottom to top represent the accelerometer *x*-axis (acc\_x), *y*-axis (acc\_y), *z*-axis (acc\_z), and gyroscope *x*-axis (ypr\_x), *y*-axis (ypr\_y), and *z*-axis (ypr\_z). Valid segments of motions are shown within dashed lines.
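The per-motion averaging described above can be sketched as follows; this is a minimal illustration with placeholder random data, assuming the 65 repetitions of one motion are stacked into an array of shape (repetitions, samples, channels):

```python
import numpy as np

# Hypothetical shapes: 65 repetitions of one motion, 2000 samples each,
# 6 channels (acc_x, acc_y, acc_z, ypr_x, ypr_y, ypr_z).
rng = np.random.default_rng(0)
trials = rng.normal(size=(65, 2000, 6))  # placeholder for real sensor data

# Average over the 65 repetitions to suppress individual motion differences.
mean_motion = trials.mean(axis=0)
print(mean_motion.shape)  # (2000, 6)
```

Averaging over the repetition axis keeps the per-channel time course while smoothing out participant-specific variation, which is what Figure 2 visualizes per channel.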

#### 2.3.2. Display of Different Motions of the Same Participant

A participant was selected randomly, and his/her 4 common classroom behaviors (motion 1, sitting still; motion 5, turning around and looking around; motion 6, raising hand while standing up; and motion 8, standing up and sitting down) are displayed in the accelerometer (acc) channel and the gyroscope (ypr) channel, as shown in Figure 3. There are observable differences in the data between different motions of the same volunteer. Nonetheless, the data patterns of motions 6 and 8 are comparable to some extent, which poses a challenge for classification.

**Figure 3.** Four randomly selected actions of one participant, with the accelerometer and gyroscope data of the back sensor. The selected actions are as follows: motion 1 (sitting still), motion 5 (turning around and looking around), motion 6 (raising hand while standing up), and motion 8 (standing up and sitting down). We uniformly downsampled the data to a length of 200 for display clarity. Through motion sensors, we can continuously collect data about different motions, and each motion has a unique motion pattern. The relative intensity of each action is reflected in the ordinate after normalization.
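The uniform downsampling to length 200 used for the figures can be sketched as below; this is a minimal sketch using linear interpolation, and the function name `downsample` and the interpolation choice are illustrative assumptions (the paper does not specify the resampling method):

```python
import numpy as np

def downsample(signal: np.ndarray, target_len: int = 200) -> np.ndarray:
    """Uniformly resample a (length, channels) signal to target_len samples
    by linear interpolation on a normalized time axis."""
    length, channels = signal.shape
    old_x = np.linspace(0.0, 1.0, length)
    new_x = np.linspace(0.0, 1.0, target_len)
    return np.stack(
        [np.interp(new_x, old_x, signal[:, c]) for c in range(channels)],
        axis=1,
    )

# Example: a 2000-sample, 6-channel sequence reduced to 200 samples.
seq = np.sin(np.linspace(0, 10, 2000))[:, None].repeat(6, axis=1)
out = downsample(seq)
print(out.shape)  # (200, 6)
```

The same helper would also cover the later downsampling of valid segments to a fixed length, since only `target_len` changes.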


#### 2.3.3. Display of Different Participants with the Same Motion

Figure 4 shows the accelerometer and gyroscope data of motion 6 (raising hand while standing up) collected from 4 randomly picked participants, in order to display the differences in motion between participants.

**Figure 4.** Data of four randomly selected volunteers (id1, id2, id3, id4) performing the same action. We uniformly downsampled the data to a length of 200 for display clarity. Classifying classroom behavior is challenging since each participant carries out the same action in different ways and has a unique sensor data pattern.

The preceding diagram demonstrates that various participants have distinct motion pattern characteristics, even for identical motions. This may be caused by variances in personal posture and habitual behaviors. It requires the established model to have robust generalization performance, capable of identifying the distinctions between the characteristics of various motion patterns while allowing for modest variations within the same motion. A comparison of the accelerometer and gyroscope data shows that the gyroscope data have more complex properties and fewer noise points, making them more suitable for the learning and inference of the neural network. Before generating the network's standard input, it is necessary to address the extraction and separation of valid data segments, since the same motion of different participants occurs at different times and lasts for varying durations. Taking into account the temporal features of the data, we attempt to extract valid segments from the entire motion time, as demonstrated in detail in the identification algorithm.


#### *2.4. Identification Algorithm*

Overall, the algorithm is divided into 3 stages: the extraction of valid segments based on the Dynamic Time Warping algorithm, data augmentation, and a deep learning-based classification algorithm. The whole process of the algorithm is shown in Figure 5. For the classification algorithm, we picked the most typical Deep Neural Networks (DNNs) as the classification benchmark and investigated the classification accuracy of the RNN-based and CNN-based methods to explore the impact of different algorithms on the precise perception and identification of classroom behaviors.

**Figure 5.** The framework of the whole process of the algorithm. The algorithm takes raw data as the input and outputs the most likely behavior from the 14 common classroom behaviors.

#### 2.4.1. Voting-Based DTW (VB-DTW) Valid Segment Extraction Algorithm

Initially, we normalize the collected data to eliminate large differences in data values, which can hinder the convergence of the model. We scale the features contained in each channel to the interval [0, 1] using the maximum and minimum values, without affecting the numerical distribution:

$$X = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
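Equation (1) applied per channel can be sketched as follows; this is a minimal sketch assuming the data are arranged as (samples, channels), and the function name `min_max_normalize` is illustrative:

```python
import numpy as np

def min_max_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each channel (column) to [0, 1] as in Equation (1).
    Note: a constant channel (max == min) would divide by zero and
    needs special handling in practice."""
    x_min = x.min(axis=0, keepdims=True)
    x_max = x.max(axis=0, keepdims=True)
    return (x - x_min) / (x_max - x_min)

data = np.array([[1.0, 10.0],
                 [2.0, 30.0],
                 [3.0, 20.0]])
norm = min_max_normalize(data)
print(norm.min(axis=0), norm.max(axis=0))  # [0. 0.] [1. 1.]
```

Because the transform is monotone and affine per channel, the relative numerical distribution within each channel is preserved, as the text states.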

Developing a distinctive and suitable method for feature representation is necessary in order to assess whether motions can be accurately distinguished from the continuous and substantial stream of sensor data. The classification accuracy is determined by the algorithm's capacity to accurately extract the features of each motion sequence, particularly for sequences with temporal properties. Even though the recommended acquisition time of each motion is equal, the valid duration of each motion varies due to participant differences during the acquisition process. For some motions (such as raising a hand on a seat, standing up and raising a hand, standing up and sitting down, and knocking on a table), the ratio of the motion's valid segment to its total time segment is low, making it challenging to identify motion patterns and represent motion features. To accurately identify the motion mode of each motion, we must separate the sitting-still state from the valid-duration data. In this context, we propose an improved signal extraction algorithm based on the Dynamic Time Warping (DTW) algorithm [35], which we name the Voting-Based DTW (VB-DTW) valid segment extraction algorithm.

Since the valid motion segments are surrounded by "sitting still" data in this work, we must divide the raw motion data into small sequences to efficiently locate the valid segment, rather than processing an entire motion segment directly. To extract the valid segments, we divide the raw motion data into slices of length 50, which splits the entire 2000-sample motion sequence into 40 smaller slices. Using the VB-DTW algorithm, we compute the minimum warped path of every 2 adjacent slices, yielding a total of 39 warped path values for each motion from the 40 slices. The average warped path value of the motion is used as the threshold, and the combined vote of 4 neighboring warped paths is used to evaluate whether the slices correspond to valid motion clips. The effective segment length of the final motion is determined by connecting the extracted valid segments. In addition, to address the issue of the varied lengths of the extracted valid motion segments, we uniformly downsample them to a length of 285 to make model training easier. We apply the VB-DTW algorithm to the remaining 13 types of motions; since the complete motion sequence of sitting still is itself a valid segment, we directly downsample the sitting-still data to a length of 285. The whole process of the VB-DTW valid segment extraction algorithm is shown in Algorithm 1.
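The slicing-and-voting procedure can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: `dtw_distance` is a textbook dynamic-programming DTW cost, and the exact voting rule (how the 4 neighboring warped-path values are combined per slice) is an assumption based on the description above; names and the synthetic test signal are illustrative.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic DTW: minimum warped-path cost between two 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def vb_dtw_valid_mask(seq: np.ndarray, slice_len: int = 50,
                      vote_window: int = 4) -> np.ndarray:
    """Mark slices as valid motion when a majority of the neighboring
    warped-path values exceed the per-motion average (assumed voting rule)."""
    n_slices = len(seq) // slice_len  # 2000 / 50 = 40 slices
    slices = seq[: n_slices * slice_len].reshape(n_slices, slice_len)
    # 39 warped-path values between adjacent slices.
    paths = np.array([dtw_distance(slices[k], slices[k + 1])
                      for k in range(n_slices - 1)])
    threshold = paths.mean()  # average warped path as threshold
    valid = np.zeros(n_slices, dtype=bool)
    for k in range(n_slices):
        lo = max(0, k - vote_window // 2)
        hi = min(len(paths), k + vote_window // 2)
        votes = paths[lo:hi]
        valid[k] = votes.size > 0 and (votes > threshold).sum() >= votes.size / 2
    return valid

# Synthetic example: near-constant "sitting still" with a burst of motion.
rng = np.random.default_rng(1)
signal = rng.normal(0, 0.01, 2000)
signal[800:1200] += np.sin(np.linspace(0, 20, 400)) * 5
mask = vb_dtw_valid_mask(signal)
print(mask.shape)  # (40,)
```

The flagged slices would then be concatenated and uniformly downsampled to a length of 285 (e.g., by interpolation) to form the fixed-size network input described above.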
