*3.2. Data Preprocessing*

Python programming language platform is utilized for coding the proposed approach, and its various libraries like NumPy, pandas for better insight of data [30]. Different steps are taken to preprocess the imbalanced dataset, firstly by scaling and data cleaning by deleting ids, dropping duplicating rows, and filling all NA values. Moreover, categorical features are mapped to numbers. Furthermore, to convert the text features like (stress buster, what you miss most), pretrained bert is utilized for generating word vectors. Then words are mapped to a single feature by following the normalization formula as:

$$\chi = \frac{sum(vector)}{\max(vector) - min(vector)}.\tag{1}$$

Figures 2 and 3 represents variable count after and before sampling, whereas SMOTE (Synthetic minority oversampling technique) addresses imbalance class issues very effectively in various domains of research [31]. SMOTE oversampling technique is applied to resample student's datasets for COVID-19. Based on feature space similarity, the SMOTE approach combines extra minority samples [32]. Let *k* = nearest neighbor for xi using Euclidean distance.

Random Selection of *k* nearest neighbor Feature vector difference between *k* and *xi* Adding M in *xi* Equation (2) presents the formula for calculating SMOTE This is example 2 of an equation:

$$\mathfrak{x}\_{new} = \mathfrak{x}\_{i} \left( \mathfrak{x}\_{i}^{k} - \mathfrak{x}\_{i} \right) \times \delta. \tag{2}$$

*xk <sup>i</sup>* = A nearest neighbors of *xi*, and *δ* is an arbitrary value belongs to (0, 1).

**Figure 2.** Variable count before sampling.
