#### *3.5. Resampling*

Imbalanced data are data in which the class distribution is heavily concentrated in one class. In imbalanced data, the minority classes may be treated as noise during training, so the classification does not proceed correctly, which can adversely affect performance [25]. The theft data used in this study are also imbalanced: only approximately 10% of the total data correspond to the theft class. Accordingly, this study used random undersampling and the synthetic minority oversampling technique (SMOTE) to address the problems caused by data imbalance.

Random undersampling is a resampling technique that randomly deletes instances from the majority class to balance its distribution with the minority class. When the training set is large, reducing the number of samples increases the learning speed and reduces the data volume. However, because this technique deletes data, there is a risk of information loss. SMOTE is an oversampling technique that interpolates between instances in the minority class to create new instances and thereby balance the data. Although this results in slower training than undersampling, there is no risk of data loss, and overfitting is less likely than with random oversampling, which simply replicates minority-class instances at random.
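The two resampling ideas can be sketched as follows. This is a minimal NumPy-only illustration, not the study's implementation: the function names are hypothetical, and in practice a library such as imbalanced-learn (`RandomUnderSampler`, `SMOTE`) would be used.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, majority_label):
    """Randomly delete majority-class instances until classes are balanced."""
    maj_idx = np.flatnonzero(y == majority_label)
    min_idx = np.flatnonzero(y != majority_label)
    keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([keep, min_idx])
    return X[idx], y[idx]

def smote_like_oversample(X_min, n_new, k=5):
    """Create synthetic minority samples by interpolating between a
    minority instance and one of its k nearest minority neighbours."""
    n = len(X_min)
    # Pairwise distances within the minority class; exclude self-matches.
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)               # pick a minority instance
        j = rng.choice(neighbours[i])     # pick one of its k neighbours
        lam = rng.random()                # interpolation factor in [0, 1)
        new.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.asarray(new)
```

Note that the interpolation step places each synthetic point on the line segment between two existing minority instances, which is what keeps SMOTE from merely duplicating data the way random oversampling does.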

#### *3.6. Model Training and Evaluation*

#### 3.6.1. Model Training

A random forest, a tree-based machine learning algorithm, was trained on the data preprocessed via the above procedure. Random forest, an ensemble technique widely used in general classification problems, creates multiple decision trees and combines the output of each tree. This study used the random forest technique to build crime prediction models and then compared them.
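As a minimal sketch of this step, the following fits scikit-learn's `RandomForestClassifier` on synthetic stand-in data; the feature dimensions and labels are assumptions for illustration only, not the study's actual crime features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the preprocessed data: 500 instances,
# 6 features, and a binary theft / non-theft label.
X = rng.normal(size=(500, 6))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# An ensemble of decision trees: each tree is built on a bootstrap
# sample, and the forest combines the individual tree outputs by vote.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```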

First, as the range of values differed across variables, the data were normalized using min-max scaling. The ratio between the training set and test set is generally set between 7:3 and 8:2; nevertheless, this is flexible, depending on the amount of data and the research method. Because the purpose of crime prediction is to predict future crimes based on those that occurred in the past, the data from 2014 to 2016 were used as the training set, and those from 2017 were used as the test set. K-fold cross-validation was applied to each model during training to prevent the bias and overfitting that can occur when training is repeatedly performed using only fixed training and test sets [26,27]. In k-fold cross-validation, the training set is divided into k folds, and training and validation are performed sequentially across them. The value of k typically ranges from five to ten; in this study, it was set to five. After cross-validation, the parameters of each model were tuned to obtain optimal performance. The GridSearchCV function of the Python scikit-learn library was used to tune the parameters and find those with optimal performance for each model.
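The scaling, temporal split, and 5-fold grid search described above can be sketched as one scikit-learn pipeline. The synthetic data, the `year` column, and the parameter grid values here are illustrative assumptions, not the study's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: a 'year' value per instance drives the temporal
# split (2014-2016 for training, 2017 for testing).
years = rng.integers(2014, 2018, size=400)
X = rng.normal(size=(400, 5))
y = (X[:, 0] > 0).astype(int)

train, test = years < 2017, years == 2017
X_train, y_train = X[train], y[train]
X_test, y_test = X[test], y[test]

# Min-max scaling sits inside the pipeline so the scaler is refit on
# each cross-validation fold; GridSearchCV runs 5-fold CV (k = 5).
pipe = Pipeline([("scale", MinMaxScaler()),
                 ("rf", RandomForestClassifier(random_state=0))])
grid = GridSearchCV(pipe,
                    param_grid={"rf__n_estimators": [50, 100],
                                "rf__max_depth": [None, 10]},
                    cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```

Placing the scaler inside the pipeline matters: scaling before the split would leak the 2017 test-year value ranges into training.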

#### 3.6.2. Model Evaluation

Because the crime data used in training were imbalanced, evaluating the models by accuracy alone, a classification indicator computed over the entire dataset, made it difficult to determine how well the minority class was predicted [28]. Therefore, evaluation methods suitable for imbalanced data must be considered. This study evaluated the performance of each model using a confusion matrix [29,30], which is widely used when evaluating the performance of general algorithms and of models trained on imbalanced data. The confusion matrix compares the classes predicted by the model with the actual classes in the data and classifies each prediction as a true negative (TN), true positive (TP), false positive (FP), or false negative (FN). From these counts, the precision and recall values were obtained, and their harmonic mean was taken to calculate the F1 score. The accuracy and F1 score of each model were compared to evaluate prediction performance (Figure 5).
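The evaluation metrics above follow directly from the four confusion-matrix counts. A minimal NumPy sketch (the function name is hypothetical; scikit-learn's `confusion_matrix` and `f1_score` would normally be used):

```python
import numpy as np

def confusion_and_f1(y_true, y_pred, positive=1):
    """Build the binary confusion-matrix counts and derive
    precision, recall, F1 (harmonic mean), and accuracy."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    tn = np.sum((y_pred != positive) & (y_true != positive))
    precision = tp / (tp + fp)           # of predicted positives, how many real
    recall = tp / (tp + fn)              # of real positives, how many found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn,
            "precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

On imbalanced data the two metrics diverge: a model that always predicts the majority class can score high accuracy while its F1 for the minority (theft) class collapses, which is exactly why both are compared here.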


**Figure 5.** Example of a confusion matrix.
