**4. Results**

#### *4.1. Model Prediction Results*

Table 3 lists the model prediction results for each combination of clustering and resampling method. The models using spatial clustering achieved higher F1 scores than the calculation method that combined the adjacent cells. This suggests that the physical environmental factors that influence crime differ according to the detailed method of crime. Previous studies have found that a crime involving the same offender is likely to occur near the location of the original crime, which indicates that a repeat of the crime is more likely in areas whose spatial features are similar to those of the area where the crime occurred. For both the SMOTE and random undersampling techniques, the F1 score was highest when the minimum threshold for a cell was *n* = 6, at 33.85% and 34.90%, respectively, an increase of approximately 2% over the method combining the adjacent cells. Among the models using the max-p method, the SMOTE-based model showed a regular pattern in which the F1 score gradually decreased as the threshold moved away from *n* = 6, whereas the F1 score of the random undersampling-based model varied irregularly with the *p*-value. This low stability arises because random undersampling deletes instances at random. The pattern of the F1 score in the SMOTE-based model indicates that performance may decrease if the *p*-value is too small or too large, and that there is a value that yields the optimal performance.
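The difference between the two resampling strategies compared above can be illustrated with a minimal sketch. It is not the study's implementation: the toy data, the function names `random_undersample` and `smote_like_oversample`, and the neighbour count `k` are all assumptions made for illustration, and the oversampler reproduces only the core interpolation idea of SMOTE.

```python
import math
import random


def random_undersample(majority, minority, rng):
    """Randomly keep only as many majority rows as there are minority rows."""
    kept = rng.sample(majority, k=len(minority))
    return kept, list(minority)


def smote_like_oversample(majority, minority, rng, k=3):
    """Add synthetic minority rows until both classes have the same size.

    Each synthetic row lies on the line segment between a minority row and
    one of its k nearest minority neighbours (the core idea of SMOTE).
    """
    synthetic = []
    while len(minority) + len(synthetic) < len(majority):
        base = rng.choice(minority)
        neighbours = sorted(
            (row for row in minority if row is not base),
            key=lambda row: math.dist(base, row),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return list(majority), list(minority) + synthetic


rng = random.Random(0)
# Hypothetical toy data: 20 "no crime" cells and 4 "crime" cells, 2 features each.
cold = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(20)]
hot = [(rng.uniform(2, 3), rng.uniform(2, 3)) for _ in range(4)]

under_cold, under_hot = random_undersample(cold, hot, rng)
over_cold, over_hot = smote_like_oversample(cold, hot, rng)

print(len(under_cold), len(under_hot))  # 4 4
print(len(over_cold), len(over_hot))    # 20 20
```

Undersampling discards 16 of the 20 majority rows at random, so a different seed keeps a different subset, which is in line with the unstable F1 pattern described above. The oversampler keeps every original row and only adds interpolated minority rows.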

**Table 3.** Model performance according to resampling method and minimum threshold.




Comparing the average accuracy and F1 scores of the models by resampling method, the SMOTE and random undersampling methods showed accuracies of 86.81% and 77.14% and F1 scores of 32.97% and 33.54%, respectively. The SMOTE method therefore had an accuracy roughly 10 percentage points higher and an F1 score roughly 0.6 percentage points lower than the random undersampling method. The random undersampling-based model showed a recall of approximately 55% to 62%, meaning that it captured a large share of the actual crime instances. However, its precision and accuracy were generally lower than those of the SMOTE-based model, showing that its ability to predict crime accurately was inadequate. Figure 6 shows the models' prediction results for each resampling method as a confusion matrix (*n* = 6). The value in the second quadrant is the number of instances correctly predicted as cold spots (i.e., where no crime occurred), and the value in the fourth quadrant is the number of instances correctly predicted as hot spots (i.e., where crime occurred). The value in the first quadrant is the number of instances incorrectly predicted as hot spots when the actual data were cold spots. The value in the third quadrant is the reverse (i.e., instances incorrectly predicted as cold spots when the actual data were hot spots). With the SMOTE method, 204 of the 584 instances predicted as the crime class were correct (a precision of about 35%), whereas with the random undersampling method, 386 of the 1591 instances were correct (about 24%). Because random undersampling deletes data from the majority class, the resulting information loss makes precise prediction difficult.

**Figure 6.** Confusion matrix results according to the resampling method (*n* = 6).
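The precision figures implied by the confusion-matrix counts, and the gaps between the average scores, can be checked with a short calculation. The sketch below uses only the numbers reported in the text; the helper names `precision` and `f1_score` are illustrative, and the two recall values are simply the endpoints of the approximate range reported for random undersampling.

```python
def precision(true_positive, predicted_positive):
    """Share of instances predicted as crime that were actually crime."""
    return true_positive / predicted_positive


def f1_score(precision_value, recall_value):
    """Harmonic mean of precision and recall."""
    return (2 * precision_value * recall_value
            / (precision_value + recall_value))


# Counts reported for n = 6: correct hot-spot predictions / all hot-spot predictions.
smote_precision = precision(204, 584)
rus_precision = precision(386, 1591)
print(f"SMOTE precision: {smote_precision:.1%}")                  # 34.9%
print(f"Random undersampling precision: {rus_precision:.1%}")    # 24.3%

# Gaps between the average scores, in percentage points.
accuracy_gap = round(86.81 - 77.14, 2)
f1_gap = round(33.54 - 32.97, 2)
print(accuracy_gap, f1_gap)  # 9.67 0.57

# F1 implied by the random undersampling precision at both ends of the
# approximate recall range reported in the text (55% to 62%).
f1_low = f1_score(rus_precision, 0.55)
f1_high = f1_score(rus_precision, 0.62)
print(f"{f1_low:.3f} {f1_high:.3f}")
```

The calculation shows why the two methods end up with similar F1 scores despite very different accuracies: random undersampling trades precision for recall, and F1 balances the two.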

#### *4.2. Feature Importance*

With the random forest algorithm, the feature importance function can be used to express numerically how much each variable contributes to the prediction. This study therefore used the function to analyze the relative importance of each variable. The analysis showed that the distribution of feature importance varied with the resampling method (Figure 7). Feature importance was more evenly distributed under the random undersampling method than under the SMOTE method. Because the random undersampling method reduces the size of the entire training dataset, the model is more sensitive to the features of the data with fewer samples.

**Figure 7.** Feature importance by resampling method.

For both random undersampling and SMOTE, the time-related variables had the highest importance. Among these, the average number of crimes that occurred in the cell over the previous year was the most important. Because crimes generally occur infrequently, a shorter analysis period gives the model less information to learn from this variable. In contrast, the variables for crimes that occurred within the cluster showed different patterns depending on the resampling method. With random undersampling, the importance of the variables for the average number of crimes over a given period rose as the period lengthened from six months to nine months to one year. With the SMOTE method, however, the influence of recent crimes was high at three, six, and nine months. Because the SMOTE method generates new instances by interpolating the data, the model can be trained on a large amount of data. Moreover, because the crime data created in the clustered instances are used together for training, sufficient crime-related information can be obtained even for short periods. Although more recent crimes are generally known to have a greater influence on future crimes, crime prediction research using machine learning must configure the time-related variables appropriately, with the training of the algorithm in mind.

Among the physical environment-related variables, general restaurants showed the highest importance under random undersampling, followed by rest-area restaurants and pubs. Under the SMOTE method, the order of importance was rest-area restaurants, general restaurants, and pubs. In contrast, the importance of residential buildings, banks, and CCTV-related facilities was relatively low. Consistent with the findings of previous studies on the influence of the surrounding environment, the likelihood of becoming a target of repeated crime is high when places where people frequently engage in routine activities lack sufficient factors to deter crime. Therefore, it is necessary to identify places where crime is spatiotemporally concentrated based on crowded spaces, predict where crime is likely to occur, and strengthen crime prevention activities in those places.
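The time-related variables discussed above (average number of crimes over the previous three, six, nine, and twelve months) could be derived along the following lines. This is a sketch under stated assumptions rather than the study's feature pipeline: the monthly counts are hypothetical, and the function name `trailing_mean` and the feature names are invented for illustration.

```python
def trailing_mean(monthly_counts, window):
    """Mean crimes per month over the last `window` months of history."""
    if window > len(monthly_counts):
        raise ValueError("not enough history for this window")
    recent = monthly_counts[-window:]
    return sum(recent) / window


# Hypothetical monthly crime counts for one cell, oldest first (12 months).
history = [0, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0, 1]

# One feature per look-back window, matching the periods named in the text.
features = {f"avg_{w}m": trailing_mean(history, w) for w in (3, 6, 9, 12)}
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

With sparse counts like these, a short window often contains only one event or none, which illustrates why short-period variables carry little information unless the training data are enlarged.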
