(**a**) Crime prediction combining adjacent cells

(**b**) Crime prediction reflecting spatial similarity 

**Figure 1.** Proposed crime prediction using spatial clustering.

The flow of this study is shown in Figure 2. Dongjak District, Seoul, the target analysis area, is divided into a grid of 100 m × 100 m cells in GIS. After data on previous crimes and on the physical environment, such as facilities and land use, are assigned to each cell, the cells are clustered according to spatial similarity based on the physical environment data. Cells in which no crime occurred during the analysis period are removed because they might negatively impact training, and the remaining cells are used for training. Crime data are imbalanced because there are fewer records of where a crime occurred than of where it did not. Accordingly, resampling is used to mitigate the problems caused by the data imbalance. The preprocessed data are used to train a random forest, a tree-based machine learning algorithm, and the model with spatial clustering is compared with the general model.

**Figure 2.** Research flow.

#### **2. Theoretical Review**

Efforts have been made to prevent crime by identifying and mitigating its causes. Environmental criminology seeks to explain the causes of crime using the surrounding environment. The main theories employed are routine activity theory (RAT) [17] and crime pattern theory (CPT) [18]. RAT states that a crime occurs when a motivated offender, a suitable target, and the absence of a capable guardian converge in time and space. According to RAT, it is important to view individuals as motivated offenders and minimize their opportunities to commit crime. CPT states that people move in certain patterns because of their physical or social environment, such as their occupation. During their routine activities, motivated offenders identify the characteristics of these areas and suitable targets for crime, choose a suitable time and place, and then carry out the crime. Hence, crimes do not occur randomly; instead, they are concentrated in certain locations owing to specific factors and are influenced by the surrounding environment and the living patterns of individuals and their neighbors.

Wolfgang, Figlio, and Sellin [19], and Sherman, Gartin, and Buerger [20] found that approximately 50% of all crimes every year occur within 4% to 5% of all streets and explained that crime is spatially concentrated. Studies have also shown that the areas surrounding the place where a crime occurs are at risk of identical crimes, and the more recent the crime, the greater its influence [9–11]. Notably, Bernasco [10] reported that, among the crimes that occurred within 100 m of an earlier crime, 90% that occurred within seven days involved the same offender. Moreover, the same offender was involved in 64% of the crimes that occurred within 90 days, and 13% within nine years. This indicates that offenders familiar with the surrounding area are more likely to commit other crimes in neighboring locations.

Facilities and land use play a key role in the relationship between crime and the physical environment. These factors give people reasons to move and are closely related to individuals' living patterns. Brantingham and Brantingham [11] analyzed the correlation between commercial theft and facilities. In their results, blocks with supermarkets and department stores showed crime rates similar to those of blocks without these landmarks, whereas blocks with fast-food restaurants, traditional restaurants, and pubs had 2 to 2.5 times more commercial theft than blocks without these landmarks. Lee, Yoon, and Kim [13] analyzed the causes of crime according to crime type in specific cities in Korea. Visitor accommodation, restaurants, financial institutions, and homes in non-residential buildings were highly correlated with theft crimes. The related factors differed by crime type; CCTV, for example, showed a significant correlation only with rape and violence. Studies analyzing the impact of land use on crime are also underway. As with facilities, the related land-use factors differed by crime type, although commercial areas were found to be related to most crimes [14–16]. Stucky and Ottensmann [15] analyzed the relationship between land use and crimes such as violent crime, homicide, robbery, aggravated assault, and rape. The correlation with land use differed by crime type, with significant relationships found for crime, homicide, and aggravated assault. Kwon, Kwon, and Jung [16] examined the correlation between each crime type and land use by clustering theft crimes into detailed types according to the victim's gender, the time of occurrence, and the place of occurrence, and showed that the associated physical environment differed across these types. As such, crimes do not occur randomly but are shaped by identifiable factors; it is therefore important to identify the factors associated with environments in which crimes can occur with ease.

#### **3. Data and Methodology**

#### *3.1. Research Area and Analysis Unit*

Dongjak-gu, the research area, is one of the administrative districts of Seoul, the capital of South Korea. Its population density is 24,190/km², which is similar to that of Manhattan in the United States. Its residential areas are high-density and account for 84% of the total population, and there are 8.5 cases of violent crime per 1000 people. Although this rate ranks 17th among the 25 administrative districts, it is rather high, because most of the districts with higher crime rates have a developed entertainment industry.

To effectively perform crime prevention activities using crime prediction, it is important to set the analysis unit precisely so that crime prevention resources can be allocated to the appropriate locations. Accordingly, this study attempts to predict crime at the micro scale through grid-level analysis. Compared to administrative districts and census output areas, which are the spatial units used in existing statistical map services, grids have a uniform shape and size, allowing statistical information to be examined objectively. Moreover, grids can be flexibly adapted to changes in the map scale. This study used GIS to divide the target area into a grid of 100 m × 100 m cells; time- and space-related data were then added to each cell to perform the analysis.
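The grid assignment described above can be sketched as follows. This is a minimal illustration, not the GIS workflow used in the study: it assumes incident locations are already in a projected coordinate system measured in metres, and the function names, the grid origin, and the sample coordinates are hypothetical.

```python
import math

CELL_SIZE_M = 100.0  # cell edge length stated in the text


def cell_index(x_m, y_m, origin_x_m=0.0, origin_y_m=0.0, cell_size_m=CELL_SIZE_M):
    """Return the (column, row) of the grid cell containing a projected point.

    Assumption: coordinates are in metres and the grid origin is its
    lower-left corner. Cells are half-open, so a point on a shared edge
    belongs to the cell to its right or above.
    """
    col = math.floor((x_m - origin_x_m) / cell_size_m)
    row = math.floor((y_m - origin_y_m) / cell_size_m)
    return col, row


def count_per_cell(points, **grid):
    """Count how many points fall in each grid cell."""
    counts = {}
    for x_m, y_m in points:
        key = cell_index(x_m, y_m, **grid)
        counts[key] = counts.get(key, 0) + 1
    return counts


# Hypothetical incident coordinates (metres), for illustration only.
incidents = [(12.0, 40.0), (99.9, 0.0), (100.0, 0.0), (250.0, 310.5)]
theft_counts = count_per_cell(incidents)
```

Because every cell has the same shape and size, counts from different cells can be compared directly, which is the property the text attributes to grid-based units.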

#### *3.2. Crime Data*

Theft is generally considered easier to predict than other types of crime. Crimes such as murder and assault are highly influenced by ill feelings between the offender and the victim because the target is a specific individual. In contrast, since the target of theft is a specific building or object, it is influenced more by the surrounding environment and the behavioral characteristics of the criminal than by personal feelings [21]. This study therefore focuses on theft. With the cooperation of the police department with the relevant jurisdiction, data on incidents of theft in Dongjak-gu from 2013 to 2017 were used in the analysis. Figure 3 shows the monthly distribution of theft in Dongjak-gu. During this period, an average of 95 thefts occurred per month; the most occurred in March 2013 (199), and the fewest occurred in November 2016 (31). The theft data include the date, time, method, and exact location of the crime. Inaccurate data (such as cases with incorrect addresses and duplicate reports on the same date) were excluded. A total of 8023 theft cases were used for the analysis. Prior studies show that the more recent a crime, the greater its influence on future crimes. To let the predictive model learn this influence over time, this study calculated the average number of crimes that occurred in each cell over periods of two weeks, one month, three months, six months, and one year, and used these values for the training. In the grid-level analysis, cells in which crime never occurred were mainly areas (such as mountains or water) in which it was difficult for crime to occur. Such data can easily lead the model to predict that no crime will occur in areas with no previous record of crime, which may negatively impact the predictions [7,8]. Therefore, before training, the cells in which no crime occurred from 2013 to 2016 were removed. The 2017 crime record was excluded from this step, as it was used as the test set.
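The recency features described above can be sketched as follows. The text does not specify how the averages are normalized, so this sketch assumes a per-day average over each look-back window and approximates the windows as 14, 30, 90, 180, and 365 days; the function name, window labels, and sample dates are hypothetical.

```python
from datetime import date, timedelta

# Look-back windows named in the text, approximated in days (an assumption).
WINDOWS_DAYS = {"2w": 14, "1m": 30, "3m": 90, "6m": 180, "1y": 365}


def recency_features(crime_dates, reference_date):
    """Average daily number of crimes in one cell for each look-back window.

    Each window covers the `days` days immediately before `reference_date`
    (the reference date itself is excluded).
    """
    features = {}
    for name, days in WINDOWS_DAYS.items():
        start = reference_date - timedelta(days=days)
        n_crimes = sum(1 for d in crime_dates if start <= d < reference_date)
        features[name] = n_crimes / days
    return features


# Hypothetical theft dates for a single cell.
cell_crimes = [date(2016, 12, 25), date(2016, 12, 1), date(2016, 6, 15)]
feats = recency_features(cell_crimes, date(2017, 1, 1))
```

Shorter windows weight recent incidents more heavily per day, which is one way to expose the "more recent, more influential" pattern to a tree-based model.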

**Figure 3.** Distribution of theft, sorted by month, in Dongjak-gu.

#### *3.3. Physical Environment Data*

The data used in this study were provided by the National Spatial Data Infrastructure Portal (http://www.nsdi.go.kr, accessed on 15 January 2020) and the Open Data Plaza (https://data.seoul.go.kr/, accessed on 15 January 2020). The data on building usage comprise basic information on location, size, etc., and are categorized into 152 types according to building use. However, using data on all buildings for the training may degrade the model's performance while consuming extensive computing resources, because the model would be trained on data that are only weakly correlated with crime. Consequently, this study used only the building-use data related to restaurants, pubs, accommodations, banks, and residences, which previous studies have shown to be related to crime. Restaurants were categorized into general restaurants, where patrons stay for a long time to eat, and rest-area restaurants, which sell simple meals such as fast food. Residential buildings were categorized into single-family housing, multi-family housing, and apartments, depending on the type of residence. Additionally, CCTV and streetlights, which are factors influencing natural surveillance, and bus stops, which are known to induce crime because of crowding, were added to the facility variables for the training.

Regarding the information on land characteristics, data on land usage and officially assessed land prices (OALP) were used for the training. In South Korea, land use is divided into eight categories: general commercial areas, neighboring commercial areas, circulating commercial areas, first-class residential districts, second-class residential districts, third-class residential districts, semi-residential areas, and natural green belt zones. The facilities that can be built in each usage area, and their allowable sizes, are subject to different legal regulations. To apply the land usage data for the training, the area occupied by each usage category in each cell of the grid was converted to a percentage. In addition to land use, the average OALP of each cell was calculated and used as a variable. This is used as an indicator to identify the geographic continuity in the spatial clustering analysis. Table 1 lists the variables used in the study. Finally, the crime data from 2014 to 2016 were used as the training set and the crime data from 2017 as the test set, and the model's performance was evaluated.
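The land-use conversion described above can be sketched as follows. The text only says that areas were converted to percentages and that an average OALP was computed per cell, so this sketch assumes the percentage is taken relative to the full 100 m × 100 m cell and that the OALP average is an unweighted mean over the parcels in the cell; the category keys and numbers are hypothetical.

```python
CELL_AREA_M2 = 100.0 * 100.0  # area of one 100 m x 100 m cell


def land_use_percentages(area_by_category_m2, cell_area_m2=CELL_AREA_M2):
    """Convert the area of each land-use category in a cell to a percentage.

    Assumption: percentages are relative to the whole cell, so they may
    sum to less than 100 when part of the cell has no zoning record.
    """
    return {
        category: 100.0 * area_m2 / cell_area_m2
        for category, area_m2 in area_by_category_m2.items()
    }


def mean_oalp(parcel_prices):
    """Unweighted mean of the assessed land prices of parcels in a cell."""
    return sum(parcel_prices) / len(parcel_prices) if parcel_prices else 0.0


# Hypothetical cell: 25% general commercial, 60% second-class residential.
cell_areas = {"general_commercial": 2500.0, "second_class_residential": 6000.0}
cell_pct = land_use_percentages(cell_areas)
cell_price = mean_oalp([1000.0, 2000.0, 3000.0])
```

An area-weighted mean would be an equally plausible reading of "average OALP"; the unweighted version is used here only to keep the sketch short.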


**Table 1.** Feature selection.

#### *3.4. Spatially Constrained Clustering Methods*

Clustering is a data-mining technique that classifies the given data into multiple clusters, based on the similarity of their attributes. Because it is difficult for general clustering techniques to reflect the spatial continuity of data in a vector space such as GIS, researchers have been studying spatially constrained clustering methods [21,22] to solve this issue. One of them is the max-p regions model [23,24]; unlike the general clustering techniques that classify data into a limited number of clusters, this model aims to maximize the number of clusters that satisfy the minimum threshold of the constraint, while minimizing spatial heterogeneity in each cluster. This constraint is the minimum value of the variables (population size, number of houses, etc.) included in each instance, or the minimum number of instances that must be included in each cluster. To cluster the cells that are spatially similar and adjacent in distance, this study sets the number of cells that can be included in each cluster as the constraint. As a feature of the max-p regions model, a specific cluster can be prevented from growing excessively larger than the other clusters, and the land can be uniformly clustered while maintaining spatial continuity. Thus, the model can be effectively used for microscale analysis.

The max-p regions model is formulated as follows. First, let $A = \{A_1, A_2, \dots, A_n\}$, with $n = |A|$, be the set of areas covering the entire land area, and let $P_p = \{R_1, R_2, \dots, R_p\}$, with $1 \le p \le n$, be a partition of $A$ into $p$ regions. In this study, $l_i$ is the attribute of area $A_i$ that must at least reach the minimum threshold.

$$\begin{cases} |R_k| > 0 & \text{for } k = 1, 2, \dots, p \\ R_k \cap R_{k'} = \emptyset & \text{for } k, k' = 1, 2, \dots, p \text{ and } k \neq k' \\ \bigcup_{k=1}^{p} R_k = A \\ \sum_{A_i \in R_k} l_i \ge \text{threshold} > 0 & \text{for } i = 1, 2, \dots, n \text{ and } k = 1, 2, \dots, p \end{cases} \tag{1}$$

Here, the set of all feasible partitions of $A$ is denoted by $\Pi$. The max-p solution $P_p^*$ can then be defined as in Equation (2), where $H(P_p)$ is the total spatial heterogeneity of a partition $P_p \in \Pi$.

$$\begin{cases} |P_p^*| = \max\left( |P_p| : P_p \in \Pi \right) \\ \nexists\, P_p \in \Pi : |P_p| = |P_p^*| \text{ and } H(P_p) < H(P_p^*) \end{cases} \tag{2}$$
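The conditions in Equations (1) and (2) can be checked mechanically on a small example. The sketch below is illustrative only: it tests the four constraints of Equation (1) but does not test spatial contiguity, and because the text does not give a formula for the heterogeneity, it assumes $H$ is the sum of pairwise absolute attribute differences within each region. All function names and the toy data are hypothetical.

```python
def is_feasible(partition, areas, l, threshold):
    """Check the constraints of Equation (1) for a candidate partition.

    partition: list of sets of area ids; l: dict of area id -> l_i.
    Spatial contiguity of each region is NOT checked in this sketch.
    """
    if any(len(region) == 0 for region in partition):
        return False  # every region must be non-empty
    seen = set()
    for region in partition:
        if seen & region:
            return False  # regions must be pairwise disjoint
        seen |= region
    if seen != set(areas):
        return False  # regions must cover A exactly
    # each region's summed attribute must reach the threshold
    return all(sum(l[a] for a in region) >= threshold for region in partition)


def heterogeneity(partition, attrs):
    """Assumed H: sum of pairwise absolute differences within each region."""
    total = 0.0
    for region in partition:
        members = sorted(region)
        for i, a in enumerate(members):
            for b in members[i + 1:]:
                total += abs(attrs[a] - attrs[b])
    return total


def is_better(p1, p2, attrs):
    """Equation (2) ordering: more regions first, then lower heterogeneity."""
    if len(p1) != len(p2):
        return len(p1) > len(p2)
    return heterogeneity(p1, attrs) < heterogeneity(p2, attrs)


# Toy example: l_i = 1 per cell, so the threshold is a minimum cell count.
areas = ["a", "b", "c", "d"]
l = {a: 1 for a in areas}
attrs = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 5.1}
p_good = [{"a", "b"}, {"c", "d"}]
p_bad = [{"a", "c"}, {"b", "d"}]
p_one = [{"a", "b", "c", "d"}]
p_infeasible = [{"a"}, {"b", "c", "d"}]
```

With a threshold of two cells per region, `p_good` and `p_bad` both have the maximum number of regions, and `p_good` is preferred because similar cells are grouped together.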

In this study, facility and land data were inserted into the grid-divided area and used as variables for the max-p regions model, through which cells with geographically similar characteristics were clustered. Based on this, in the machine learning step, crimes that occurred in the same cluster were used as a prediction variable to reflect the influence of crimes that occurred in the adjacent land during the training. Figure 4 shows an example of the max-p regions model, and Table 2 shows an example of average attributes for each cluster. To ensure that the cell to be predicted and cells that are physically far away do not belong to the same cluster, the number of cells *n* belonging to each cluster was set between 2 and 10.
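The cluster-level crime variable described above can be sketched as follows. The text does not say whether a cell's own crimes are counted in its cluster variable, so the sketch exposes this as a hypothetical `include_self` flag; the function name, cell ids, and counts are also illustrative.

```python
def cluster_crime_feature(cell_to_cluster, crimes_per_cell, include_self=False):
    """For each cell, total the crimes recorded in the cells of its cluster.

    cell_to_cluster: dict of cell id -> cluster id (e.g., from max-p).
    crimes_per_cell: dict of cell id -> crime count (missing means 0).
    include_self: whether a cell's own crimes count toward its feature
                  (an assumption; the source does not specify this).
    """
    totals = {}
    for cell, cluster in cell_to_cluster.items():
        totals[cluster] = totals.get(cluster, 0) + crimes_per_cell.get(cell, 0)

    feature = {}
    for cell, cluster in cell_to_cluster.items():
        own = crimes_per_cell.get(cell, 0)
        feature[cell] = totals[cluster] if include_self else totals[cluster] - own
    return feature


# Hypothetical clusters of sizes 3 and 2, within the 2-to-10 cell range.
cell_to_cluster = {"c1": 0, "c2": 0, "c3": 0, "c4": 1, "c5": 1}
crimes_per_cell = {"c1": 2, "c3": 1, "c4": 4}
neighbor_crimes = cluster_crime_feature(cell_to_cluster, crimes_per_cell)
```

In this example, cell `c2` has no recorded crime of its own but receives a value of 3 from its cluster, which is how crimes on similar adjacent land can inform the prediction for that cell.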

**Figure 4.** Example of a max-p regions model (*n* = 4).


**Table 2.** Example of average attributes for each cluster in the max-p regions model.
