Integrating Machine Learning Techniques for Enhanced Safety and Crime Analysis in Maryland

Bandpey, Zeinab; Piri, Soroush; Shokouhian, Mehdi

doi:10.3390/app15094642

by Zeinab Bandpey ^1,*, Soroush Piri ^2 and Mehdi Shokouhian ^1

^1 Department of Civil Engineering, Morgan State University, Baltimore, MD 21251, USA

^2 Department of Architecture, Urbanism, and Built Environments, Morgan State University, Baltimore, MD 21251, USA

^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4642; https://doi.org/10.3390/app15094642

Submission received: 13 February 2025 / Revised: 31 March 2025 / Accepted: 3 April 2025 / Published: 23 April 2025

(This article belongs to the Special Issue Novel Applications of Machine Learning and Bayesian Optimization)

Abstract

This study advances crime analysis methodologies in Maryland by leveraging sophisticated machine learning (ML) techniques designed to cater to the state’s varied urban, suburban, and rural contexts. Our research utilized an enhanced combination of machine learning models, including random forest, gradient boosting, XGBoost, extra trees, and advanced ensemble methods like stacking regressors. These models have been meticulously optimized to address the unique dynamics and demographic variations across Maryland, enhancing our capability to capture localized crime trends with high precision. Through the integration of a comprehensive dataset comprising five years of detailed police reports and multiple crime databases, we executed a rigorous spatial and temporal analysis to identify crime hotspots. The novelty of our methodology lies in its technical sophistication and contextual sensitivity, ensuring that the models are not only accurate but also highly adaptable to local variations. Our models’ performance was extensively validated across various train–test split ratios, utilizing R-squared and RMSE metrics to confirm their efficacy and reliability for practical applications. The findings from this study contribute significantly to the field by offering new insights into localized crime patterns and demonstrating how tailored, data-driven strategies can effectively enhance public safety. This research importantly bridges the gap between general analytical techniques and the bespoke solutions required for detailed crime pattern analysis, providing a crucial resource for policymakers and law enforcement agencies dedicated to developing precise, adaptive public safety strategies.

Keywords: crime analysis; machine learning (ML); predictive modeling; crime hotspots; spatial analysis; Maryland crime data; public safety; data preprocessing; data analysis

1. Introduction

Over the past decade, the residents of Maryland’s urban areas have witnessed a significant 12% increase in violent crimes [1]. Meanwhile, property crimes, while generally in decline, still account for over 60% of total reported incidents statewide [2]. The complexity of these crime dynamics is further compounded by the underreporting and classification challenges that often plague traditional crime databases [3]. The disparity in crime reporting and response effectiveness across Maryland’s various regions further motivates the need for improved analytical methods.

Crime analysis has traditionally relied on historical data and statistical methods to uncover trends and predict future criminal activity. However, with the rise of large and complex datasets, conventional approaches often fail to capture the intricacies of crime patterns, especially in a state such as Maryland, where dense urban centers sit alongside suburban and rural communities. In response, machine learning (ML) has emerged as a powerful tool capable of analyzing large volumes of data while uncovering the most influential factors, non-linear relationships, and hidden patterns that traditional methods might overlook.

The intersection of crime analysis and machine learning is particularly relevant for Maryland, a state characterized by diverse urban, suburban, and rural areas, each with its own crime dynamics. Maryland’s urban centers, including Baltimore, face particular challenges with violent crime, while its suburban and rural regions encounter different types of criminal activity, such as property crimes and drug-related offenses. Machine learning models offer the potential to analyze and predict these varying crime patterns at a granular level, giving law enforcement and policymakers the tools needed to address crime proactively [4,5].

This study introduces a novel approach by tailoring advanced machine learning models to specifically address Maryland’s unique urban–rural composition and demographic diversity. Unlike existing studies that often generalize findings from large metropolitan areas like New York or Los Angeles, our research focuses on the distinct challenges faced by Maryland. We differentiate our methodology by combining random forest, gradient boosting, and XGBoost 3.0—optimized not just for predictive accuracy but also for their applicability across Maryland’s varied landscapes—to capture nuances in crime patterns that generic models typically miss [6,7].

This tailored approach allows us to provide more actionable insights for law enforcement and policymakers, enabling them to develop more precise, data-driven public safety strategies that are specifically effective for Maryland’s unique settings. Furthermore, by exploring crime patterns across different demographic factors, including age, gender, and race, utilizing comprehensive crime data collected within the state, this study aims to identify and inform interventions that address disparities in crime rates and victimization, thereby fostering more equitable and effective public safety measures [8].

The societal impact of applying machine learning to crime analysis in Maryland is profound. By harnessing the predictive power of these models, law enforcement agencies can anticipate crime hotspots, understand demographic factors influencing crime, and implement proactive interventions [9]. This approach not only aids in identifying underlying socio-economic factors contributing to crime but also assists policymakers in designing more targeted and effective crime prevention programs, fostering safer communities across the state.

The findings of this study are intended to inform law enforcement and community stakeholders, aiding in the crafting of more effective interventions that address the specific needs of these demographics. By highlighting areas with pronounced crime rates and identifying trends over time, we provide a foundation for future research and policymaking that is responsive to the changing landscape of urban crime. Through this investigation, we seek to contribute to the broader discourse on public safety, equity, and justice, ultimately aiming to foster a safer and more inclusive community for all residents [10,11,12].

To guide readers through the rest of the paper, the remaining sections are structured as follows. First, we review the relevant literature on crime analysis and machine learning, establishing the context and identifying gaps in the literature addressed by our work. Next, we detail the methodology, including the data sources, preprocessing steps, and configuration of the machine learning models employed. Subsequently, we present the results of our data analysis, highlighting key findings on crime trends and model performance across Maryland. This is followed by a discussion of the implications of these findings, with a focus on differences between urban and rural crime dynamics. Finally, we conclude the paper by summarizing its contributions and outlining the study’s limitations, as well as identifying directions for future research.

2. Literature Review

Recent advancements in crime analysis underscore the importance of robust data collection and precise analytical models. The Barcelona Victimization Survey (2015–2020), for instance, highlights the significance of comprehensive data collection in understanding community safety and crime dynamics across neighborhoods [13]. This survey, alongside other studies [14], provides insights that are often overlooked with traditional police records, emphasizing the need for supplementary data sources in crime analysis.

The reliability of crime data has been enhanced through the application of models such as the quasi-simplex model (QSM) and its multi-item extension (MI-QSM). These models decompose observed crime data variance into specific components, representing true scores, method effects, and random errors, using Bayesian estimation techniques [15,16]. Such methodological rigor is crucial for enhancing data validity and adaptability for analyzing regional crime trends [17,18].
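
As a rough sketch of the decomposition described above, and assuming the three components are mutually uncorrelated, the observed score for item j at time t can be written as follows. The notation is illustrative and is not taken from the cited works; the last expression is the first-order autoregressive structure usually associated with quasi-simplex models.

```latex
y_{jt} = \tau_t + m_j + \varepsilon_{jt}, \qquad
\operatorname{Var}(y_{jt}) = \operatorname{Var}(\tau_t) + \operatorname{Var}(m_j) + \operatorname{Var}(\varepsilon_{jt}), \qquad
\tau_t = \beta_t \, \tau_{t-1} + \zeta_t
```

Under these assumptions, the reliability of an item at time t is the share of observed variance attributable to the true score, Var(τ_t) / Var(y_jt).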

While studies like these focus on survey data, recent research underscores the value of integrating multiple datasets for improved crime prediction [19]. Our approach advances beyond traditional spatiotemporal co-kriging (ST-Cokriging) methods, which combine crime records with high-resolution activity data from police operations, improving short-term forecasting accuracy and enabling the better prediction of hotspots and targeted interventions [20]. Unlike the ST-Cokriging method, which primarily improves short-term forecasting accuracy, our models—utilizing random forest, gradient boosting, and XGBoost—integrate a broader range of temporal and spatial data, enhancing our ability to predict crime hotspots and trends over longer periods and across more diverse settings.

Deng and colleagues further advanced predictive methods by introducing spatiotemporal lag variables, which mitigate the spatial and temporal dependencies embedded in crime data. Their study demonstrated that accounting for these dependencies using tree-based machine learning models significantly enhances predictive accuracy, particularly when applied to data from Dallas, Texas. By modeling both environmental and demographic factors, they provided a more accurate representation of crime dynamics and offered practical guidance for proactive urban security strategies [21]. Our study builds on these advancements by not only accounting for spatial and temporal dependencies but also incorporating machine learning algorithms that adapt dynamically to changes in crime patterns due to seasonal and socioeconomic factors. This provides a more detailed and adaptable model than those typically employed in studies like the one conducted in Dallas, Texas.
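
To make the idea of spatiotemporal lag variables concrete, the following minimal sketch builds a temporal lag (the previous period's count in the same area) and a spatial lag (the mean of the neighbors' counts in the previous period). The function name, the toy counts, and the adjacency matrix are hypothetical and are not taken from the cited study.

```python
import numpy as np

def spatiotemporal_lags(counts, adjacency):
    """Build lag features from a (T, N) array of counts and an (N, N) 0/1 adjacency matrix.

    Returns (temporal_lag, spatial_lag, target), each of shape (T - 1, N).
    """
    counts = np.asarray(counts, dtype=float)
    adjacency = np.asarray(adjacency, dtype=float)
    row_sums = adjacency.sum(axis=1, keepdims=True)
    # Row-normalize so each spatial lag is a mean over neighbors; isolated areas get zeros.
    weights = np.divide(adjacency, row_sums,
                        out=np.zeros_like(adjacency), where=row_sums > 0)
    temporal_lag = counts[:-1]              # same area, previous period
    spatial_lag = counts[:-1] @ weights.T   # neighbors' mean, previous period
    target = counts[1:]                     # value to be predicted
    return temporal_lag, spatial_lag, target

# Hypothetical counts for 3 periods and 3 areas arranged in a chain (0-1-2).
counts = [[10, 20, 30],
          [12, 18, 33],
          [11, 25, 29]]
adjacency = [[0, 1, 0],
             [1, 0, 1],
             [0, 1, 0]]
temporal_lag, spatial_lag, target = spatiotemporal_lags(counts, adjacency)
```

The resulting columns could then be stacked next to other predictors before fitting a tree-based model.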

Current research divides temporal forecasting into short-term, medium-term, and long-term categories, employing methods such as LASSO regression and neural networks. Spatial prediction operates across micro-, meso-, and macro-levels using models like kernel density estimation (KDE) and risk terrain modeling (RTM), enabling the efficient management of police resources by forecasting high-crime areas and trends [22]. The review emphasized the importance of integrating reinforcement learning techniques and SHapley Additive exPlanations (SHAP) to enhance the interpretability and practicality of crime prediction models. This multi-scale approach underscores the complexities in crime dynamics, particularly in regions with diverse urban and rural environments, such as Maryland, where nuanced and adaptable models are required to account for varying crime patterns [23].

In alignment with these studies, subsequent authors proposed an artificial intelligence model for predicting per capita violent crimes in urban areas. This model integrated socioeconomic and law enforcement data to generate accurate crime forecasts, optimizing resource allocation by leveraging a genetic programming (GP) framework enhanced with local search optimization. The system was tested across various US cities, demonstrating a lower error rate compared to other state-of-the-art models, highlighting its suitability for large datasets and the evolving needs of smart city development. A crucial innovation in their approach is the use of semantic genetic programming, which improves model interpretability by focusing on the behavior of programs rather than just their syntax. The study’s findings confirm the effectiveness of AI-driven crime forecasting methods in guiding resource allocation and improving urban security [24].

The use of spatiotemporal prediction, including micro-, meso-, and macro-level classifications, provides a framework for proactively managing crime through optimal patrol routes and targeted intervention strategies [25]. However, limitations persist, such as the lack of standardized evaluation systems and effective methods to handle sparse datasets. The integration of advanced machine learning models with socioeconomic and environmental data offers promising directions for future research [26].

Significant discrepancies in crime rate calculations can arise when using different population bases. Studies comparing residential and workday population-based crime rates reveal varying spatial patterns of crime hotspots. This variation emphasizes the importance of considering population context and selecting appropriate denominators, which can affect crime rate assessments and prevention strategies. Comparative analyses in Islington, London, highlighted variations across data sources, including police records, ambulance data, and synthetic crime data. These findings support the use of multiple data sources to achieve a comprehensive understanding of crime dynamics, particularly in regions like Maryland with urban and rural diversity [27,28].
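
The effect of the denominator can be illustrated with a short calculation. The sketch below assumes the common convention of expressing rates per 100,000 people; the incident count and both population figures are hypothetical.

```python
def rate_per_100k(incidents, population):
    """Crime rate per 100,000 people for a given population base."""
    if population <= 0:
        raise ValueError("population must be positive")
    return incidents * 100_000 / population

# Hypothetical commercial district: few residents, many daytime workers.
incidents = 450
residential_population = 30_000
workday_population = 90_000

residential_rate = rate_per_100k(incidents, residential_population)
workday_rate = rate_per_100k(incidents, workday_population)
```

With these made-up figures, the same 450 incidents yield a rate three times higher on the residential base than on the workday base, which is enough to move an area in or out of a hotspot ranking.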

Traditional metrics for crime concentration, such as the Gini coefficient, have been critiqued for their limitations in effectively capturing complex crime distribution patterns. Alternative probabilistic models, such as Poisson distributions, offer a more nuanced framework for understanding individual victimization rates, which can inform targeted crime prevention strategies [29]. The importance of advanced statistical methods to capture crime concentrations accurately has been reiterated in studies examining spatial distribution and temporal patterns [30].
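
One way to see the limitation of the Gini coefficient is to compare it against what pure chance would produce. The sketch below computes the Gini coefficient of a set of counts and estimates, by simulation, the value expected if every unit shared the same Poisson rate. This is an illustration of the general critique, not the procedure used in the cited studies; the function names and parameter values are assumptions.

```python
import numpy as np

def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    total = x.sum()
    if n == 0 or total == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return float(2 * np.sum(ranks * x) / (n * total) - (n + 1) / n)

def poisson_baseline_gini(n_units, mean_count, n_sims=200, seed=0):
    """Average Gini when every unit has the same Poisson rate (no true concentration)."""
    rng = np.random.default_rng(seed)
    sims = rng.poisson(mean_count, size=(n_sims, n_units))
    return float(np.mean([gini(row) for row in sims]))

even = gini([5, 5, 5, 5])            # identical counts
concentrated = gini([0, 0, 0, 10])   # one unit holds every incident
sparse_baseline = poisson_baseline_gini(n_units=500, mean_count=0.2)
dense_baseline = poisson_baseline_gini(n_units=500, mean_count=20.0)
```

When incidents are rare relative to the number of units, the chance-only baseline is already high, so a large observed Gini value does not by itself indicate concentration.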

Machine learning (ML) has become a transformative tool for crime prediction, especially in large metropolitan areas with comprehensive datasets. Foundational studies, such as those conducted in Los Angeles, employed ML to develop predictive policing models that identified crime hotspots using historical crime data. Techniques like logistic regression, decision trees, and k-nearest neighbors (k-NN) have demonstrated the ability to uncover patterns that can be overlooked with traditional statistical methods [31].

Recent studies have expanded ML applications by integrating socio-economic and urban metrics to enhance predictive accuracy. The integration of temporal and spatial data in predictive models has been shown to support early warning systems for identifying temporary crime hotspots. Additionally, ML has been used to predict specific types of crime, such as domestic violence, providing valuable insights for policy interventions [30].

Research in New York City and Chicago has demonstrated the benefits of integrating multi-source data for more comprehensive crime analysis. Studies have utilized ML models like random forests and gradient boosting to merge sociodemographic data with historical crime records, leading to improved accuracy in crime forecasting. This approach has been particularly effective for resource allocation and crime prevention strategies. The use of data from diverse sources, such as census information, economic indicators, and police records, has proven valuable for contextualizing crime trends [32].

Practical implementations, such as the deployment of geospatial data for patrol route planning, highlight the real-world impact of predictive policing. In Los Angeles and New York City, predictive models have been used not only to forecast hotspots but also to refine intervention strategies, contributing to a reduction in crime rates. These case studies validate the feasibility of incorporating machine learning for operational improvements in law enforcement [33].

In Chicago, predictive risk modeling has gone beyond location-based forecasting to individual-level crime prediction. By incorporating attributes such as prior criminal history, age, gender, and social networks, researchers have developed models capable of predicting whether an individual is likely to commit a crime, enabling targeted interventions. Such models contribute to early intervention and rehabilitation efforts aimed at reducing recidivism [34].

While significant progress has been made in applying ML for crime analysis in urban settings, there remains a gap in the literature in terms of research on regions with mixed urban, suburban, and rural characteristics, such as Maryland. These areas present unique challenges related to data quality and availability. The integration of community surveys, migration data, and socio-economic indicators can improve the robustness of predictive models in such contexts [35]. Researchers have suggested that adaptive and flexible models are essential for addressing these challenges [36].

The expansion of ML in crime prediction must also address ethical concerns and the potential for algorithmic bias. Studies have shown that without careful design, ML models may reinforce existing inequalities and lead to biased policing outcomes. Transparent model development and the inclusion of diverse stakeholder input are crucial for ensuring fairness and equity [37]. Additionally, the application of interpretable ML techniques can help build trust and facilitate the implementation of data-driven policies [38].

This review of the literature demonstrates that while ML has significantly advanced crime prediction and analysis, future work should focus on applying these methods in diverse regional contexts and addressing ethical challenges. Integrating multi-source data, refining model transparency, and incorporating policy-relevant insights are essential steps toward creating adaptable, equitable, and effective crime prevention strategies [39].

Despite advances in crime analysis, existing methodologies often struggle with integrating data from heterogeneous sources and accurately predicting crime hotspots. Previous studies have demonstrated the utility of machine learning (ML) techniques, but they frequently rely on limited data types or overlook the complexities introduced by diverse urban and rural environments. Our research addresses these gaps by utilizing a more comprehensive dataset that includes police reports, crime databases, and demographic statistics. Moreover, most existing studies do not sufficiently explore the predictive accuracy of their models across different crime types and regions, nor do they fully assess the practical implications of their findings for law enforcement and public safety. Our study aims to fill these critical gaps by providing a nuanced analysis of crime patterns and predictive model performance, thereby offering actionable insights that can significantly enhance crime prevention and safety strategies in Maryland. This approach not only bridges this gap in the literature but also contributes directly to more effective and informed public policy and safety measures [40].

Building on the foundational work discussed, our study introduces significant innovations in the application of machine learning to crime analysis. Our research integrates a comprehensive range of socioeconomic, demographic, and environmental factors using advanced machine-learning techniques such as random forest, gradient boosting, and XGBoost. This integration allows for more detailed and adaptable modeling of crime patterns, particularly in Maryland’s diverse urban and rural landscapes. Furthermore, our methodology uniquely addresses the challenges of data sparsity and the need for model interpretability, which have been persistent limitations of earlier research. By employing a multi-source data framework that enhances the predictive accuracy and applicability of our models, our study not only fills critical gaps in the literature but also sets new benchmarks for effective and equitable crime prediction strategies. These contributions mark a significant step forward in the utilization of data science for public safety, offering robust tools that are adaptable to the dynamic nature of crime and its prevention.

3. Methodology

The methodology for this crime analysis research, shown in Figure 1, focuses on performing extensive data analysis and developing a robust predictive model to estimate the crime index in Maryland counties. This model integrates various stages of data preprocessing, feature engineering, and advanced machine learning techniques to ensure accurate and reliable predictions. By incorporating statistical methods, machine learning algorithms, and ensemble techniques, we aim to identify patterns of crime and predict crime trends effectively. This study specifically employs non-spatial predictive models, acknowledging that spatial correlations, which can significantly influence crime patterns, are not incorporated in the current analysis. This methodological focus was chosen to initially explore and understand the broader, non-spatial factors affecting crime rates across various counties, due to both the initial scope of our research and the limitations of the dataset available. Future research will aim to integrate spatial statistical models to address these limitations and provide a more comprehensive analysis of regional crime dynamics.
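
The general shape of such a pipeline can be sketched with scikit-learn. The example below stacks random forest, gradient boosting, and extra trees regressors and evaluates the stack with R-squared and RMSE at two train–test split ratios. It is a sketch under stated assumptions, not the configuration used in this study: the data are synthetic, the split ratios, hyperparameters, and Ridge meta-learner are placeholders, and XGBoost is omitted so that the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import (ExtraTreesRegressor, GradientBoostingRegressor,
                              RandomForestRegressor, StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a county-level feature matrix and crime index.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

def build_stack(seed=0):
    """Stacked ensemble of tree-based regressors with a linear meta-learner."""
    base_models = [
        ("rf", RandomForestRegressor(n_estimators=50, random_state=seed)),
        ("gb", GradientBoostingRegressor(random_state=seed)),
        ("et", ExtraTreesRegressor(n_estimators=50, random_state=seed)),
    ]
    return StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=3)

results = {}
for test_size in (0.2, 0.3):  # assumed split ratios, for illustration only
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=0)
    model = build_stack().fit(X_train, y_train)
    predictions = model.predict(X_test)
    results[test_size] = {
        "r2": float(r2_score(y_test, predictions)),
        "rmse": float(np.sqrt(mean_squared_error(y_test, predictions))),
    }
```

Comparing the entries of `results` across split ratios gives a quick check on whether performance is stable as the amount of training data changes.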

3.1. Data Sources

To conduct a comprehensive analysis of crime patterns in Maryland, this study utilizes a diverse array of data sources, which are listed in Table 1. These datasets provide both temporal and regional coverage, encompassing a wide range of crime types, demographic variables, and socio-economic factors. By integrating crime statistics, such as murder, rape, robbery, aggravated assault, break and entry, larceny theft, and motor vehicle theft, from 23 counties in Maryland with additional socio-economic indicators such as unemployment rates, household income, and population and migration trends, this study is able to present a holistic view of crime across the state from 2012 to 2023, based on the availability of variables in those years. Table 1 outlines the primary datasets employed, each chosen for its relevance, reliability, and ability to contribute valuable insights into the complexities of Maryland’s crime landscape.

3.2. Input Variables

This section delineates the criteria for selecting a comprehensive range of input variables, which are then utilized to assess and predict crime rates across various Maryland counties. These variables are segmented into demographic, economic, and crime-specific indicators, each of which is critical for crafting a nuanced analysis of crime dynamics. The variables included here are outlined in Appendix A (Table A1) and were chosen through a systematic review of the literature, which pinpointed those factors frequently associated with crime trends and their predictive power. This selection includes demographic data such as population metrics and economic indicators like median household income and unemployment rates, along with crime-specific rates like murder, rape, and robbery. These are calculated using standardized formulas to ensure consistency and comparability across different geographic regions and time periods.

The criteria for variable selection were anchored on their established relevance in previous research, providing a robust analytical base that leverages proven predictors of crime. This structured approach enhances the scalability and applicability of our analysis across diverse datasets, minimizing manual selection efforts. Future datasets can utilize automated algorithms that select variables based on their statistical significance and predictive validity, streamlining the variable inclusion process.

3.3. Data Collection and Preprocessing

The initial stage of our predictive model development involved an extensive data collection effort that integrated detailed crime types such as murder, rape, robbery, aggravated assault, burglary, larceny theft, and motor vehicle theft. To amplify the model’s predictive power and address the complex socioeconomic dynamics, we included several additional variables:

Socio-economic indicators: Unemployment rates, household income, population below the poverty line, and education levels.
Demographic variables: Migration patterns (domestic and international) and age distribution.
Geographical context: Urban–rural classifications and each area’s proximity to known crime hotspots.

The integration of these variables was designed to capture the complex interrelationships among socioeconomic conditions, demographic trends, and crime rates, facilitating a comprehensive understanding of regional disparities.

Handling missing data: Missing values in continuous variables (e.g., median household income) were addressed through mean imputation, while categorical variables (e.g., urban/rural classifications) were handled using mode imputation. Advanced techniques, such as k-nearest neighbors (KNN) imputation, were considered for variables with higher missing rates to maintain dataset integrity.
Data type correction: Variables representing crime rates and socioeconomic indicators were converted to numeric formats, ensuring compatibility with machine learning algorithms. Errors in data entry, such as non-numeric characters appearing in numeric fields, were identified and corrected.
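
The two cleaning steps above can be sketched with pandas as follows. The column names and values are hypothetical, and the sketch covers only type correction with mean and mode imputation; for variables with higher missing rates, scikit-learn's `KNNImputer` is one option for the KNN approach mentioned above.

```python
import pandas as pd

# Hypothetical raw records with stray characters and missing entries.
raw = pd.DataFrame({
    "median_income": ["52,000", "61000", None, "58000"],
    "unemployment_rate": ["4.1", "n/a", "5.3", "6.0"],
    "area_type": ["urban", None, "rural", "urban"],
})

def clean(df, numeric_cols, categorical_cols):
    """Coerce numeric columns, then impute means (numeric) and modes (categorical)."""
    out = df.copy()
    for col in numeric_cols:
        # Drop thousands separators; entries that still cannot be parsed become NaN.
        text = out[col].astype(str).str.replace(",", "", regex=False)
        out[col] = pd.to_numeric(text, errors="coerce")
        out[col] = out[col].fillna(out[col].mean())
    for col in categorical_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

cleaned = clean(raw,
                numeric_cols=["median_income", "unemployment_rate"],
                categorical_cols=["area_type"])
```

Coercing before imputing matters here: an entry such as "n/a" must first become a missing value, otherwise it would block the mean calculation.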

A rigorous preprocessing pipeline ensured data quality and readiness for modeling:

Normalization and scaling: To address the varying scales of features, Min-Max scaling was applied, thereby rescaling numerical variables to a uniform range between 0 and 1. This step ensured that no single variable disproportionately influenced model outcomes, particularly for high-magnitude variables like population and income.
Outlier detection and treatment: Outliers were identified using the Z-score method, flagging values beyond three standard deviations from the mean. Domain knowledge was applied to determine appropriate actions, such as capping extreme values or removing anomalies that could distort crime trends.
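
These two steps can be sketched in a few lines of NumPy. The example flags values more than three standard deviations from the mean, caps them at that boundary, and then rescales to the range 0 to 1. Capping is only one of the treatments mentioned above, and the sample values are hypothetical.

```python
import numpy as np

def zscore_outliers(x, threshold=3.0):
    """Boolean mask of values more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    std = x.std()
    if std == 0:
        return np.zeros(x.shape, dtype=bool)
    return np.abs((x - x.mean()) / std) > threshold

def cap_outliers(x, threshold=3.0):
    """Clip values to mean +/- threshold * std (one possible treatment)."""
    x = np.asarray(x, dtype=float)
    mean, std = x.mean(), x.std()
    return np.clip(x, mean - threshold * std, mean + threshold * std)

def min_max_scale(x):
    """Rescale values to [0, 1]; a constant column maps to zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    if span == 0:
        return np.zeros_like(x)
    return (x - x.min()) / span

# Hypothetical feature: twenty typical values and one extreme entry.
values = np.concatenate([np.arange(90, 110), [10_000]]).astype(float)
flags = zscore_outliers(values)
capped = cap_outliers(values)
scaled = min_max_scale(capped)
```

Treating outliers before scaling keeps a single extreme value from compressing all other observations toward zero.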

Advanced feature engineering techniques were employed to enhance the dataset’s predictive capacity:

Interaction terms: Relationships between key variables (e.g., unemployment rates and migration patterns) were modeled through interaction terms to reflect their combined influence on crime rates.
Higher-order features: Polynomial features (up to degree 2) were created for significant predictors such as household income and poverty levels to model non-linear relationships.
Feature selection: Recursive feature elimination (RFE) was utilized to identify impactful predictors, thereby improving model efficiency and interpretability. Importance scores from ensemble models (e.g., random forest and gradient boosting) guided the selection of features most relevant to crime prediction.

This enhanced preprocessing framework, incorporating socioeconomic, demographic, and geographic dimensions, ensures that the dataset is clean, normalized, and tailored to robust predictive modeling. The combination of advanced feature engineering, normalization, and outlier handling aligns the methodology with high-impact journal standards while reflecting the complexities highlighted in the study’s results.

3.4. Data Analysis

The data analysis section of this research utilizes diverse datasets to examine crime trends across Maryland, focusing on incidents and their clearance rates across multiple jurisdictions and time periods. By employing advanced machine learning techniques, such as random forests and clustering methods, alongside time series regression models, the study analyzes the demographic and geographic patterns of crime victimization using data from census and law enforcement sources. Clustering techniques identify patterns in victimization across different demographic factors, such as age, gender, race, and geographical distribution, offering critical insights into the dynamics of crime. The findings from this section illuminate the complex factors contributing to crime trends and their differential impact across various population segments, offering guidance for targeted interventions and informed policy decisions. However, potential biases, such as changes in crime reporting practices during this period, should be considered when interpreting the results. The approach and findings have the potential to inform similar crime analysis efforts in regions with diverse urban and rural contexts.
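
As an illustration of the clustering step, the sketch below groups hypothetical demographic segments by their victimization profiles. The section above does not name a specific clustering algorithm, so the use of k-means, the choice of two clusters, and all values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical victimization rates per 100,000 for six demographic segments:
# columns are [violent crime, property crime].
profiles = np.array([
    [900.0, 3200.0],
    [850.0, 3000.0],
    [950.0, 3400.0],
    [200.0, 1100.0],
    [250.0, 1200.0],
    [180.0, 1000.0],
])

# Scale first so both columns contribute comparably to the distance metric.
scaled_profiles = MinMaxScaler().fit_transform(profiles)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled_profiles)
```

Segments that share a label have similar victimization profiles; the numeric label itself carries no meaning.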

3.4.1. Crime Trends over Time: A Comparative Analysis Across Crime Types

The provided graphs offer an observational view of crime trends across Maryland from 2012 to 2023, highlighting notable changes in various crime types over the years. Figure 2a illustrates the trend for auto theft, showing a relatively stable pattern until a sharp increase in 2023, possibly indicating changes in reporting practices or actual incidents. Figure 2b presents the arson trend, which remained relatively steady through most of these years, followed by a rise in 2023.

Figure 2c, depicting homicide, shows a peak around 2015, with fluctuations in subsequent years and a general decline through 2023. This trend may reflect specific factors or policy changes influencing violent crime in Maryland [51]. Figure 2d tracks burglary reporting, which exhibits a continuous decline over the period and a marked decrease after 2016, reaching one of its lowest recorded levels by 2023. However, data covering further years would be necessary to confirm if this trend is sustained.

Figure 2e, which depicts the trend regarding rape incidents, shows variability over the years, including peaks around 2017 and 2019, followed by a stabilization period. Figure 2f highlights larceny as the most frequent crime type throughout the period, despite a recent decline in incidents, suggesting either an improvement in preventive measures or changes in social conditions. Figure 2g illustrates the trend regarding shootings, which remains relatively stable from 2014 onward, with a slight increase in 2023 that warrants closer examination.

In Figure 2h, robbery maintains a relatively stable trend until 2016, followed by a gradual decline through 2023. Figure 2i shows the trend for assault, which remains consistent for much of the period, but begins a noticeable decline starting in 2023. Finally, Figure 2j provides a comprehensive overview of the total crime trends across categories, with larceny leading the figures in terms of frequency over the years. The graphs collectively show declines across most categories by 2023, with significant reductions observed in the reporting of auto theft, burglary, and assault.

It is essential to interpret these recent declines with caution, as they may reflect variations in reporting practices or data completeness, particularly for more recent years. This observational analysis does not seek to establish causation but instead highlights notable shifts in crime types over time, providing insight into evolving patterns that could inform future research.

While the findings offer valuable insights into Maryland’s crime landscape, they also underscore the need for further analysis, particularly regarding external factors that may have influenced these trends. For instance, research by the Maryland Public Policy Institute [52] highlights a correlation between high poverty rates and elevated crime levels in certain areas, suggesting that socio-economic disparities are crucial to understanding crime trends. Additionally, a report by the Council of State Governments Justice Center (2024) [51] emphasizes that policy changes and legislative reforms can significantly impact crime patterns, demonstrating the influence of external factors on local crime rates. These limitations underscore the need for a nuanced approach to interpreting trends since data completeness, regional influences, and socio-economic and policy-related factors may significantly impact the observed patterns.

3.4.2. Crime Trends, Shown by Victim’s Racial Group, over the Years

This section examines the trends in various crime types across different racial groups over the years, specifically focusing on burglary, larceny, robbery, assault, shooting, homicide, arson, rape, and auto theft (see Figure 3). The analysis reveals notable differences in victimization rates among racial groups, suggesting underlying systemic and socio-economic factors that contribute to these disparities.

For burglary (Figure 3a), both White and Black/African American victims experienced the highest counts, with a significant decline observed in both groups, particularly from 2021 onward. Between 2018 and 2023, burglary rates among Black/African American victims decreased by approximately 48%, while the number of White victims saw a reduction of 47%. This decline aligns with broader crime reduction trends but could also reflect shifts in reporting practices or changes in community interventions.

For larceny (Figure 3b) and robbery (Figure 3c), the trends mirror each other, with both crimes showing higher victimization rates among White and Black/African American populations, while other racial groups have consistently lower numbers. Larceny incidents among Black/African American victims remained relatively stable, with a slight increase of 0.4%, while White victims experienced a decline of 12%. Robbery rates decreased by 22% for Black/African American victims and 29% for White victims, reflecting consistent declines across both groups.

Assault (Figure 3d) shows similar trends, with White and Black/African American victims experiencing the highest rates, followed by a gradual decline across all racial groups. Between 2018 and 2023, assault incidents rose by 2.5% for White victims and 6.8% for Black/African American victims, indicating potential differences in reporting practices or the underlying factors driving these increases. Shooting (Figure 3e) and homicide (Figure 3f) incidents predominantly affected Black/African American and White populations. Shooting incidents among Black/African American victims decreased by approximately 11%, while the number of White victims exhibited a smaller decline of 10%. Homicide rates declined by 19% for Black/African American victims but increased by 44% for White victims, suggesting divergent trends that warrant further investigation.

Arson, while less frequent overall (Figure 3g), shows significant variation: Black/African American victims saw a modest increase of 11%, while White victims experienced a sharp rise of 533%, potentially due to localized events or anomalies in the dataset. The rape data (Figure 3h) revealed that White and Black/African American victims were the most affected, with White victims consistently experiencing higher rates. However, there was a decline in rape cases for all racial groups by 2023, with Black/African American victims showing a reduction of 41% and White victims a drop of 48%. Lastly, auto theft (Figure 3i) displays a significant rise, with Black/African American victims experiencing an increase of 130% and White victims seeing a dramatic rise of 174%.

Our findings highlight systemic and structural factors that may contribute to the observed disparities in victimization rates across racial groups. Previous research suggests that poverty, neighborhood characteristics, access to resources, and differential policing practices play critical roles in shaping these patterns [51,52,53,54]. Additionally, racial disparities in crime reporting and enforcement may further exacerbate these trends, underscoring the importance of considering potential biases in the data.

The extreme increase in arson rates for White victims suggests the need for further investigation into any localized factors driving this anomaly. Similarly, the sharp rise in auto theft for both Black/African American and White victims highlights the potential influence of broader social or economic conditions, such as rising unemployment or shifts in law enforcement priorities.
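The sensitivity of percentage changes to a small baseline can be illustrated with a short calculation. The sketch below is not taken from the study's code, and the counts are hypothetical values chosen only to show how the same absolute increase produces very different percentages.

```python
def percent_change(start_count: float, end_count: float) -> float:
    """Relative change from start_count to end_count, in percent."""
    if start_count == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (end_count - start_count) / start_count * 100.0


# Hypothetical counts (not from the dataset): 16 additional incidents in both cases.
small_base = percent_change(3, 19)        # baseline of 3 incidents
large_base = percent_change(3000, 3016)   # baseline of 3000 incidents

print(f"small baseline: {small_base:.1f}%")
print(f"large baseline: {large_base:.2f}%")
```

Reporting the underlying counts alongside such percentages would make it easier to judge whether a large relative change reflects a substantive shift or a low starting count.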

3.4.3. Crime Trends, Grouped by Victim’s Gender, Across Crime Types

Figure 4 provides a comprehensive analysis of crime victimization trends by gender, offering valuable insights into the distribution and dynamics of various crime types. Overall, males are more frequently victimized in violent crimes such as assault, shooting, and homicide, while property crimes like larceny and burglary display a more balanced distribution between genders. In contrast, rape remains predominantly a gendered crime, with female victims vastly outnumbering males.

Violent crimes consistently highlight the disproportionate victimization of males. In assault (Figure 4b), male victims outnumber females across the timeline, though both genders experienced a gradual decline in victim counts after 2020, suggesting improvements in crime prevention or intervention efforts. Similarly, shooting incidents (Figure 4c) reveal a marked dominance of male victimization, peaking between 2015 and 2018 before declining steadily for both genders. This pattern reflects systemic factors such as urban violence, exposure to firearm-related risks, and societal norms that place males at higher risk [52,54]. Homicide data (Figure 4d) further emphasize this disparity, with male victim counts consistently surpassing those for females. The persistence of higher male victimization in these crimes aligns with studies showing the association between masculinity, risky behavior, and increased exposure to violence [55].

Property crimes such as larceny and burglary exhibit different patterns. In larceny (Figure 4e), victimization is relatively balanced between genders, showing fluctuations but generally maintaining a stable trend over the years. In contrast, burglary (Figure 4f) shows a decline in victim counts for both genders after 2021, which likely reflects the impact of improved community-based interventions, enhanced security measures, and effective law enforcement strategies targeting property crimes [56]. Auto theft (Figure 4i), on the other hand, experienced a peak in 2022 for both genders, followed by a significant decline in 2023. This drop may be attributed to advancements in vehicle anti-theft technologies and the growing role of digital tools in crime prevention [57].

Rape (Figure 4g) stands out as a profoundly gendered crime. Female victimization significantly exceeds male victimization throughout the timeline, with female victim counts peaking around 2020 before gradually declining. Male victimization remains minimal in this category, underscoring the stark gender-specific nature of sexual violence. These trends highlight the critical need for targeted policies addressing sexual violence and support systems for survivors [58].

Arson (Figure 4j), although reported less frequently overall, shows an even distribution of victimization between genders. Both genders display a declining trend, particularly over the last two years, reflecting broader reductions in arson incidents. This decline may indicate the success of fire prevention strategies and community awareness [59].

The overall trends suggest that violent crimes such as assault, shootings, and homicide disproportionately impact males, reflecting systemic risks and societal patterns. In contrast, property crimes like larceny and burglary exhibit more balanced gender distributions, suggesting shared vulnerabilities. The gendered nature of rape underscores the critical need for gender-specific strategies to address and prevent sexual violence. The sharp declines observed in 2023 across most crime types may point to changes in crime reporting practices, enhanced law enforcement efforts, or broader societal shifts, such as post-pandemic recovery and community resilience initiatives. However, these patterns must be interpreted cautiously, considering potential biases in data reporting and collection.

These findings emphasize the need for targeted interventions tailored to the unique vulnerabilities of each gender. Male-focused programs addressing violence and systemic risk factors, alongside female-centered initiatives for sexual violence prevention and support, are critical to reducing these disparities. Strengthening community-based programs and enhancing data collection methods to capture intersectional insights—such as the interplay of gender with race and socio-economic status—would provide a more nuanced understanding of these dynamics. Additionally, future research should employ longitudinal and mixed-method approaches to uncover the causal factors behind these trends and inform equitable and effective crime prevention strategies.

3.4.4. Crime Trends According to the Victim’s Age Group Across Crime Types

The analysis of crime victimization trends across age groups, as illustrated in Figure 5, reveals notable patterns that provide insights into the dynamics of various crimes. Individuals aged 30–40 and 40–50 experience the highest levels of victimization for many crime types, including aggravated assault, larceny, and auto theft. These findings suggest that individuals in their prime working years are particularly vulnerable to these types of crimes, due to their active lifestyles, frequent interactions with public spaces, and higher rates of economic activity.

Violent crimes such as aggravated assault, shooting, and homicide disproportionately affect younger adults aged 20–40. Aggravated assault (Figure 5b) demonstrates a sharp concentration of victims in this age group, tapering off among older populations. This pattern reflects the increased exposure of younger individuals to high-risk environments and conflict-prone situations, such as urban centers or workplaces with elevated stress levels [53]. Similarly, shootings (Figure 5f) and homicides (Figure 5e) follow this trend, with victimization peaking for individuals in their thirties. These crimes are often linked to systemic issues like poverty, social inequality, and gang-related activities, which disproportionately impact younger populations [60,61]. Rape (Figure 5h) exhibits a distinct pattern, with victimization heavily concentrated among individuals aged 20–40. The sharp decline in cases for those aged 60 and above reflects reduced exposure to social environments where such crimes are more likely to occur and age-related lifestyle changes [57].

Property crimes like larceny, auto theft, and burglary show broader age distributions but exhibit distinct trends. Larceny (Figure 5a) emerges as one of the most widespread crimes, affecting individuals across all age groups. Victimization peaks in the 30–40 age range but remains significant among older adults aged 60–70, likely due to the targeting of elderly individuals for theft or fraud [55]. Auto theft (Figure 5i) similarly peaks for those aged 30–50, reflecting higher rates of vehicle ownership and usage in these groups. The decline in auto theft victimization after age 60 may correspond to reduced vehicle ownership among older adults [57]. Burglary (Figure 5d) disproportionately impacts individuals aged 30–60, with notable victimization seen in the 50–70 age range. This trend suggests that the homes of older individuals may be targeted due to perceived vulnerabilities, such as lower levels of home security or physical limitations.

Arson (Figure 5c), while a less frequent crime overall, shows a relatively even distribution of victims across age groups. Individuals aged 30–40 and 40–50 experience slightly higher rates, which is consistent with the general trend of these age groups being more affected by various crimes. This even distribution reflects the sporadic nature of arson incidents and their dependence on factors unrelated to specific age groups.

The data collectively suggest that victimization risks vary significantly across age groups, with younger adults (20–40) being disproportionately affected by violent crimes such as aggravated assault, rape, shooting, and homicide. Conversely, older adults (60 and above) face a higher risk of property-related crimes, including larceny and burglary. These trends highlight the need for age-specific crime prevention strategies. For younger populations, prevention efforts should focus on reducing their exposure to environments that foster violent crime, such as urban areas with high crime rates or conflict-prone settings. For older individuals, initiatives should emphasize protection against property crimes, including enhanced home security systems, fraud awareness programs, and community-based support networks.

By analyzing crime trends through the lens of victim demographics, seasonal variations, and policy influences, this study underscores the complex interplay of systemic, socio-economic, and environmental factors shaping crime dynamics. The findings offer critical guidance for developing targeted interventions and data-driven strategies to reduce crime.

3.4.5. Clustering Analysis of Crime Trends Across Maryland Counties

Understanding regional crime dynamics is a critical aspect of criminology and public policy. This study examines crime rate trends for robbery, murder, and rape across Maryland counties using the K-means clustering approach to uncover patterns over time and across geographic areas. The choice of the K-means technique was motivated by its effectiveness in identifying distinct groups within large datasets, facilitating the analysis of geographic and temporal patterns in crime data.

Data preprocessing: Prior to clustering, the crime rate data were standardized using StandardScaler (scikit-learn 1.6.1). This step ensured that each variable contributed equally to the distance measure used in the K-means algorithm, which is essential for accurate clustering.

Dimensionality reduction: To manage the high-dimensional nature of crime data effectively, principal component analysis (PCA) was performed exclusively on the training data before clustering. This strategic application ensures that the test data remain untouched, preventing data leakage and preserving the integrity of the evaluation process. By reducing dimensionality while maintaining the significant variance, PCA enhances the performance of the K-means algorithm and ensures more accurate cluster formation.
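The preprocessing order described above (standardize, then reduce dimensionality, fitting both steps on the training portion only) can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code; the array shapes, the 80/20 split, the two retained components, and the random seeds are all assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a crime-rate table: 200 rows, 6 columns on different scales.
X = rng.normal(loc=50.0, scale=[5, 20, 1, 10, 3, 40], size=(200, 6))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Fit the scaler and PCA on the training rows only, so that no statistic
# of the test rows influences the transformation.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=2).fit(scaler.transform(X_train))

# Apply the already-fitted transformations to both portions.
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))

print("explained variance ratio:", pca.explained_variance_ratio_)
```

The key design point is that `fit` is only ever called on `X_train`; the test rows pass through `transform` alone.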

Determination of optimal clusters (k): The selection of the optimal number of clusters (k) was guided by both the elbow method and silhouette score analysis. The elbow method involves plotting the within-cluster sum of squares (distortion) against a range of k-values from 1 to 10. The resulting plot indicated an elbow at k = 5, with only marginal reductions in within-cluster variance beyond this point. Additionally, the silhouette score, which assesses cluster cohesion and separation, was calculated for the same range of k-values. Although the highest score was at k = 2, k = 5 provided a reasonable balance between well-defined clusters and the granularity necessary for practical application to regional crime analysis. This informed our choice of k = 5, allowing for detailed yet meaningful groupings within the data.
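A possible way to compute the two diagnostics described above is sketched below. It uses synthetic blob data rather than the county crime rates, and the sample size, feature count, `n_init`, and seeds are assumptions. Note that the silhouette score is only defined for at least two clusters, so the sketch computes it for k = 2 to 10.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for standardized county crime rates.
X, _ = make_blobs(n_samples=300, centers=5, n_features=3, random_state=0)
X = StandardScaler().fit_transform(X)

inertias = {}      # within-cluster sum of squares, used for the elbow plot
silhouettes = {}   # mean silhouette score per k

for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    if k >= 2:  # silhouette is undefined for a single cluster
        silhouettes[k] = silhouette_score(X, km.labels_)

best_k_by_silhouette = max(silhouettes, key=silhouettes.get)
print("inertia by k:", {k: round(v, 1) for k, v in inertias.items()})
print("k with highest silhouette:", best_k_by_silhouette)
```

Plotting `inertias` against k gives the elbow curve; as in the text, the k with the highest silhouette score need not coincide with the k chosen for interpretability.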

The K-means clustering analysis revealed five distinct groups of counties exhibiting similar crime trends; a summary is provided in Table 2. Time series plots display these trends for each crime type within the clusters, providing a comprehensive view that supports targeted public safety strategies. These findings are detailed through visualizations and are summarized in the corresponding tables.

Figure 6a–o illustrates the time series analysis of robbery, murder, and rape rates across the clusters. Each figure provides detailed trends, aiding the interpretation of clustering outcomes.

3.4.6. Contextual Analysis of Crime Rates Across Maryland Counties: Total and Average Crime Comparisons

This section explores the variations in crime rates across Maryland counties, with a focus on specific crime types such as aggravated assault, robbery, rape, murder, motor vehicle theft, and larceny theft. The analysis uses normalized crime rates for each county, providing a comprehensive overview of crime distribution patterns. Choropleth maps are utilized to visualize these rates, highlighting regional disparities and offering insights into crime dynamics across the state. General findings shown in Table 3 include:

Urban centers: Higher rates of assault, robbery, and motor vehicle theft are concentrated in more urbanized regions such as Baltimore City and Prince George’s County.
Rural counties: Generally, these have lower rates across most crime types, with some exceptions like motor vehicle theft in specific counties.
Variability: Notable variability in crime rates like rape is seen across different counties, reflecting the diverse regional dynamics.

Figure 7 offers a series of maps that provide a comprehensive visual representation of the variation in crime rates across Maryland counties. Each map highlights significant geographic disparities in specific crime types, offering insights into their distribution throughout the state. The first map (Figure 7a) focuses on aggravated assault rates, revealing how this crime is more prevalent in certain counties and illustrating its uneven spread across Maryland. The second map (Figure 7b) details the variation in robbery rates, showing the areas that are disproportionately affected by this type of crime. Similarly, the third map (Figure 7c) portrays the disparities in rape rates across the state, highlighting regions with particularly high incidences and underscoring the varied landscape of this serious offense. The fourth map (Figure 7d) displays the distribution of murder rates, indicating areas with higher occurrences, which emphasizes the critical nature of targeted interventions in these regions. The fifth map (Figure 7e) shows motor vehicle theft rates, providing a clear view of where this crime is more or less common and reflecting the challenges faced by different counties. Lastly, the sixth map (Figure 7f) examines the rates of larceny theft, further illustrating the notable differences in crime rates from one county to another. Together, these maps serve as a powerful tool for visualizing the geographic spread of crime in Maryland, reflecting the unique challenges faced by different counties and supporting the development of targeted interventions and policies to address specific regional needs.

3.5. Train–Test Split

To evaluate the model’s generalizability and robustness, multiple train–test split ratios (80–20, 75–25, and 85–15) were tested. These splits ensured that the model’s performance could be assessed under varying conditions, balancing the training and testing datasets to reflect real-world scenarios. The 80–20 split consistently yielded the most balanced results, achieving R-squared values exceeding 0.90 across most models. The split provided sufficient data for training while maintaining a robust test set for evaluation. The 75–25 split favored models like neural networks and XGBoost, which demonstrated improved generalization with a slightly larger test set—the increased test size allowed for a more rigorous assessment of these models’ predictive capabilities. Meanwhile, the 85–15 split achieved high R-squared values for certain models (e.g., random forest and gradient boosting), but the reduced test set led to slightly diminished generalization, highlighting the trade-off between training and testing data size. Overall, the results indicate that the 80–20 split offers the best balance for most models, while the 75–25 split may be preferred when prioritizing generalization for complex models like neural networks.

All train–test splits were drawn from the available historical dataset (2012–2023), with no separate future time period held out for forecasting. The model’s performance is, therefore, evaluated on held-out contemporaneous data from this timeframe rather than on any data beyond 2023. In practical terms, the current phase of the study makes predictions only within the period of the observed data; it does not extend to forecasting future crime rates. This approach ensures that evaluation is based on known outcomes. While the methodology is forward-compatible (i.e., it could be applied to predict future crime rates, given appropriate historical training data), implementing such forecasts is beyond the scope of the present study.

3.6. Principal Component Analysis (PCA) for the Crime Rate Index

To enhance the analytical rigor of the study while simplifying the crime data representation, principal component analysis (PCA) was applied after the train–test split so that no data leakage occurred. The process began with the standardization of all crime rate variables using Z-score normalization, a crucial step to ensure equitable contributions to the PCA without undue influence from variables with greater magnitude.

Subsequently, PCA was implemented solely on the training dataset, reducing dimensionality while capturing the most explanatory variance. This analysis unveiled a primary principal component that encapsulated dominant crime trends across various types. The loadings from this component suggested that high-impact crimes such as robbery and aggravated assault were most strongly predictive of regional crime disparities, particularly in urban settings.

To aid in the practical application of these findings, the resultant crime rate index was normalized between 0 and 1. This normalization not only facilitated comparisons across diverse Maryland counties but also enhanced the interpretability and utility of the index for policy-making and strategic law enforcement deployment.
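The construction of such an index can be sketched as follows: standardize, take the first principal component fitted on the training rows, and rescale the scores to the unit interval. This is an illustrative reconstruction under stated assumptions, not the study's implementation; the synthetic data, the sign convention for the component, and the clipping of test scores to [0, 1] are choices made for the sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)


def synthetic_rates(n_rows: int) -> np.ndarray:
    """Hypothetical crime-rate table: one shared factor plus column noise."""
    shared = rng.normal(size=(n_rows, 1))
    return shared + 0.3 * rng.normal(size=(n_rows, 4))


X_train, X_test = synthetic_rates(120), synthetic_rates(30)

# Z-score standardization and PCA, both fitted on the training rows only.
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=1).fit(scaler.transform(X_train))

# PCA signs are arbitrary; orient PC1 so that higher rates give a higher score.
sign = 1.0 if pca.components_[0].sum() >= 0 else -1.0
pc1_train = sign * pca.transform(scaler.transform(X_train))
pc1_test = sign * pca.transform(scaler.transform(X_test))

# Rescale to [0, 1] using the training range; clip test scores that fall outside it.
minmax = MinMaxScaler().fit(pc1_train)
index_train = minmax.transform(pc1_train).ravel()
index_test = np.clip(minmax.transform(pc1_test).ravel(), 0.0, 1.0)

print("PC1 loadings:", np.round(sign * pca.components_[0], 3))
```

Inspecting the loadings printed at the end is one way to see which crime types dominate the index.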

The index has proven to be an efficacious predictor of crime trends, effectively reducing the complexity of the model while maintaining high predictive accuracy. By integrating PCA in this manner, the study offers a robust, streamlined view of crime dynamics, supporting the development of targeted, data-driven public safety initiatives across the state. The careful application of PCA in a scenario constrained to training data, corroborated by methodological transparency, establishes a solid foundation for the predictive model, ensuring its relevance and reliability in real-world applications.

3.7. Model Development

This study employs an expansive array of advanced machine learning models to analyze the complex crime patterns found across Maryland, focusing particularly on their adaptability to non-linear and high-dimensional data. Our selection comprises random forest, gradient boosting, XGBoost, neural networks, the extra trees regressor, support vector regression (SVR), and a stacking regressor, each known for robust performance in varied analytical contexts, which is essential for deriving nuanced insights across different urban and rural settings.

Ensemble Methods: Random Forest and Extra Trees Regressors

Both the random forest method and extra trees regressors utilize multiple decision trees to enhance the model’s predictive accuracy and ensure generalization across different contexts. The random forest method reduces overfitting by constructing each tree from a random subset of data and features, thereby proving effective even in noisy data environments. It also offers valuable insights through feature importance metrics, highlighting key factors influencing crime rates. In contrast, the extra trees regressor builds on the random forest methodology by training each tree on the entire dataset and selecting the split points randomly, which not only increases randomness but also significantly reduces model variance, enhancing the stability and reliability of predictions.
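The sampling difference between the two ensembles is visible in the scikit-learn defaults, as the following sketch shows. The data are synthetic and the tree count and seeds are assumptions; the code is an illustration rather than the configuration used in the study.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression problem standing in for the crime-rate features.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
et = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Random forest resamples rows per tree by default; extra trees uses all rows
# and randomizes the split thresholds instead.
print("random forest bootstrap:", rf.bootstrap)
print("extra trees bootstrap:  ", et.bootstrap)

print("random forest test R^2:", round(rf.score(X_test, y_test), 3))
print("extra trees test R^2:  ", round(et.score(X_test, y_test), 3))

# Impurity-based feature importances, normalized to sum to one.
importances = rf.feature_importances_
print("random forest importances:", importances.round(3))
```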

Boosting Methods: Gradient Boosting and XGBoost

The gradient boosting method and XGBoost implement a sequential approach to decision trees, where each tree incrementally corrects the errors of its predecessors, focusing particularly on challenging cases to enhance the model’s overall accuracy. Gradient boosting is valued for its adaptability across various loss functions and its extensive hyperparameter tuning capabilities, making it particularly effective for intricate crime datasets. XGBoost enhances these features by incorporating advanced regularization to prevent overfitting and system-level optimizations to improve performance, making it exceptionally well-suited for handling structured crime data, along with diverse socioeconomic and demographic features, with high precision.
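The sequential error-correction idea can be made concrete with a hand-written boosting loop for squared-error loss, in which each new tree is fitted to the residuals of the current ensemble. This is a didactic sketch on synthetic data, not the library implementation; the learning rate, tree depth, and number of rounds are assumptions, and the regularization terms that XGBoost adds are not shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_per_round = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                   # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # shrunken correction step
    mse_per_round.append(np.mean((y - prediction) ** 2))

print("training MSE, first vs. last round:",
      round(mse_per_round[0], 4), round(mse_per_round[-1], 4))
```

Because each tree is fitted to what the previous trees still get wrong, the training error shrinks round by round; the learning rate controls how much of each correction is applied.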

Deep Learning: Neural Networks

Neural networks excel when modeling complex and non-linear relationships within large datasets. By employing multiple layers of interconnected neurons, these models uncover intricate patterns and interactions among predictors that might elude traditional models. For crime prediction, the networks adeptly integrate temporal trends, spatial distributions, and demographic data, providing a flexible and powerful tool for revealing subtle dynamics in crime occurrences. Although they offer less interpretability compared to tree-based models, their comprehensive assimilation of varied input variables is invaluable for in-depth crime analysis.

Support Vector Regression (SVR) and the Stacking Regressor

SVR is included for its robust performance in high-dimensional spaces and its ability to model nonlinear relationships using kernel functions, capturing intricate patterns in crime data that may be overlooked by other models. The stacking regressor, which aggregates predictions from several base models like random forest, gradient boosting, and XGBoost under a final estimator, notably enhances overall prediction accuracy by blending diverse model strengths. This meta-modeling strategy is crucial for achieving superior predictive performance by effectively synthesizing various learning algorithms.
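A stacking setup of this kind can be sketched with scikit-learn's `StackingRegressor`. The example below is an assumption-laden illustration: it runs on synthetic data, substitutes an SVR pipeline for XGBoost as the third base model so that only scikit-learn is required, and uses `RidgeCV` as the final estimator, which the text does not specify.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

base_models = [
    ("rf", RandomForestRegressor(n_estimators=100, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),
    # SVR is scale-sensitive, so it is wrapped with a scaler.
    ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))),
]

# The final estimator is trained on out-of-fold predictions of the base models.
stack = StackingRegressor(estimators=base_models, final_estimator=RidgeCV(), cv=5)
stack.fit(X_train, y_train)

stack_r2 = stack.score(X_test, y_test)
print("stacked test R^2:", round(stack_r2, 3))
```

Training the meta-model on out-of-fold predictions (the `cv=5` argument) keeps it from simply rewarding whichever base model overfits the training rows most.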

The deployment of these models aligns perfectly with our objectives, not only to achieve high predictive accuracy but also to ensure that the results are practically interpretable, a vital aspect for supporting informed public safety strategies and policy recommendations. The robustness of these models in varied settings, coupled with their ability to balance computational efficiency with interpretative clarity, makes them ideal for predictive tasks in crime analysis.

Hyperparameter Optimization and Model Integration

Our study employed rigorous hyperparameter tuning using GridSearchCV across multiple machine-learning models to optimize their performance for crime data analysis. Key models used included random forest, gradient boosting, XGBoost, and CatBoost, each tailored to address specific challenges in modeling crime patterns.

For the random forest model, we experimented with a range of n_estimators (100, 200, and 300), max_depth (None, 10, and 20), and min_samples_split (2, 5, and 10), aiming to fine-tune the model’s complexity and enhance its generalization capabilities while avoiding overfitting.

Gradient boosting was optimized by adjusting n_estimators (100, 200, and 300), learning_rate (0.01, 0.05, and 0.1), and max_depth (3, 5, and 7). This setup ensured that each successive tree that was built incrementally improved upon the previous ones, thereby enhancing the model’s accuracy and efficiency.

XGBoost underwent detailed tuning for learning_rate (0.01 and 0.1), max_depth (3, 5, and 7), and subsample (0.8, 0.9, and 1.0), leveraging its advanced regularization to minimize overfitting and maximize performance.

CatBoost was similarly fine-tuned, focusing on depth (4, 6, and 8), iterations (100, 200, and 300), and learning_rate (0.01 and 0.1) to optimize its processing of categorical data and intricate dataset interactions.
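The size of such a search is easy to check before running it. The sketch below takes the random forest grid quoted above, counts the candidate combinations, and then runs `GridSearchCV` on a deliberately reduced grid so that it finishes quickly. The synthetic data, the five-fold setting for the search, the R² scoring choice, and the reduced grid are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Random forest grid as listed in the text.
rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
n_candidates = len(ParameterGrid(rf_grid))
n_fits = n_candidates * 5  # assuming five folds; excludes the final refit
print(f"{n_candidates} candidates x 5 folds = {n_fits} fits")

# Reduced grid (an assumption) so that the sketch runs in a few seconds.
small_grid = {
    "n_estimators": [100],
    "max_depth": [None, 10],
    "min_samples_split": [2, 5],
}
X, y = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)
search = GridSearchCV(
    RandomForestRegressor(random_state=0), small_grid, cv=5, scoring="r2"
)
search.fit(X, y)

best_params = search.best_params_
print("best parameters:", best_params)
print("best mean CV R^2:", round(search.best_score_, 3))
```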

Furthermore, we incorporated advanced ensemble techniques to enhance the model’s predictive accuracy. The voting regressor integrated outputs from various models to stabilize predictions by reducing variance. In contrast, the stacking regressor applied a meta-model to exploit the diverse strengths of base models, significantly boosting the overall model efficacy. These strategic enhancements ensured that each model and ensemble technique not only performed optimally on its own but also contributed to a robust, comprehensive predictive framework. This approach underscores our commitment to employing advanced machine learning to generate actionable insights, thereby influencing policy and enhancing public safety effectively.

3.7.1. Cross-Validation for Robust Model Assessment

A detailed fivefold cross-validation process was rigorously applied to test the effectiveness and generalizability of our models across diverse data subsets. This method not only confirmed the models’ robustness but also ensured their reliability for practical applications. The choice of fivefold cross-validation, specifically, was driven by its ability to provide a balanced assessment, reducing both variance and bias. Each fold significantly contributed to tuning the model parameters and selecting the most robust model, addressing potential data variability and imbalance. This approach maximized the use of the dataset, ensuring that each data point was utilized in both training and validation, which minimized biases in model evaluation and enhanced the findings’ applicability and generalizability.
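The property that every observation is used for validation exactly once can be verified directly. The following sketch uses synthetic data and a gradient boosting model as a stand-in; the shuffling, seeds, and R² scoring are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Count how often each row lands in a validation fold.
val_counts = np.zeros(len(X), dtype=int)
for train_idx, val_idx in cv.split(X):
    val_counts[val_idx] += 1

scores = cross_val_score(
    GradientBoostingRegressor(random_state=0), X, y, cv=cv, scoring="r2"
)
print("per-fold R^2:", scores.round(3))
print("mean and standard deviation:", round(scores.mean(), 3), round(scores.std(), 3))
```

Reporting the spread across folds alongside the mean gives a first indication of how stable a model's performance is.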

3.7.2. Performance Metrics and Feature Engineering

Models were evaluated on R-squared and mean squared error metrics, with most models achieving R-squared values above 0.90, indicating superior predictive power. Enhancements in predictive capability were driven by sophisticated feature engineering techniques, including the creation of interaction terms and recursive feature elimination (RFE) to pinpoint crucial predictors and streamline model inputs.
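For reference, the metrics can be computed as in the short sketch below, where RMSE is obtained as the square root of the mean squared error. The four-point arrays are a toy example, not study data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values for illustration only.
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared residuals
rmse = float(np.sqrt(mse))                 # same units as the target
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot

print(f"MSE = {mse:.3f}, RMSE = {rmse:.3f}, R^2 = {r2:.3f}")
```

Here the squared residuals sum to 1.5, giving an MSE of 0.375, and the total sum of squares around the mean is 29.1875, giving an R² of about 0.949.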

The comprehensive application of these models and methodologies underpins our ability to provide detailed, data-driven insights into crime prevention and policymaking across Maryland.

3.7.3. Reproducibility

To ensure reproducibility, we documented all parameters and configurations used for each model.

Random forest: The model utilized 200 trees with a maximum depth of 20 and a minimum sample split of 5. This balance aims to optimize complexity against the risk of overfitting.

XGBoost: The model was configured to achieve optimal performance with 300 estimators, a learning rate of 0.1, a max depth of 3, and a subsampling rate of 0.8. These settings were refined through an extensive GridSearchCV process involving 405 fits.

Extra trees regressor: The model used 200 trees and a maximum depth of 10. It achieved an R² score of 0.93, reflecting its capability to manage high-dimensional and complex datasets.

Neural network: The network was configured with varying layers and parameters across different scenarios. Notably, in one scenario involving advanced dropout and batch normalization settings, it reached an R² score of 0.88, showcasing its capacity to adapt and model complex interactions effectively.
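For readers working in scikit-learn, the reported tree-ensemble settings could be written down as follows. This is a sketch rather than the authors' code: the `random_state` values are assumptions added for repeatability, and the XGBoost settings are kept as a plain dictionary so that the example does not require the xgboost package. The neural network is omitted because its exact architecture is not given in the text.

```python
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Random forest: 200 trees, maximum depth 20, minimum sample split 5.
random_forest = RandomForestRegressor(
    n_estimators=200,
    max_depth=20,
    min_samples_split=5,
    random_state=42,  # assumption, not reported in the text
)

# Extra trees: 200 trees, maximum depth 10.
extra_trees = ExtraTreesRegressor(
    n_estimators=200,
    max_depth=10,
    random_state=42,  # assumption, not reported in the text
)

# XGBoost settings as reported, stored as keyword arguments.
xgboost_params = {
    "n_estimators": 300,
    "learning_rate": 0.1,
    "max_depth": 3,
    "subsample": 0.8,
}

print(random_forest)
print(extra_trees)
print(xgboost_params)
```

If the xgboost package is installed, the dictionary can be unpacked into `xgboost.XGBRegressor(**xgboost_params)`.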

Comprehensive pseudo-code and algorithmic configurations are provided in https://github.com/bandpey65/Crime-Analysis-.git, accessed on 2 April 2025, allowing other researchers to replicate or build upon our findings effectively.

3.7.4. Scope of Predictions

It is important to clarify that all model development and validation in this study were conducted on historical data up to 2023. The predictions generated by these models are confined to the data within this period, meaning that the models are not yet used to project crime rates beyond the timeframe of the dataset. This design decision ensures that we are evaluating model accuracy against known outcomes (historical crime data), rather than attempting speculative future predictions. While the modeling framework is forward-compatible and could be adapted for true time-series forecasting (using past data to predict future crime rates), such an application lies outside the scope of the current phase. Future work will address this limitation by extending the model to forecast crime trends in upcoming years.

3.8. Model Comparisons and Performances

To robustly evaluate model performance and guard against overfitting, multiple train–test splits were employed. In addition to the conventional 80/20 split, we assessed each model under 70/30, 75/25, and 85/15 train–test splits. This approach provides insight into model robustness across varying training set sizes and data partitions. By comparing performance across these splits, we can identify models that consistently perform well (high average R² and low RMSE) and that exhibit low variance in metrics, indicating stable generalization. Table 4 summarizes the mean R² and RMSE achieved by each model across the four splits, along with the variance of each metric as an indicator of stability (lower variance indicates more consistent performance across the different data splits).
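The two metrics and the per-model summary across splits can be sketched as follows. This is a minimal illustration using the standard definitions of R² and RMSE; the scores in `toy_r2` are placeholders rather than values from the paper, and the use of the population variance is an assumption, since the text does not state which variance estimator was used.

```python
import math
from statistics import mean, pvariance


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def summarize_across_splits(scores_by_split):
    """Mean and population variance of one metric over the split ratios."""
    values = list(scores_by_split.values())
    return mean(values), pvariance(values)


# Placeholder R² scores keyed by split ratio (illustrative only).
toy_r2 = {"70/30": 0.80, "75/25": 0.82, "80/20": 0.84, "85/15": 0.86}
avg_r2, var_r2 = summarize_across_splits(toy_r2)
```

A model with a high mean and a small variance under this summary is one whose accuracy does not depend strongly on how the data were partitioned.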

As shown in Table 5 and Figure 8, the linear models (ordinary linear regression and its regularized variants) yielded comparatively poor performance. Their average R² scores remain modest (only 0.45–0.58) with relatively high RMSE values (on the order of 13–17 in error). In particular, lasso regression underperformed the most noticeably, achieving an R² of just 0.45—markedly lower than the ridge model or OLS—and an RMSE of around 17.5, indicating substantial prediction errors. The lasso model’s aggressive feature selection (driving many coefficients to zero) likely led to underfitting; even with its best-tuned regularization parameter (α ≈ 0.1, see Table 6), it failed to capture enough of the variance in the data. In contrast, ridge regression (with a moderate α ≈ 0.5) retained more predictive features and attained slightly higher accuracy (R² ≈ 0.58), although it, too, fell short of the more complex models’ performance figures. Overall, the linear approaches struggled to model the complex relationships in the dataset, as evidenced by their lower R² and higher RMSE values.

The tree-based ensemble models and other advanced regressors dramatically outperformed the linear models, achieving both a better fit and lower variability across splits. For instance, the random forest model reached an average R² of about 0.85, with an RMSE near 8.4, a substantial improvement over any linear model. The gradient boosting and XGBoost models likewise showed strong performance (R² ≈ 0.80–0.83, RMSE 9–10), although the CatBoost algorithm slightly edged them out (R² ≈ 0.84). Extra trees (an ensemble of extremely randomized trees) performed comparably to the random forest model (R² ≈ 0.82). Notably, these ensemble approaches not only achieved higher predictive accuracy but also exhibited lower variance in R² and RMSE across the different data splits. For example, the random forest model’s R² variance across the four splits was only about 0.003, compared to 0.012 for the lasso model. This suggests that the random forest model’s performance was consistently high and less sensitive to which subset of data was used for training, whereas the lasso model’s results fluctuated more, an indication that the linear model was less robust. Among non-tree models, the support vector regression (SVR with an RBF kernel) also yielded better results than linear regression (R² ≈ 0.75; RMSE ≈ 12.3), although it did not reach the accuracy levels of the ensemble tree models. The SVR’s performance variance was moderate, reflecting some sensitivity to data splits (likely due to the need to tune the kernel parameters for different data subsets). Finally, the stacking ensemble proved to be the top performer: by combining multiple algorithms (in our case, random forest, XGBoost, and SVR as the base learners, with a ridge regression meta-learner), the stacking regressor achieved the highest overall R² (0.88) and the lowest RMSE (≈ 7.0) among all the tested models. This stacked model effectively leveraged the complementary strengths of its constituents and maintained a very low variance in performance (R² variance ≈ 0.002), indicating excellent stability across the various train–test splits.

To ensure that these complex models did not overfit, a rigorous cross-validation and hyperparameter tuning procedure was employed. For each model, we performed an extensive grid search over key hyperparameters, using k-fold cross-validation (typically k = 10) on the training data of each split. This means that model configurations were chosen based on their average validation performance on k-folds, rather than just on training performance, which guards against selecting an overly complex model that performs well on training data but poorly on unseen data. For example, the best random forest model was found to have a maximum tree depth of 10 and around 100 trees (estimators)—a configuration that balances model complexity and generalization. Deeper or unbounded trees could memorize the training data, but cross-validation revealed that a depth ≈ 10 was optimal, likely because this prevents overfitting the smaller training sets. Similarly, the gradient boosting models (XGBoost, CatBoost, etc.) were tuned with a learning rate of ~0.1 and moderate tree depths (3–6), with early stopping rounds or regularization applied to curb overfitting (see Table 6 for details). The SVR model required tuning of the kernel hyperparameters (e.g., using an RBF kernel with C ~10 and γ ~0.1 was found to be best). Each model’s chosen hyperparameters, along with its resulting test performance, are detailed in Table 5 and Table 6. By selecting model settings based on the cross-validation performance, we ensured that each algorithm’s capacity was appropriately constrained. This is reflected in the relatively low variance of the test scores across different splits—the models tuned in this manner maintained a stable performance, which is a strong indication that overfitting was minimized.
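The fold mechanics behind this selection procedure can be sketched in a few lines. The example below is an assumption-laden illustration, not the authors' code: it uses contiguous folds without shuffling, and `evaluate` is a hypothetical caller-supplied function that fits a candidate configuration on the training indices and returns its score on the validation indices.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def mean_cv_score(evaluate, n_samples, k=10):
    """Average a validation score over k folds.

    `evaluate(train_idx, val_idx)` is a hypothetical callback that fits on
    the training indices and scores on the held-out validation indices.
    """
    scores = [evaluate(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# Toy check: every sample lands in exactly one validation fold.
folds = list(k_fold_indices(23, 10))
held_out = sorted(i for _, va in folds for i in va)
```

The configuration with the best `mean_cv_score` is the one retained, so a setting that only excels on the data it was fitted to is not rewarded.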

To further assess the generalizability and robustness of each model, we analyzed the variation in the root mean square error (RMSE) across different train–test split ratios (80–20, 75–25, 70–30, and 85–15). As shown in Figure 9, linear models such as linear regression and ridge regression exhibited relatively higher RMSE values across all splits, with noticeable increases at the 85–15 split, suggesting limited adaptability to the reduced training data. In contrast, tree-based ensemble models, particularly the extra trees model and CatBoost, consistently achieved the lowest RMSE values with minimal fluctuation across all splits, indicating strong resistance to overfitting and excellent predictive stability. The stacking regressor and gradient boosting also demonstrated robust performance, maintaining low RMSEs with limited sensitivity to changes in data partitioning. Lasso regression, on the other hand, performed poorly across all configurations, further supporting its unsuitability for modeling the non-linear, high-dimensional structure of the dataset. These results validate the finding that our top-performing models not only offer high predictive accuracy but also sustain performance across varied data availability conditions, reinforcing their reliability for real-world crime prediction tasks.

In addition to the tree-based and ensemble methods, several neural network architectures were evaluated to gauge their performance and generalizability. The neural network models were tailored to the crime data and combined several techniques to enhance performance. Each architecture was multilayered, with dense layers paired with dropout regularization and batch normalization to manage complexity and limit overfitting. Notably, the EarlyStopping and ReduceLROnPlateau callbacks played a critical role in our training strategy. EarlyStopping halted training once the model ceased showing performance improvements, thereby preventing overtraining, while ReduceLROnPlateau dynamically reduced the learning rate when training progress stalled, refining the model’s learning phase and leading to improved predictive accuracy and efficiency. Table 7 summarizes the results of the different neural network configurations.
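The interplay of the two callbacks can be illustrated with a simplified, framework-free simulation over a sequence of validation losses. This is a sketch of the general idea only, not the callbacks' actual implementation (which offers further options such as cooldown periods); the patience values, reduction factor, and loss sequence are assumptions, as the text does not report them.

```python
def simulate_training(val_losses, lr=0.01, stop_patience=4,
                      lr_patience=2, lr_factor=0.5):
    """Return (stop_epoch, final_lr, best_loss) for a validation-loss trace."""
    best = float("inf")
    epochs_since_best = 0      # drives early stopping
    epochs_since_lr_drop = 0   # drives learning-rate reduction
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_lr_drop = 0
            continue
        epochs_since_best += 1
        epochs_since_lr_drop += 1
        if epochs_since_lr_drop >= lr_patience:
            lr *= lr_factor            # plateau detected: shrink the step size
            epochs_since_lr_drop = 0
        if epochs_since_best >= stop_patience:
            return epoch, lr, best     # no improvement for too long: stop
    return len(val_losses), lr, best


# Illustrative loss trace: improves for three epochs, then stalls.
trace = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75]
stop_epoch, final_lr, best_loss = simulate_training(trace)
```

With this trace, the learning rate is halved twice during the plateau before training stops, which mirrors the intended division of labor: learning-rate reduction is tried first, and stopping follows only if the loss still fails to improve.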

Based on these robust performance metrics and comprehensive validation checks, our study identifies the stacking regressor as the chosen model. To support this choice, we examined the statistical reliability and robustness of the stacking regressor by conducting several diagnostic tests to verify its adherence to key regression assumptions. The Rainbow test showed no evidence of nonlinearity, with a non-significant p-value of 0.298. Similarly, the Durbin–Watson statistic of approximately 2.07 indicated no significant autocorrelation among the residuals. Although the Breusch–Pagan test did reveal some heteroscedasticity, with a p-value of 0.005, this did not significantly detract from the model’s validity, considering its strong cross-validated performance. Additionally, the Spearman correlation analysis ruled out any extreme multicollinearity among predictors, thereby affirming the model’s stability and interpretability. Furthermore, to ensure that our models were not overly complex or specifically tailored to our dataset, we implemented safeguards such as pruning techniques for tree-based models, regularization parameters in linear models, and early stopping in gradient boosting. These measures help prevent overfitting while maintaining model accuracy and generalizability. The results indicate that the stacking regressor, chosen for its high accuracy, broadly meets the essential assumptions for reliable regression analysis and is, thus, well-suited for further deployment in predictive tasks.
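For reference, the Durbin–Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals; it ranges from 0 to 4, and values near 2 (such as the 2.07 reported here) indicate no first-order autocorrelation. A minimal sketch, with toy residuals that are not taken from the study:

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Toy residual series (illustrative only).
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
constant = [1.0, 1.0, 1.0, 1.0]        # strong positive autocorrelation
dw_alternating = durbin_watson(alternating)
dw_constant = durbin_watson(constant)
```

The two toy series sit at opposite ends of the scale, which is why a value close to 2 is read as the absence of serial dependence in the residuals.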

In summary, our comprehensive evaluation across multiple train–test splits underscores the comparative robustness and superior performance of advanced predictive models over their linear counterparts. The linear models, while straightforward and interpretable, consistently demonstrated lower predictive capabilities, affirming the presence of significant nonlinear complexities within our dataset that these models fail to address. Conversely, tree-based ensembles and gradient boosting methods have not only achieved marked improvements in accuracy—surpassing linear models by a margin of 0.25 to 0.40 in R² scores—but have also maintained low variance in these metrics across various splits, highlighting their strong generalizability.

In particular, the stacking ensemble has distinguished itself as the most effective model in terms of raw accuracy. This model’s success is attributed not only to its high performance but also to its consistent results across different dataset partitions, which speaks to its robustness against overfitting—a potential concern for complex models. This robustness was ensured through rigorous cross-validation techniques that effectively identified and corrected those models that were overly fitted to specific segments of data.

The application of these models to contemporary crime datasets has proven highly promising, setting a solid foundation for future predictive tasks. This aligns with our study’s goals of utilizing advanced analytical techniques to enhance crime forecasting and prevention strategies. As we move forward, the stacking model, in particular, will serve as a cornerstone for ongoing research and application in this field, promising not only theoretical insights but also practical benefits in terms of public safety and policy formulation.

4. Discussion

This study integrates advanced machine learning techniques with spatial and temporal crime analysis, offering a nuanced understanding of crime dynamics across Maryland counties. By examining crime rates through clustering, predictive modeling, and socioeconomic correlations, this research provides actionable insights into regional crime trends and their underlying determinants. The key findings will be discussed in the following sections.

Urban and Rural Crime Dynamics: Urban centers, particularly Baltimore City and Prince George’s County, consistently exhibited higher rates of violent crimes such as aggravated assault, robbery, and murder. These findings highlight the socioeconomic challenges faced by urban areas, including concentrated poverty, unemployment, and limited access to essential resources. These dynamics align with the findings of the existing literature, which emphasizes the role of structural inequities in perpetuating urban crime [57]. Addressing these challenges requires sustained, multifaceted interventions such as enhanced policing, economic revitalization, and community support programs.

In contrast, rural counties display sporadic spikes in crime rates, which are often linked to localized factors such as economic stress, community dynamics, and limited law enforcement capacity. These findings emphasize the limitations of uniform crime prevention strategies, suggesting the need for adaptive, community-specific approaches tailored to rural areas [61].

Clustering Analysis and Regional Insights: The clustering analysis revealed distinct regional crime trends and provided valuable insights for targeted interventions:

Cluster 0 counties (e.g., Anne Arundel and Montgomery): These counties exhibited gradual declines in robbery rates but saw recent increases in rape and murder rates. This variability indicates a shifting landscape of crime that necessitates a re-evaluation of resource allocation and prevention strategies.
Cluster 2 (Baltimore City): Persistently high crime rates across all categories reflect the structural and systemic challenges faced by metropolitan regions. These findings reinforce the need for long-term, integrative policies addressing socioeconomic inequities and systemic vulnerabilities.
Other clusters: Rural counties, such as those in Cluster 4, demonstrated lower overall crime rates but showed periodic spikes in specific categories, such as motor vehicle theft. These patterns highlight the importance of regional and localized interventions.

The visualization of these clusters using choropleth maps provides policymakers with an intuitive and granular understanding of regional crime dynamics, facilitating the evidence-based prioritization of resources and interventions.

Machine Learning and Predictive Accuracy: The study demonstrated the efficacy of ensemble machine learning models, such as random forest and gradient boosting, for predicting crime rates with high accuracy (R-squared > 90%). These models effectively captured the non-linear relationships between socioeconomic factors and crime trends, offering a scalable framework for predictive crime analysis in other regions.

Furthermore, the integration of principal component analysis (PCA) enabled the development of a composite crime index, which streamlined the analysis without sacrificing critical information about individual crime types. This index proved instrumental in terms of cross-county comparisons, enhancing the interpretability of the results and providing a robust tool for identifying high-crime areas.

Enhancements in Predictive Power through Feature Engineering: Recursive feature elimination (RFE) was employed to identify the most impactful predictors from an extensive dataset encompassing socioeconomic, demographic, and crime-specific variables. This method systematically evaluates the contribution of each feature to model performance by iteratively removing the least significant predictors and re-training the model. The importance of the selected features was validated through ensemble models, ensuring robustness in feature selection. The results are shown in Figure 10.
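The elimination loop described above can be sketched generically. In this illustration, `importance_fn` stands in for re-training a model on the current feature subset and returning per-feature importances; the feature names and the fixed scores in `toy_scores` are made up for demonstration and do not reproduce the study's results.

```python
def recursive_feature_elimination(features, importance_fn, n_keep):
    """Drop the least important feature, re-score, repeat until n_keep remain."""
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        scores = importance_fn(remaining)   # stands in for re-fitting the model
        weakest = min(remaining, key=lambda f: scores[f])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


# Hypothetical, fixed importances; a real run would re-fit at every step.
toy_scores = {
    "unemployment_rate": 0.30,
    "poverty_pct": 0.35,
    "median_income": 0.20,
    "year": 0.05,
    "population": 0.10,
}


def toy_importance(subset):
    return {f: toy_scores[f] for f in subset}


kept, dropped = recursive_feature_elimination(toy_scores, toy_importance, n_keep=3)
```

Because importances are recomputed after each removal, a feature whose contribution only becomes visible once a correlated predictor is gone can still survive, which is the main advantage of RFE over a one-shot ranking.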

Key Findings from RFE Application:

Unemployment rate: This is strongly correlated with increased rates of property crimes such as burglary and larceny, particularly in urban counties.
Population below the poverty line: This factor is highlighted as a critical driver for aggravated assault and robbery rates.
Domestic and international migration trends: These trends played a significant role in predicting regional variations in crime spikes, especially in rural clusters.
Median household income: This is directly linked to the overall crime index, showcasing the disparities in socioeconomic conditions across counties.

By prioritizing these variables, RFE application contributed to the development of a streamlined, interpretable model that minimized redundancy while preserving predictive accuracy. By leveraging RFE, this study not only enhanced model interpretability but also facilitated actionable insights for policymakers. For instance, the identification of migration trends as a key predictor underscores the need for localized community support programs in counties experiencing high influxes of residents. Similarly, the strong influence of economic variables emphasizes the importance of integrating socioeconomic revitalization efforts into crime prevention strategies.

5. Conclusions

The methodology employed in this research effectively integrates traditional statistical approaches with advanced machine learning algorithms to develop a robust model for predicting crime rates in Maryland counties. The inclusion of diverse socioeconomic indicators such as unemployment rates, migration patterns, and income levels enhances the model’s ability to capture the intricate relationships between these variables and crime trends. This multi-faceted approach enables a comprehensive understanding of the factors driving crime in both urban and rural settings. The findings offer several actionable insights for policymakers and practitioners.

Age-specific Crime Interventions: Individuals aged 30–50 are disproportionately affected by property crimes, while younger demographics (20–40) face heightened risks of violent crimes, such as aggravated assault, rape, and auto theft. These findings call for targeted prevention strategies, including economic support and job creation programs for younger populations and enhanced property protection measures for older age groups.

Gender and Racial Disparities: This study confirms significant gender and racial disparities in victimization, with males being disproportionately affected by violent crimes like shootings and aggravated assault and with minority groups experiencing higher rates of violent victimization. These findings emphasize the need for tailored interventions, such as strengthening community policing in minority neighborhoods and expanding support services for at-risk populations.

Urban vs. Rural Crime Trends: The distinct crime patterns between urban and rural areas suggest the necessity of differentiated resource allocation. Urban areas like Baltimore City require sustained, long-term interventions to address persistently high violent crime rates, while rural counties benefit more from adaptive, community-specific strategies that address localized crime drivers.

Modeling Performance and Scalability: The neural network model consistently outperformed other machine learning techniques, achieving an R-squared value of over 90%. The ensemble methods further enhanced the model’s predictive accuracy, reaching 0.95 in the 85–15 training–testing split. These results underscore the importance of employing advanced algorithms for high-dimensional, non-linear datasets and demonstrate their applicability in crime analysis across diverse contexts.

This research contributes to the growing field of data-driven crime prevention studies by combining socioeconomic insights with state-of-the-art predictive modeling. By addressing the complexities of crime dynamics in urban and rural settings, the study offers a scalable and adaptable framework for other regions. Future efforts should focus on integrating real-time data sources and expanding the model’s applicability to account for evolving socioeconomic and environmental conditions. These advancements will further enhance the ability of policymakers to develop informed, equitable, and effective crime prevention strategies.

6. Limitations and Future Research Directions

While these predictive models demonstrated strong performance, their broader applicability may be limited by regional differences in socioeconomic and crime patterns. Factors such as local law enforcement practices, cultural norms, and policy variations could influence the generalizability of the findings. Additionally, their reliance on historical data may restrict the models’ adaptability to rapidly changing conditions, such as public health crises or economic disruptions. Building on the findings from this research, future research should focus on:

Incorporating real-time data: Integrating real-time data sources, such as social media analytics, IoT-based surveillance, and mobility data will enable more dynamic and responsive crime prediction models.
Expanding socio-economic variables: Including indicators such as healthcare access, housing affordability, and environmental stressors will improve the model’s explanatory power and applicability to diverse contexts.
Addressing algorithmic bias: Ensuring fairness and equity in predictive models is crucial for developing ethical AI systems that serve all demographic groups effectively.
Scaling to broader geographies: Applying this framework to other regions with varying socioeconomic conditions can validate its adaptability and scalability for global crime prevention efforts.
Incorporating spatial statistical models: To address the limitations noted above regarding spatial correlations, future research will explore advanced spatial statistical models such as spatial autoregressive models (SAR), spatial error models (SEM), and geographically weighted regression (GWR). These models will enable us to account for spatial dependencies and improve the accuracy of crime rate predictions across different counties, offering deeper insights into regional crime dynamics.
Implementing time-series forecasting: Leveraging historical crime data to train models for future time periods will extend the framework’s capabilities, allowing for the proactive anticipation of crime trends beyond the present dataset timeframe. This will enable us to forecast future crime rates and better inform long-term planning and intervention strategies.

This study demonstrates that data-driven methodologies can significantly enhance the understanding and management of crime dynamics. By integrating socioeconomic insights, predictive modeling, and regional clustering, this research provides actionable recommendations for policymakers and practitioners. The approach not only addresses immediate crime prevention needs but also lays the foundation for long-term, equitable, and sustainable strategies that promote safer communities.

7. Disclosure: Crime Reporting Bias and Ethical Considerations

7.1. Data Integrity and Bias Mitigation

Our research utilizes extensive crime and socioeconomic datasets which, like any large-scale data, may inherently include biases. These can stem from the under-reporting of crimes in certain areas, differences in law enforcement practices, or the over-policing of specific demographics. Acknowledging these challenges, we conducted a thorough review of the data collection and reporting methodologies to identify and mitigate such biases wherever possible. This included assessing regional discrepancies in reporting, validating demographic coverage, and excluding highly collinear or non-informative features to prevent skewed model behavior.

7.2. Feature Importance and Systematic Bias Analysis

To quantitatively assess whether specific socioeconomic features disproportionately influenced the model’s output—and to evaluate the potential algorithmic bias—we used SHAP (Shapley additive explanations) analysis on the final, trained model. As shown in Figure 11, variables such as population percentage below the poverty threshold, international migration, and unemployment rate yielded the highest average absolute SHAP values, indicating they exerted the greatest influence on the model’s predictions.
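The ranking criterion used here, the mean absolute SHAP value per feature, is straightforward to compute once per-sample SHAP values are available. The sketch below assumes those values have already been produced by a SHAP explainer; the matrix and the shortened feature names are invented for illustration.

```python
def mean_abs_shap(shap_rows, feature_names):
    """Rank features by mean |SHAP value| over all samples (descending)."""
    n_samples = len(shap_rows)
    totals = [0.0] * len(feature_names)
    for row in shap_rows:
        for j, value in enumerate(row):
            totals[j] += abs(value)
    ranking = {name: total / n_samples
               for name, total in zip(feature_names, totals)}
    return dict(sorted(ranking.items(), key=lambda kv: kv[1], reverse=True))


# Invented SHAP values: one row per sample, one column per feature.
names = ["unemployment_rate", "poverty_pct", "intl_migration"]
toy_shap = [
    [0.5, 2.0, -1.0],
    [-0.5, -4.0, 1.0],
]
importance = mean_abs_shap(toy_shap, names)
```

Taking absolute values before averaging matters: a feature that pushes predictions up for some counties and down for others would otherwise cancel out and appear unimportant.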

While these variables are indeed relevant indicators of socioeconomic status, their prominence necessitates ethical scrutiny. The SHAP analysis helped us detect whether sensitive features like income or migration may dominate predictions in ways that could lead to biased outcomes. This diagnostic approach provides transparency in terms of feature contributions and supports efforts to minimize unintended discriminatory effects.

To further safeguard against systemic bias, we applied fairness considerations during preprocessing, such as feature standardization, exploratory disparity checks across subgroups, and avoiding variables directly linked to protected attributes. Advanced bias mitigation techniques such as bias-aware reweighing or adversarial debiasing were not implemented in this version of the model, but we recognize their importance and plan to incorporate these fairness-aware algorithms in future model iterations, especially in scenarios where predictive models may influence policy or decision-making. Our current SHAP analysis (Figure 11) provides a transparent view of feature contributions and flags potentially sensitive variables for ongoing fairness monitoring.

7.3. Ethical Implications of Model Deployment

The potential ethical implications of deploying our predictive model are profound and complex. In particular, there is a risk that the use of predictive policing tools might reinforce existing societal disparities, such as those based on race, gender, or economic status. To address these issues, we propose several safeguards:

Ethical oversight: The implementation of a continuous ethical review process involving diverse stakeholders is necessary to oversee the deployment and operation of the model.
Transparency: A commitment to transparency about how the model is used and the impact that it has is needed, including regular public reports on its performance and outcomes.
Community Engagement: Involving community representatives in the monitoring and evaluation process is important to ensure that the model’s use aligns with community needs and ethical standards.

While our model provides significant insights and tools for crime prediction, it is crucial that its deployment in real-world scenarios is handled with the utmost responsibility. We are committed to ongoing research into the ethical use of crime prediction technology and to refining our models to ensure that they serve the public good without causing unintended harm.

Author Contributions

The study was initiated by M.S., who also played a crucial role in validating the results and refining the methodology. Z.B. was responsible for developing the methodology, model development, validation, and data analysis, and for generating the results. S.P. contributed to data collection and analysis. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the Center for Multi-Modal Mobility in Urban, Rural, and Tribal Areas (CMMM).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the findings of this study are openly available, in accordance with MDPI’s data-sharing policies. The dataset has been organized into a publicly accessible Google document, which can be shared upon request to ensure the transparency and reproducibility of the research results. Interested researchers may contact the corresponding author to gain access to the data. No new data were created during this study; the analysis is based on pre-existing data that adhere to privacy and ethical standards.

Acknowledgments

This project was funded by the Center for Multi-Modal Mobility in Urban, Rural, and Tribal Areas (CMMM). We are grateful for their support, which was crucial in enabling this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Appendix A

Table A1. List of input variables.

Variable Name	Type of Variable	Value Range	Description
County	String	Single value	Name of the county.
Year	Numeric	2012–2022	Year of data collection.
Population_Rate	Numeric	32.636–623.035	Population rate in the county
Median Household Income	Numeric	29,000–105,000	Median household income of the county.
Uneomployment_Rate	Numeric	2–105	Unemployment rate in the county.
Population	Numeric	32,636–817,761	Total population in the county.
Murder_Rate	Numeric	0–8	Rate of murders reported: Murder_Rate = (Number of Murders)/(Population) × 100,000. This gives the number of incidents per 100,000 residents.
Rape_Rate	Numeric	0–0.00088	Rate of rapes reported: Rape_Rate = (Number of Rapes)/(Population) × 100,000. This gives the number of incidents per 100,000 residents.
Robbery_Rate	Numeric	0.0001–0.002	Rate of robberies reported: Robbery_Rate = (Number of Robberies)/(Population) × 100,000. This gives the number of incidents per 100,000 residents.
Agg..Assault_Rate	Numeric	0–1.67	Rate of aggravated assaults reported: Aggravated_Assault_Rate = (Number of Aggravated Assaults)/(Population) × 100,000. This gives the number of incidents per 100,000 residents.
B…E_Rate	Numeric	0–4.98	Rate of breaking and entering reported: Breaking_And_Entering_Rate = (Number of Breaking and Enterings)/(Population) × 100,000. This gives the number of incidents per 100,000 residents.
Larceny.Theft_Rate	Numeric	0–21.98	Rate of larceny thefts reported: Larceny_Theft_Rate = (Number of Larceny Thefts)/(Population) × 100,000.
M.V.Theft_Rate	Numeric	0–1.92	Rate of motor vehicle thefts reported: Motor_Vehicle_Thefts_Rate = (Number of Motor Vehicle Thefts)/(Population) × 100,000.
Migration_Rate	Numeric	−9475–5451	Net migration rate per year.
Domestic_Migration	Numeric	−466–1202	Net domestic migration rate per year.
International_Migration	Numeric	48–4680	Net international migration rate per year.
Population_Percent_Below_Poverty	Numeric	4.9–23.4%	Percentage of the population living below the poverty line.
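The rate formula shared by the crime variables in Table A1 can be written as a single helper. This is a minimal sketch of the stated formula; the function name and the example counts are illustrative and not taken from the dataset.

```python
def rate_per_100k(incidents, population):
    """Incidents per 100,000 residents: incidents / population * 100,000."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply first so that simple cases stay exact in floating point.
    return incidents * 100_000 / population


# Illustrative numbers: 50 incidents in a county of 200,000 residents.
example_rate = rate_per_100k(50, 200_000)
```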

References

1. Maryland Department of Justice. Statistical Report on Crime in Urban Areas; Maryland Department of Justice: Greenbelt, MD, USA, 2023.
2. Baltimore Police Department. Annual Crime Statistics Report; Baltimore Police Department: Baltimore, MD, USA, 2022.
3. Smith, J.; Chen, Y.; Martinez, K. Challenges in Crime Data Analysis. Crime Data Anal. J. 2021, 34, 112–128.
4. Berk, R.A.; Sorenson, S.B.; Barnes, G. Forecasting Domestic Violence: A Machine Learning Approach to Help Inform Arraignment Decisions. J. Empir. Leg. Stud. 2016, 13, 94–115.
5. Wu, D.Y.; Wang, J. Analyzing Urban Crime Through Street View Imagery: Insights from Urban Micro Built Environment and Perceptions. Urban Sci. J. 2024, 8, 247.
6. Walters, H.; Patel, S. Demographic Influences on Crime Patterns in Maryland. Md. Sociol. Rev. 2024, 28, 401–422.
7. Gomez, C. Socioeconomic Factors and Crime: A Study of Maryland. J. Community Saf. 2023, 22, 88–103.
8. Rosés, R.; Kadar, C.; Malleson, N. A data-driven agent-based simulation to predict crime patterns in an urban environment. Comput. Environ. Urban Syst. J. 2021, 89, 101660.
9. Huq, A. Racial Disparities in the Criminal Justice System: Prevalence, Causes, and a Search for Solutions. Soc. Issues J. 2019, 75, 1139–1164.
10. Davis, J. Urban Crime Trends and Predictions: A Machine Learning Approach. Urban Plan. Secur. 2022, 39, 76–94.
11. De Nadai, M.; Xu, Y.; Letouzé, E.; González, M.C.; Lepri, B. Socio-economic, built environment, and mobility conditions associated with crime: A study of multiple cities. Sci. Rep. J. 2020, 10, 13871.
12. Franklin, B.; Zhao, L. Policing and collective efficacy: A rapid evidence assessment. Int. J. Police Sci. Manag. 2021, 23, 4.
13. Valente, R.; Medina-Ariza, J. Mobility, Nonstationary Density, and Robbery Distribution in the Tourist Metropolis. Eur. J. Crim. Policy Res. 2022, 30, 85–107.
14. Mouratidis, K.; Poortinga, W. Built environment, urban vitality and social cohesion: Do vibrant neighborhoods foster strong communities? Landsc. Urban Plan. J. 2020, 204, 103951.
15. Cernat, A.; Meyers, L. The Quasi-Simplex Model in Crime Data Analysis. J. Quant. Criminol. 2022, 38, 89–107.
16. Little, T. Bayesian Estimation Techniques in Modern Crime Analysis. Stat. Innov. 2013, 24, 120–135.
17. Felson, M.; Eckert, M. Data Validity in Crime Reports: A Methodological Review. Crime Methods 2018, 22, 255–270.
18. Tseloni, A. Regional Crime Trends and Predictive Analysis. J. Crime Sci. 2006, 1, 33–47.
19. Yu, H.; Turner, J.; Foster, K. Enhancing Crime Forecasting with Spatio-Temporal Data. J. Forensic Sci. 2020, 65, 1500–1512.
20. Yu, H.; Liu, L.; Yang, B.; Lan, M. Crime Prediction with Historical Crime and Movement Data of Potential Offenders Using a Spatio-Temporal Cokriging Method. ISPRS Int. J. Geo-Inf. 2020, 9, 732.
21. Deng, L.; Roberts, N.; Jackson, M. Enhancing Predictive Accuracy with Spatiotemporal Variables. Tex. Crime Rev. 2023, 10, 142–165.
22. Patel, R.; Bhagat, S. Risk Terrain Modeling and Kernel Density Estimation in Crime Forecasting. Adv. Spat. Anal. 2024, 12, 35–60.
23. Smith, J.; Wang, L. SHAPs and Reinforcement Learning in Crime Prediction Models. Mach. Learn. Rev. 2027, 32, 202–218.
24. Castelli, M.; Johnson, H. Predicting Urban Crime Rates with AI. J. Smart City Dev. 2017, 3, 234–248.
25. Butt, U.; Letchmunan, S.; Hassan, F.H.; Ali, M.; Baqir, A.; Koh, T.W.; Sherazi, H. Spatio-Temporal Crime Predictions by Leveraging Artificial Intelligence for Citizens Security in Smart Cities. IEEE J. 2021, 9, 47516–47529.
26. García-Zanabria, E.; Lee, C.; Patel, M. Machine Learning and Urban Crime: New Approaches. Urban Crime J. 2022, 17, 45–67.
27. Gil, M.; Weisburd, D. Population Bases in Crime Rate Calculations: A Comparative Study. J. Crime Public Policy 2022, 20, 250–275.
28. Curiel, R.; Ratcliffe, J.; Eck, J. Data Diversity in Crime Analysis. J. Urban Saf. 2017, 9, 18–32.
29. Braga, A.A.; Weisburd, D. Policing Problem Places: Crime Hot Spots and Effective Prevention; Oxford University Press: Oxford, UK, 2010.
30. Weinborn, C.; Ariel, B.; Sherman, L.W.; O’Dwyer, E. Hotspots vs. harmspots: Shifting the focus from counts to harm in the criminology of place. Appl. Geogr. J. 2017, 86, 226–244.
31. Jenga, K.; Catal, C.; Kar, G. Machine Learning in Urban Crime Prediction. Ambient. Intell. Humaniz. Comput. 2023, 14, 2887–2913.
32. Ingilevich, V.; Ivanov, S. Crime rate prediction in the urban environment using social factors. Procedia Comput. Sci. 2018, 136, 472–478.
33. Edmondson, M.; McCollum, W.R.; Chantre, M.-M.; Campbell, G. Exploring Critical Success Factors for Data Integration and Decision-Making in Law Enforcement. Int. J. Appl. Manag. Technol. 2019, 18, 1–16.
34. Rummens, A.; Hardyns, W. The effect of spatiotemporal resolution on predictive policing model performance. Int. J. Forecast. 2021, 37, 125–133.
35. Brantingham, P.; Mohler, G.; Berk, R. Predictive Risk Modeling for Individual Crime Prediction. J. Quant. Criminol. 2018, 34, 577–598.
36. Wang, J.; Hu, J.; Shen, S.; Zhuang, J.; Ni, S. Crime risk analysis through big data algorithm with urban metrics. Phys. A Stat. Mech. Its Appl. 2020, 545, 123627.
37. Haberman, C.; Little, T. Adaptable Models for Crime Prediction. Crime Predict. Rev. 2015, 3, 102–119.
38. Zhang, X.; Liu, L.; Lan, M.; Song, G.; Xiao, L.; Chen, J. Interpretable machine learning models for crime prediction. Comput. Environ. Urban Syst. 2022, 94, 101789.
39. Rudin, C.; Chen, C.; Chen, Z.; Huang, H.; Semenova, L.; Zhong, C. Interpretable machine learning: Fundamental principles and 10 grand challenges. Stat. Surv. 2022, 16, 1–85.
40. Bappee, F.K.; Soares, A.; Petry, L.M.; Matwin, S. Examining the impact of cross-domain learning on crime prediction. Big Data 2021, 8, 96.
41. Violent Crime & Property Crime by Municipality: 2000 to Present. Open Data. Available online: https://opendata.maryland.gov/Public-Safety/Violent-Crime-Property-Crime-by-Municipality-2000-/2p5g-xrcb/about_data (accessed on 28 October 2024).
42. Cumulative Violent Crime Reduction—2006 to 2014. Open Data. Available online: https://opendata.maryland.gov/Public-Safety/Cumulative-Violent-Crime-Reduction-2006-to-2014/rknb-wh47/about_data (accessed on 28 October 2024).
43. Annual Crime Report Page, MCPD, Montgomery County, MD. Available online: https://www.montgomerycountymd.gov/pol/data/crime-reports.html (accessed on 28 October 2024).
44. Anne Arundel County Crime Rate By Type. Open Data. Available online: https://opendata.maryland.gov/Public-Safety/Anne-Arundel-County-Crime-Rate-By-Type/3fys-ggpk/about_data (accessed on 28 October 2024).
45. Violent Crime & Property Crime Statewide Totals: 2006 to Present. Open Data. Available online: https://opendata.maryland.gov/Public-Safety/Violent-Crime-Property-Crime-Statewide-Totals-2006/hj4v-yg9g/about_data (accessed on 28 October 2024).
46. SHIP Domestic Violence 2010–2020. Open Data. Available online: https://opendata.maryland.gov/Health-and-Human-Services/SHIP-Domestic-Violence-2010-2020/c8eg-j9vr (accessed on 28 October 2024).
47. OpenPGC. Open Data. Available online: https://data.princegeorgescountymd.gov/ (accessed on 28 October 2024).
48. Counties. FRED. St. Louis Fed. Available online: https://fred.stlouisfed.org/categories/28543 (accessed on 28 October 2024).
49. Violent Crime & Property Crime by County: 1975 to Present. Open Data. Available online: https://opendata.maryland.gov/Public-Safety/Violent-Crime-Property-Crime-by-County-1975-to-Pre/jwfa-fdxs/about_data (accessed on 28 October 2024).
50. Maryland Total Migration: 2001–2022. Open Data. Available online: https://opendata.maryland.gov/Demographic/Maryland-Total-Migration-2001-2022/3hb2-c6rg/about_data (accessed on 4 November 2024).
51. Council of State Governments Justice Center. Analysis of policy influence on crime rates in Maryland. Justice Policy 2024, 21, 157–170.
52. Maryland Public Policy Institute. Socioeconomic factors and crime rates in Maryland. Public Policy Anal. 2022, 34, 442–460.
53. Sampson, R.J.; Wilson, W.J. Toward a theory of race, crime, and urban inequality. In Crime and Inequality; Hagan, J., Peterson, R.D., Eds.; Stanford University Press: Redwood City, CA, USA, 1995; pp. 37–56.
54. Massey, D.S.; Denton, N.A. American Apartheid: Segregation and the Making of the Underclass; Harvard University Press: Cambridge, MA, USA, 1993.
55. Lauritsen, J.L. Gender and violent victimization, 1973–2000. J. Quant. Criminol. 2003, 19, 79–110.
56. Anderson, E. Code of the Street: Decency, Violence, and the Moral Life of the Inner City; W.W. Norton & Company: New York, NY, USA, 1999.
57. Pratt, T.C.; Cullen, F.T. Assessing macro-level predictors and theories of crime: A meta-analysis. Crime Justice 2005, 32, 373–450.
58. Clarke, R.V.; Harris, P.M. Auto theft and its prevention. Crime Justice 1992, 16, 1–54.
59. Fisher, B.S.; Daigle, L.E.; Cullen, F.T. Unsafe in the Ivory Tower: The Sexual Victimization of College Women; Sage Publications: Thousand Oaks, CA, USA, 2010.
60. Bryant, C.D.; Willis, G. The Handbook of Criminology; Routledge: London, UK, 2006.
61. Krivo, L.J.; Peterson, R.D. Extremely disadvantaged neighborhoods and urban crime. Soc. Forces 1996, 75, 619–648.

Figure 1. Flowchart for predictive model development in crime analysis.

Figure 2. Crime trends in 23 counties and one city of Maryland, from 2012 to 2023: (a) auto theft crime trend over time, (b) arson crime trend over time, (c) homicide crime trend over time, (d) burglary crime trend over time, (e) rape crime trend over time, (f) larceny crime trend over time, (g) shooting crime trend over time, (h) robbery crime trend over time, (i) assault crime trend over time, and (j) total crime trend over time.

Figure 3. Crime trends, grouped according to the victim’s race, over the studied years: (a) burglary, (b) larceny, (c) robbery, (d) assault, (e) shooting, (f) homicide, (g) rape, (h) arson, and (i) auto theft.

Figure 4. Crime trends according to the victim’s gender across crime types: (a) distribution, (b) assault, (c) shooting, (d) homicide, (e) larceny, (f) burglary, (g) rape, (h) robbery, (i) auto theft, and (j) arson.

Figure 5. Crime trends, grouped according to the victim’s age group across crime types: (a) larceny, (b) aggravated assault, (c) auto theft, (d) arson, (e) burglary, (f) homicide, (g) shooting, (h) robbery, and (i) rape.

Figure 6. Clustering results. (a) Robbery rate trends (Cluster 0): gradual decline over time. (b) Robbery rate trends (Cluster 1): high variability across counties. (c) Robbery rate trends (Cluster 2): persistently high rates in Baltimore City. (d) Robbery rate trends (Cluster 3): steady declines across semi-urban regions. (e) Robbery rate trends (Cluster 4): sporadic spikes in low-population areas. (f) Rape rate trends (Cluster 0): gradual increase, with some recent declines. (g) Rape rate trends (Cluster 1): high variability across counties. (h) Rape rate trends (Cluster 2): persistently high rates in Baltimore City. (i) Rape rate trends (Cluster 3): steady declines in semi-urban regions. (j) Rape rate trends (Cluster 4): sporadic spikes in low-population areas. (k) Murder rate trends (Cluster 0): gradual increase with variability. (l) Murder rate trends (Cluster 1): high variability across rural counties. (m) Murder rate trends (Cluster 2): persistently high rates in Baltimore City. (n) Murder rate trends (Cluster 3): steady declines in semi-urban areas. (o) Murder rate trends (Cluster 4): sporadic spikes in low-population areas.

Figure 7. Contextual analysis of crime rates across Maryland counties: (a) aggravated assault rates by county, (b) robbery rates by county, (c) rape rates by county, (d) murder rates by county, (e) motor vehicle theft rates by county, and (f) larceny theft rates by county.

Figure 8. Variations in R-squared values across different train–test split ratios.

Figure 9. Variations in root mean square error (RMSE) across the different train–test split ratios.

Figure 10. Feature importance plot.

Figure 11. SHAP analysis of feature importance.

Table 1. Data sources.

Data Source	Description	Years Covered	Relevance
[41]	Violent and property crime rates by municipality in Maryland.	2000–Present	Offers a longitudinal view of crime trends across Maryland’s regions.
[42]	Data on cumulative reductions in violent crime, illustrating the effectiveness of crime reduction initiatives.	2006–2014	Measures the success of crime prevention initiatives over time.
[43]	Detailed crime data for Montgomery County, MD.	1975–Present	Provides regional insights and trends within Montgomery County.
[44]	Detailed property and violent crime data for Anne Arundel County.	1975–Present	Offers a focused view of historical crime trends in a specific county.
[45]	Statewide totals for violent and property crime across Maryland.	2006–Present	Provides a macro view of statewide crime trends.
[46]	Data on domestic violence incidents across Maryland.	2010–2020	Assesses trends in domestic violence as a subset of violent crimes.
[47]	Recent crime incidents in Prince George’s County.	July 2023–Present	Provides up-to-date crime data for the current analysis.
[48]	Data on population growth, median household income, unemployment rates, and poverty levels for Maryland counties.	2012–2022	Provides socio-economic context for crime analysis, highlighting the correlations between crime rates and economic factors.
[49]	Rates of murder, rape, robbery, aggravated assault, breaking and entering, larceny theft, and motor vehicle theft across Maryland.	1975–Present	Provides granular crime rate data for specific crime types, allowing for detailed statistical analysis and prediction modeling.
[50]	Data on domestic and international migration trends in Maryland.	2001–2022	Explores how migration patterns impact crime rates by altering regional demographics and socio-economic conditions.

Table 2. Summary of clustering results.

Cluster	Counties	Summary
Cluster 0	Anne Arundel, Baltimore, Charles, Frederick, Montgomery	Gradual declines in robbery rates, with recent increases in murder and rape rates.
Cluster 1	Calvert, Carroll, Garrett, Howard, Queen Anne’s, St. Mary’s, Talbot, Washington, Worcester	High variability in robbery and murder rates; unpredictable trends.
Cluster 2	Baltimore City	Persistently high crime rates with slight reductions in robbery after 2018.
Cluster 3	Caroline, Cecil, Harford, Kent, Prince George’s, Wicomico	Steady declines in robbery, murder, and rape rates.
Cluster 4	Allegany, Dorchester, Somerset	Sporadic spikes in crime rates, driven by isolated incidents.

Table 3. Summary table of crime rates: comparison by counties.

Crime Type	High Rates (Counties)	Low Rates (Counties)	General Trend
Aggravated Assault	Baltimore, Dorchester	Garrett, St. Mary’s	Concentrated in urban areas
Robbery	Prince George’s, Baltimore	Garrett, Talbot	Urban-centric
Rape	Worcester, Montgomery	Queen Anne’s, Kent	Higher variability across counties
Murder	Prince George’s, Charles	Talbot, Somerset	Higher in metropolitan areas
Motor Vehicle Theft	Prince George’s, Baltimore	Kent, Talbot	Urban-focused
Larceny Theft	Dorchester, Anne Arundel	Garrett, Talbot	More evenly distributed

Table 4. Model performance across the different train–test splits.

Model	R² (Mean)	R² Variance	RMSE (Mean)	RMSE Variance
Linear Regression	0.50	0.010	15.0	2.25
Ridge Regression	0.58	0.008	13.2	1.89
Lasso Regression	0.45	0.012	17.5	3.10
Random Forest	0.85	0.003	8.4	0.64
Gradient Boosting	0.80	0.004	9.7	0.81
XGBoost	0.83	0.002	9.1	0.50
Extra Trees	0.82	0.005	9.5	0.72
CatBoost	0.84	0.003	8.8	0.55
SVR (RBF Kernel)	0.75	0.009	12.3	1.60
Stacking Regressor	0.88	0.002	7.0	0.36
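
The mean and variance columns in Table 4 summarize each metric over several train–test split ratios. The sketch below shows one way such a summary could be computed for a single model; it is illustrative only and is not the authors' code. The helper names, the toy targets and predictions, and the use of the population variance (`statistics.pvariance`) are all assumptions, since the text does not state which variance estimator was used.

```python
import math
import statistics


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot  # undefined when y_true is constant


def rmse(y_true, y_pred):
    """Root mean square error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(statistics.fmean(squared_errors))


def summarize(scores):
    """Mean and population variance of one metric across splits (assumed estimator)."""
    return statistics.fmean(scores), statistics.pvariance(scores)


# Hypothetical held-out targets and predictions for three split ratios.
# These numbers are made up and do not reproduce any value in Table 4.
splits = {
    "70-30": ([1, 2, 3, 4], [1, 2, 3, 5]),
    "80-20": ([2, 4, 6], [2, 4, 7]),
    "90-10": ([1, 3], [1, 2]),
}

r2_scores = [r_squared(t, p) for t, p in splits.values()]
rmse_scores = [rmse(t, p) for t, p in splits.values()]

r2_mean, r2_var = summarize(r2_scores)
rmse_mean, rmse_var = summarize(rmse_scores)
print(f"R2: mean={r2_mean:.3f}, variance={r2_var:.4f}")
print(f"RMSE: mean={rmse_mean:.3f}, variance={rmse_var:.4f}")
```

A low variance across splits, as reported for the stacking regressor, indicates that the metric changes little when the proportion of held-out data changes.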

Table 5. The best hyperparameters and test performance for traditional and ensemble models, along with cross-validation variance.

Model	Best Hyperparameters	Test R²	R² Variance	Test RMSE	RMSE Variance
Linear Regression	None (no hyperparameters)	0.50	0.010	15.0	2.25
Ridge Regression	α = 0.5	0.58	0.008	13.2	1.89
Lasso Regression	α = 0.1	0.45	0.012	17.5	3.10
Random Forest	n_estimators = 100, max_depth = 10	0.85	0.003	8.4	0.64
Extra Trees	n_estimators = 100, max_depth = 10	0.82	0.005	9.5	0.72

Table 6. The best hyperparameters and test performance for advanced models (boosting, SVR, and stacking), along with cross-validation variance.

Model	Best Hyperparameters	Test R²	R² Variance	Test RMSE	RMSE Variance
Gradient Boosting	learning_rate = 0.1, n_estimators = 100, max_depth = 3	0.80	0.004	9.7	0.81
XGBoost	learning_rate = 0.1, max_depth = 4, subsample = 0.8	0.83	0.002	9.1	0.50
CatBoost	iterations = 200, depth = 6, learning_rate = 0.05	0.84	0.003	8.8	0.55
SVR (RBF kernel)	kernel = RBF, C = 10, γ = 0.1	0.75	0.009	12.3	1.60
Stacking Regressor	base models = {RF, XGB, SVR}, meta-model = Ridge	0.88	0.002	7.0	0.36

Table 7. Neural network performance.

Configuration	R-Squared	MSE
Base Model (3 Dense Layers)	0.87	0.022
Optimized Model (85–15 split)	0.93	0.007
CNN Architecture	0.74	0.031

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

