1. Introduction
Over the past decade, the residents of Maryland’s urban areas have witnessed a significant 12% increase in violent crimes [1]. Meanwhile, property crimes, while generally in decline, still account for over 60% of total reported incidents statewide [2]. The complexity of these crime dynamics is further compounded by the underreporting and classification challenges that often plague traditional crime databases [3]. The disparity in crime reporting and response effectiveness across Maryland’s various regions further motivates the need for improved analytical methods.
Crime analysis has traditionally relied on historical data and statistical methods to uncover trends and predict future criminal activities. However, with the rise of vast and complex datasets, conventional approaches often fail to fully capture the intricacies of crime patterns, especially in urban environments like Maryland. In response, machine learning (ML) has emerged as a powerful tool capable of analyzing large volumes of data while uncovering the most influential factors, non-linear relationships, and hidden patterns that traditional methods might overlook.
The intersection of crime analysis and machine learning is particularly crucial for Maryland, a state characterized by diverse urban, suburban, and rural areas, each having unique crime dynamics. Maryland’s urban centers, including Baltimore, face distinct challenges, particularly violent crime, while its suburban and rural regions see different types of criminal activity, such as property crimes and drug-related offenses. Machine learning models offer the potential to analyze and predict these varying crime patterns at a granular level, empowering law enforcement and policymakers with the tools needed to proactively address crime [4,5].
This study introduces a novel approach by tailoring advanced machine learning models to specifically address Maryland’s unique urban–rural composition and demographic diversity. Unlike existing studies that often generalize findings from large metropolitan areas like New York or Los Angeles, our research focuses on the distinct challenges faced by Maryland. We differentiate our methodology by combining random forest, gradient boosting, and XGBoost 3.0, optimized not just for predictive accuracy but also for their applicability across Maryland’s varied landscapes, to capture nuances in crime patterns that generic models typically miss [6,7].
This tailored approach allows us to provide more actionable insights for law enforcement and policymakers, enabling them to develop more precise, data-driven public safety strategies suited to Maryland’s settings. Furthermore, by exploring crime patterns across different demographic factors, including age, gender, and race, using comprehensive crime data collected within the state, this study aims to identify and inform interventions that address disparities in crime rates and victimization, thereby fostering more equitable and effective public safety measures [8].
The societal impact of applying machine learning to crime analysis in Maryland is profound. By harnessing the predictive power of these models, law enforcement agencies can anticipate crime hotspots, understand demographic factors influencing crime, and implement proactive interventions [9]. This approach not only aids in identifying underlying socio-economic factors contributing to crime but also assists policymakers in designing more targeted and effective crime prevention programs, fostering safer communities across the state.
The findings of this study are intended to inform law enforcement and community stakeholders, aiding in the crafting of more effective interventions that address the specific needs of these demographics. By highlighting areas with pronounced crime rates and identifying trends over time, we provide a foundation for future research and policymaking that is responsive to the changing landscape of urban crime. Through this investigation, we seek to contribute to the broader discourse on public safety, equity, and justice, ultimately aiming to foster a safer and more inclusive community for all residents [10,11,12].
To guide readers through the rest of the paper, the remaining sections are structured as follows. First, we review the relevant literature on crime analysis and machine learning, establishing the context and identifying gaps in the literature addressed by our work. Next, we detail the methodology, including the data sources, preprocessing steps, and configuration of the machine learning models employed. Subsequently, we present the results of our data analysis, highlighting key findings on crime trends and model performance across Maryland. This is followed by a discussion of the implications of these findings, with a focus on differences between urban and rural crime dynamics. Finally, we conclude the paper by summarizing its contributions and outlining the study’s limitations, as well as identifying directions for future research.
2. Literature Review
Recent advancements in crime analysis underscore the importance of robust data collection and precise analytical models. The Barcelona Victimization Survey (2015–2020), for instance, highlights the significance of comprehensive data collection in understanding community safety and crime dynamics across neighborhoods [13]. This survey, alongside other studies [14], provides insights that are often overlooked with traditional police records, emphasizing the need for supplementary data sources in crime analysis.
The reliability of crime data has been enhanced through the application of models such as the quasi-simplex model (QSM) and its multi-item extension (MI-QSM). These models decompose observed crime data variance into specific components, representing true scores, method effects, and random errors, using Bayesian estimation techniques [15,16]. Such methodological rigor is crucial for enhancing data validity and adaptability for analyzing regional crime trends [17,18].
While studies like these focus on survey data, recent research underscores the value of integrating multiple datasets for improved crime prediction [19]. Our approach advances beyond traditional spatiotemporal co-kriging (ST-Cokriging) methods, which combine crime records with high-resolution activity data from police operations, improving short-term forecasting accuracy and enabling the better prediction of hotspots and targeted interventions [20]. Unlike the ST-Cokriging method, which primarily improves short-term forecasting accuracy, our models (random forest, gradient boosting, and XGBoost) integrate a broader range of temporal and spatial data, enhancing our ability to predict crime hotspots and trends over longer periods and across more diverse settings.
Deng and colleagues further advanced predictive methods by introducing spatiotemporal lag variables, which mitigate the spatial and temporal dependencies embedded in crime data. Their study demonstrated that accounting for these dependencies using tree-based machine learning models significantly enhances predictive accuracy, particularly when applied to data from Dallas, Texas. By modeling both environmental and demographic factors, they provided a more accurate representation of crime dynamics and offered practical guidance for proactive urban security strategies [21]. Our study builds on these advancements by not only accounting for spatial and temporal dependencies but also incorporating machine learning algorithms that adapt dynamically to changes in crime patterns due to seasonal and socioeconomic factors. This provides a more detailed and adaptable model than those typically employed in studies like the one conducted in Dallas, Texas.
Current research divides temporal forecasting into short-term, medium-term, and long-term categories, employing methods such as LASSO regression and neural networks. Spatial prediction operates across micro-, meso-, and macro-levels using models like kernel density estimation (KDE) and risk terrain modeling (RTM), enabling the efficient management of police resources by forecasting high-crime areas and trends [22]. This line of research also emphasizes the importance of integrating reinforcement learning techniques and Shapley additive explanations (SHAP) to enhance the interpretability and practicality of crime prediction models. This multi-scale approach underscores the complexities in crime dynamics, particularly in regions with diverse urban and rural environments, such as Maryland, where nuanced and adaptable models are required to account for varying crime patterns [23].
In alignment with these studies, subsequent authors proposed an artificial intelligence model for predicting per capita violent crimes in urban areas. This model integrated socioeconomic and law enforcement data to generate accurate crime forecasts, optimizing resource allocation by leveraging a genetic programming (GP) framework enhanced with local search optimization. The system was tested across various US cities, demonstrating a lower error rate compared to other state-of-the-art models, highlighting its suitability for large datasets and the evolving needs of smart city development. A crucial innovation in their approach is the use of semantic genetic programming, which improves model interpretability by focusing on the behavior of programs rather than just their syntax. The study’s findings confirm the effectiveness of AI-driven crime forecasting methods in guiding resource allocation and improving urban security [24].
The use of spatiotemporal prediction, including micro-, meso-, and macro-level classifications, provides a framework for proactively managing crime through optimal patrol routes and targeted intervention strategies [25]. However, limitations persist, such as the lack of standardized evaluation systems and effective methods to handle sparse datasets. The integration of advanced machine learning models with socioeconomic and environmental data offers promising directions for future research [26].
Significant discrepancies in crime rate calculations can arise when using different population bases. Studies comparing residential and workday population-based crime rates reveal varying spatial patterns of crime hotspots. This variation emphasizes the importance of considering population context and selecting appropriate denominators, which can affect crime rate assessments and prevention strategies. Comparative analyses in Islington, London, highlighted variations across data sources, including police records, ambulance data, and synthetic crime data. These findings support the use of multiple data sources to achieve a comprehensive understanding of crime dynamics, particularly in regions like Maryland with urban and rural diversity [27,28].
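The effect of the denominator can be illustrated with a short sketch. The area names, incident counts, and population figures below are hypothetical, and the per-100,000 scaling is an assumed convention rather than a detail taken from the cited studies.

```python
def rate_per_100k(incidents: int, population: int) -> float:
    """Incidents per 100,000 people for the chosen population base."""
    if population <= 0:
        raise ValueError("population must be positive")
    return incidents * 100_000 / population


# Hypothetical areas: (incidents, residential population, workday population)
areas = {
    "downtown": (500, 20_000, 100_000),
    "suburb": (300, 60_000, 30_000),
}

# Compute the same incident counts against both denominators.
rates = {
    name: {
        "residential": rate_per_100k(n, residential),
        "workday": rate_per_100k(n, workday),
    }
    for name, (n, residential, workday) in areas.items()
}

for name, by_base in rates.items():
    print(name, by_base)
```

With these made-up figures, the downtown area has the higher rate on a residential base, while the suburb has the higher rate on a workday base, so the apparent hotspot depends on the denominator chosen.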
Traditional metrics for crime concentration, such as the Gini coefficient, have been critiqued for their limitations in effectively capturing complex crime distribution patterns. Alternative probabilistic models, such as Poisson distributions, offer a more nuanced framework for understanding individual victimization rates, which can inform targeted crime prevention strategies [29]. The importance of advanced statistical methods to capture crime concentrations accurately has been reiterated in studies examining spatial distribution and temporal patterns [30].
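To make the two measures concrete, the following minimal sketch computes a Gini coefficient over per-unit crime counts and a Poisson probability for a given mean rate. The counts and the mean of 0.1 crimes per unit are hypothetical, and the functions are illustrative rather than the estimators used in the cited work.

```python
import math


def gini(counts):
    """Gini coefficient of non-negative counts.

    Returns 0 when counts are equal and (n - 1) / n when one unit holds all.
    """
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def poisson_pmf(k, lam):
    """Probability of exactly k events when events occur at random with mean lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


even = gini([5, 5, 5, 5])          # crimes spread evenly over four units
concentrated = gini([0, 0, 0, 8])  # all crimes in one of four units
share_zero = poisson_pmf(0, 0.1)   # chance a unit sees no crime at mean 0.1
```

When crimes are sparse relative to the number of units, a purely random Poisson process already leaves most units with zero incidents (about 90% at a mean of 0.1), which by itself produces a high Gini value. This is one way to read the critique that the Gini coefficient can overstate concentration.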
Machine learning (ML) has become a transformative tool for crime prediction, especially in large metropolitan areas with comprehensive datasets. Foundational studies, such as those conducted in Los Angeles, employed ML to develop predictive policing models that identified crime hotspots using historical crime data. Techniques like logistic regression, decision trees, and k-nearest neighbors (k-NN) have demonstrated the ability to uncover patterns that can be overlooked with traditional statistical methods [31].
Recent studies have expanded ML applications by integrating socio-economic and urban metrics to enhance predictive accuracy. The integration of temporal and spatial data in predictive models has been shown to support early warning systems for identifying temporary crime hotspots. Additionally, ML has been used to predict specific types of crime, such as domestic violence, providing valuable insights for policy interventions [30].
Research in New York City and Chicago has demonstrated the benefits of integrating multi-source data for more comprehensive crime analysis. Studies have utilized ML models like random forests and gradient boosting to merge sociodemographic data with historical crime records, leading to improved accuracy in crime forecasting. This approach has been particularly effective for resource allocation and crime prevention strategies. The use of data from diverse sources, such as census information, economic indicators, and police records, has proven valuable for contextualizing crime trends [32].
Practical implementations, such as the deployment of geospatial data for patrol route planning, highlight the real-world impact of predictive policing. In Los Angeles and New York City, predictive models have been used not only to forecast hotspots but also to refine intervention strategies, contributing to a reduction in crime rates. These case studies validate the feasibility of incorporating machine learning for operational improvements in law enforcement [33].
In Chicago, predictive risk modeling has gone beyond location-based forecasting to individual-level crime prediction. By incorporating attributes such as prior criminal history, age, gender, and social networks, researchers have developed models capable of predicting whether an individual is likely to commit a crime, enabling targeted interventions. Such models contribute to early intervention and rehabilitation efforts aimed at reducing recidivism [34].
While significant progress has been made in applying ML for crime analysis in urban settings, there remains a gap in the literature in terms of research on regions with mixed urban, suburban, and rural characteristics, such as Maryland. These areas present unique challenges related to data quality and availability. The integration of community surveys, migration data, and socio-economic indicators can improve the robustness of predictive models in such contexts [35]. Researchers have suggested that adaptive and flexible models are essential for addressing these challenges [36].
The expansion of ML in crime prediction must also address ethical concerns and the potential for algorithmic bias. Studies have shown that without careful design, ML models may reinforce existing inequalities and lead to biased policing outcomes. Transparent model development and the inclusion of diverse stakeholder input are crucial for ensuring fairness and equity [37]. Additionally, the application of interpretable ML techniques can help build trust and facilitate the implementation of data-driven policies [38].
This review of the literature demonstrates that while ML has significantly advanced crime prediction and analysis, future work should focus on applying these methods in diverse regional contexts and addressing ethical challenges. Integrating multi-source data, refining model transparency, and incorporating policy-relevant insights are essential steps toward creating adaptable, equitable, and effective crime prevention strategies [39].
Despite advances in crime analysis, existing methodologies often struggle with integrating data from heterogeneous sources and accurately predicting crime hotspots. Previous studies have demonstrated the utility of machine learning (ML) techniques, but they frequently rely on limited data types or overlook the complexities introduced by diverse urban and rural environments. Our research addresses these gaps by utilizing a more comprehensive dataset that includes police reports, crime databases, and demographic statistics. Moreover, most existing studies do not sufficiently explore the predictive accuracy of their models across different crime types and regions, nor do they fully assess the practical implications of their findings for law enforcement and public safety. Our study aims to fill these critical gaps by providing a nuanced analysis of crime patterns and predictive model performance, thereby offering actionable insights that can significantly enhance crime prevention and safety strategies in Maryland. This approach not only bridges this gap in the literature but also contributes directly to more effective and informed public policy and safety measures [40].
Building on the foundational work discussed, our study introduces significant innovations in the application of machine learning to crime analysis. Our research integrates a comprehensive range of socioeconomic, demographic, and environmental factors using advanced machine-learning techniques such as random forest, gradient boosting, and XGBoost. This integration allows for more detailed and adaptable modeling of crime patterns, particularly in Maryland’s diverse urban and rural landscapes. Furthermore, our methodology uniquely addresses the challenges of data sparsity and the need for model interpretability, which have been persistent limitations of earlier research. By employing a multi-source data framework that enhances the predictive accuracy and applicability of our models, our study not only fills critical gaps in the literature but also sets new benchmarks for effective and equitable crime prediction strategies. These contributions mark a significant step forward in the utilization of data science for public safety, offering robust tools that are adaptable to the dynamic nature of crime and its prevention.
3. Methodology
The methodology for this crime analysis research, shown in Figure 1, focuses on performing extensive data analysis and developing a robust predictive model to estimate the crime index in Maryland counties. This model integrates various stages of data preprocessing, feature engineering, and advanced machine learning techniques to ensure accurate and reliable predictions. By incorporating statistical methods, machine learning algorithms, and ensemble techniques, we aim to identify patterns of crime and predict crime trends effectively. This study specifically employs non-spatial predictive models, acknowledging that spatial correlations, which can significantly influence crime patterns, are not incorporated in the current analysis. This methodological focus was chosen to initially explore and understand the broader, non-spatial factors affecting crime rates across various counties, due to both the initial scope of our research and the limitations of the dataset available. Future research will aim to integrate spatial statistical models to address these limitations and provide a more comprehensive analysis of regional crime dynamics.
3.1. Data Sources
To conduct a comprehensive analysis of crime patterns in Maryland, this study utilizes a diverse array of data sources, which are shown in Table 1. These datasets provide both temporal and regional coverage, encompassing a wide range of crime types, demographic variables, and socio-economic factors. By integrating crime statistics, such as murder, rape, robbery, aggravated assault, break and entry, larceny theft, and motor vehicle theft, from 23 counties in Maryland with additional socio-economic indicators such as unemployment rates, household income, and population and migration trends, this study is able to present a holistic view of crime across the state from 2012 to 2023, based on the availability of variables in those years. Table 1 outlines the primary datasets employed, each being chosen for its relevance, reliability, and ability to contribute valuable insights into the complexities of Maryland’s crime landscape.
3.2. Input Variables
This section delineates the criteria for selecting a comprehensive range of input variables, which are then utilized to assess and predict crime rates across various Maryland counties. These variables are segmented into demographic, economic, and crime-specific indicators, each of which is critical for crafting a nuanced analysis of crime dynamics. The variables included here are outlined in Appendix A (Table A1) and were chosen through a systematic review of the literature, which pinpointed those factors frequently associated with crime trends and their predictive power. This selection includes demographic data such as population metrics and economic indicators like median household income and unemployment rates, along with crime-specific rates like murder, rape, and robbery. These are calculated using standardized formulas to ensure consistency and comparability across different geographic regions and time periods.
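As an illustration of what such a standardized rate can look like, one common convention expresses each crime-specific rate per 100,000 residents. The per-100,000 base and the symbols below are assumptions made for this sketch; the exact formulas used in the study are those listed in Appendix A.

```latex
\[
  r_{c,t}^{(k)} = \frac{N_{c,t}^{(k)}}{P_{c,t}} \times 100{,}000
\]
```

Here $N_{c,t}^{(k)}$ denotes the number of reported incidents of crime type $k$ in county $c$ during year $t$, and $P_{c,t}$ denotes that county's population in the same year. Dividing by population is what makes counties of very different sizes comparable.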
The criteria for variable selection were anchored on their established relevance in previous research, providing a robust analytical base that leverages proven predictors of crime. This structured approach enhances the scalability and applicability of our analysis across diverse datasets, minimizing manual selection efforts. Future datasets can utilize automated algorithms that select variables based on their statistical significance and predictive validity, streamlining the variable inclusion process.
3.3. Data Collection and Preprocessing
The initial stage of our predictive model development involved an extensive data collection effort that integrated detailed crime types such as murder, rape, robbery, aggravated assault, burglary, larceny theft, and motor vehicle theft. To amplify the model’s predictive power and address the complex socioeconomic dynamics, we included several additional variables:
Socio-economic indicators: Unemployment rates, household income, population below the poverty line, and education levels.
Demographic variables: Migration patterns (domestic and international) and age distribution.
Geographical context: Urban–rural classifications and each area’s proximity to known crime hotspots.
The integration of these variables was designed to capture the complex interrelationships among socioeconomic conditions, demographic trends, and crime rates, facilitating a comprehensive understanding of regional disparities.
A rigorous preprocessing pipeline ensured data quality and readiness for modeling:
Handling missing data: Missing values in continuous variables (e.g., median household income) were addressed through mean imputation, while categorical variables (e.g., urban/rural classifications) were handled using mode imputation. Advanced techniques, such as k-nearest neighbors (KNN) imputation, were considered for variables with higher missing rates to maintain dataset integrity.
Data type correction: Variables representing crime rates and socioeconomic indicators were converted to numeric formats, ensuring compatibility with machine learning algorithms. Errors in data entry, such as non-numeric characters appearing in numeric fields, were identified and corrected.
Normalization and scaling: To address the varying scales of features, Min-Max scaling was applied, thereby rescaling numerical variables to a uniform range between 0 and 1. This step ensured that no single variable disproportionately influenced model outcomes, particularly for high-magnitude variables like population and income.
Outlier detection and treatment: Outliers were identified using the Z-score method, flagging values beyond three standard deviations from the mean. Domain knowledge was applied to determine appropriate actions, such as capping extreme values or removing anomalies that could distort crime trends.
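The preprocessing steps described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the study's pipeline: the column values are hypothetical, capping is used as one of the possible outlier treatments, and the order in which capping and scaling are applied is an assumption.

```python
import re
from statistics import mean, mode, pstdev


def to_number(raw):
    """Strip non-numeric characters and parse; return None if parsing fails."""
    if raw is None:
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        return None


def impute(values, fill):
    """Replace missing (None) entries with a precomputed fill value."""
    return [fill if v is None else v for v in values]


def cap_outliers(values, z=3.0):
    """Cap values lying more than z standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return list(values)
    lower, upper = mu - z * sd, mu + z * sd
    return [min(max(v, lower), upper) for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw columns, including entry errors and missing values.
income_raw = ["$52,300", "61000", None, "48,750", "n/a"]
area_raw = ["urban", None, "rural", "urban", "urban"]

income = [to_number(v) for v in income_raw]                        # type correction
income = impute(income, mean([v for v in income if v is not None]))  # mean imputation
area = impute(area_raw, mode([v for v in area_raw if v is not None]))  # mode imputation
income_scaled = min_max_scale(cap_outliers(income))                # cap, then scale
```

Capping before scaling keeps a single extreme value from compressing every other observation toward zero. With very few rows, as in this toy column, no value can reach three standard deviations, so the cap only takes effect on larger samples.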
Advanced feature engineering techniques were employed to enhance the dataset’s predictive capacity:
Interaction terms: Relationships between key variables (e.g., unemployment rates and migration patterns) were modeled through interaction terms to reflect their combined influence on crime rates.
Higher-order features: Polynomial features (up to degree 2) were created for significant predictors such as household income and poverty levels to model non-linear relationships.
Feature selection: Recursive feature elimination (RFE) was utilized to identify impactful predictors, thereby improving model efficiency and interpretability. Importance scores from ensemble models (e.g., random forest and gradient boosting) guided the selection of features most relevant to crime prediction.
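The feature engineering steps above can be sketched as follows. This is an illustrative example that requires NumPy and runs on synthetic data; the feature names are hypothetical. The elimination loop ranks features with a plain least-squares fit on standardized columns, which is a simplifying substitute for the ensemble importance scores mentioned in the text.

```python
import numpy as np


def add_engineered_features(X, names, interactions, squares):
    """Append pairwise products and squared terms to the feature matrix."""
    columns, out_names = [X], list(names)
    for a, b in interactions:
        columns.append((X[:, names.index(a)] * X[:, names.index(b)])[:, None])
        out_names.append(f"{a}*{b}")
    for a in squares:
        columns.append((X[:, names.index(a)] ** 2)[:, None])
        out_names.append(f"{a}^2")
    return np.hstack(columns), out_names


def rfe_least_squares(X, y, names, n_keep):
    """Recursive feature elimination with a linear least-squares ranker.

    Each round refits on the surviving standardized features and drops the
    one with the smallest absolute coefficient. Assumes no constant columns.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    y_centered = y - y.mean()
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        coef, *_ = np.linalg.lstsq(Z[:, keep], y_centered, rcond=None)
        keep.pop(int(np.argmin(np.abs(coef))))
    return [names[i] for i in keep]


# Synthetic data: the target depends only on one interaction and one square.
rng = np.random.default_rng(0)
names = ["unemployment", "migration", "income"]
X = rng.uniform(0.0, 1.0, size=(300, 3))
y = 3.0 * X[:, 0] * X[:, 1] + 2.0 * X[:, 2] ** 2 + rng.normal(0.0, 0.01, size=300)

X_eng, eng_names = add_engineered_features(
    X, names, interactions=[("unemployment", "migration")], squares=["income"]
)
selected = rfe_least_squares(X_eng, y, eng_names, n_keep=2)
print(selected)
```

Standardizing the columns before each fit makes the coefficient magnitudes comparable across features measured on different scales, which is what allows the smallest coefficient to serve as the elimination criterion.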
This preprocessing framework, incorporating socioeconomic, demographic, and geographic dimensions, ensures that the dataset is clean, normalized, and suited to robust predictive modeling. The combination of feature engineering, normalization, and outlier handling keeps the methodology consistent with the complexities highlighted in the study’s results.
3.4. Data Analysis
The data analysis section of this research utilizes diverse datasets to examine crime trends across Maryland, focusing on incidents and their clearance rates across multiple jurisdictions and time periods. By employing advanced machine learning techniques, such as random forests and clustering methods, alongside time series regression models, the study analyzes the demographic and geographic patterns of crime victimization using data from census and law enforcement sources. Clustering techniques identify patterns in victimization across different demographic factors, such as age, gender, race, and geographical distribution, offering critical insights into the dynamics of crime. The findings from this section illuminate the complex factors contributing to crime trends and their differential impact across various population segments, offering guidance for targeted interventions and informed policy decisions. However, potential biases, such as changes in crime reporting practices during this period, should be considered when interpreting the results. The approach and findings have the potential to inform similar crime analysis efforts in regions with diverse urban and rural contexts.
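As a simple illustration of the regression component, the sketch below fits an ordinary least-squares linear trend to yearly incident counts. The counts are hypothetical, and a single linear trend is only the most basic form a time series regression could take; the study's actual model specification is not reproduced here.

```python
def linear_trend(years, counts):
    """Ordinary least-squares slope and intercept of counts regressed on year."""
    n = len(years)
    if n < 2 or n != len(counts):
        raise ValueError("need at least two paired observations")
    x_bar = sum(years) / n
    y_bar = sum(counts) / n
    sxx = sum((x - x_bar) ** 2 for x in years)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, counts))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical yearly counts that fall by 40 incidents per year.
years = list(range(2012, 2024))
burglary = [1000 - 40 * (year - 2012) for year in years]

slope, intercept = linear_trend(years, burglary)
print(f"estimated change per year: {slope:.1f}")
```

The slope gives the average yearly change in incidents over the period, which is a convenient summary for comparing crime types, although it hides the year-specific jumps discussed in the trend analysis.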
3.4.1. Crime Trends over Time: A Comparative Analysis Across Crime Types
The provided graphs offer an observational view of crime trends across Maryland from 2012 to 2023, highlighting notable changes in various crime types over the years. Figure 2a illustrates the trend for auto theft, showing a relatively stable pattern until a sharp increase in 2023, possibly indicating changes in reporting practices or actual incidents. Figure 2b presents the arson trend, which remained relatively steady through most of these years, followed by a rise in 2023. Figure 2c, depicting homicide, shows a peak around 2015, with fluctuations in subsequent years and a general decline through 2023. This trend may reflect specific factors or policy changes influencing violent crime in Maryland [51]. Figure 2d tracks burglary reporting, which exhibits a continuous decline over the period and a marked decrease after 2016, reaching one of its lowest recorded levels by 2023. However, data covering further years would be necessary to confirm if this trend is sustained. Figure 2e, which depicts the trend regarding rape incidents, shows variability over the years, including peaks around 2017 and 2019, followed by a stabilization period. Figure 2f highlights larceny as the most frequent crime type throughout the period, despite a recent decline in incidents, suggesting either an improvement in preventive measures or changes in social conditions. Figure 2g illustrates the trend regarding shootings, which remains relatively stable from 2014 onward, with a slight increase in 2023 that warrants closer examination. In Figure 2h, robbery maintains a relatively stable trend until 2016, followed by a gradual decline through 2023. Figure 2i shows the trend for assault, which remains consistent for much of the period, but begins a noticeable decline starting in 2023. Finally, Figure 2j provides a comprehensive overview of the total crime trends across categories, with larceny leading the figures in terms of frequency over the years. The graphs collectively show declines across most categories by 2023, with significant reductions observed in the reporting of auto theft, burglary, and assault.
It is essential to interpret these recent declines with caution, as they may reflect variations in reporting practices or data completeness, particularly for more recent years. This observational analysis does not seek to establish causation but instead highlights notable shifts in crime types over time, providing insight into evolving patterns that could inform future research.
While the findings offer valuable insights into Maryland’s crime landscape, they also underscore the need for further analysis, particularly regarding external factors that may have influenced these trends. For instance, research by the Maryland Public Policy Institute [52] highlights a correlation between high poverty rates and elevated crime levels in certain areas, suggesting that socio-economic disparities are crucial to understanding crime trends. Additionally, a report by the Council of State Governments Justice Center (2024) [51] emphasizes that policy changes and legislative reforms can significantly impact crime patterns, demonstrating the influence of external factors on local crime rates. These limitations underscore the need for a nuanced approach to interpreting trends since data completeness, regional influences, and socio-economic and policy-related factors may significantly impact the observed patterns.
3.4.2. Crime Trends, Shown by Victim’s Racial Group, over the Years
This section examines the trends in various crime types across different racial groups over the years, specifically focusing on burglary, larceny, robbery, assault, shooting, homicide, arson, rape, and auto theft (see Figure 3). The analysis reveals notable differences in victimization rates among racial groups, suggesting underlying systemic and socio-economic factors that contribute to these disparities.
For burglary (Figure 3a), both White and Black/African American victims experienced the highest counts, with a significant decline observed in both groups, particularly from 2021 onward. Between 2018 and 2023, burglary rates among Black/African American victims decreased by approximately 48%, while the number of White victims saw a reduction of 47%. This decline aligns with broader crime reduction trends but could also reflect shifts in reporting practices or changes in community interventions.
For larceny (Figure 3b) and robbery (Figure 3c), the trends mirror each other, with both crimes showing higher victimization rates among White and Black/African American populations, while other racial groups have consistently lower numbers. Larceny incidents among Black/African American victims remained relatively stable, with a slight increase of 0.4%, while White victims experienced a decline of 12%. Robbery rates decreased by 22% for Black/African American victims and 29% for White victims, reflecting consistent declines across both groups.
Assault (Figure 3d) shows similar trends, with White and Black/African American victims experiencing the highest rates, followed by a gradual decline across all racial groups. Between 2018 and 2023, assault incidents rose by 2.5% for White victims and 6.8% for Black/African American victims, indicating potential differences in reporting practices or the underlying factors driving these increases. Shooting (Figure 3e) and homicide (Figure 3f) incidents predominantly affected Black/African American and White populations. Shooting incidents among Black/African American victims decreased by approximately 11%, while the number of White victims exhibited a smaller decline of 10%. Homicide rates declined by 19% for Black/African American victims but increased by 44% for White victims, suggesting divergent trends that warrant further investigation.
Arson, while less frequent overall (
Figure 3g), shows significant variation, with Black/African American victims seeing a modest increase of 11%, while White victims experienced a sharp rise in attacks of 533%, potentially due to localized events or anomalies in the dataset. The rape data (
Figure 3h) revealed that White and Black/African American victims have been the most affected, with White victims consistently experiencing higher rates. However, there was a decline in rape cases for all racial groups by 2023, with Black/African American victims showing a reduction of 41% and White victims a drop of 48%. Lastly, auto theft (
Figure 3i) displays a significant rise, with Black/African American victims experiencing an increase of 130%, and White victims seeing a dramatic rise of 174%.
Our findings highlight systemic and structural factors that may contribute to the observed disparities in victimization rates across racial groups. Previous research suggests that poverty, neighborhood characteristics, access to resources, and differential policing practices play critical roles in shaping these patterns [
51,
52,
53,
54]. Additionally, racial disparities in crime reporting and enforcement may further exacerbate these trends, underscoring the importance of considering potential biases in the data.
The extreme increase in arson rates for White victims suggests the need for further investigation into any localized factors driving this anomaly. Similarly, the sharp rise in auto theft for both Black/African American and White victims highlights the potential influence of broader social or economic conditions, such as rising unemployment or shifts in law enforcement priorities.
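The percentage changes reported in this subsection compare victim counts at the start and end of the period. As a minimal sketch of that arithmetic, the snippet below uses purely hypothetical counts (not values from the dataset) to show how a small baseline can yield a very large percentage change, which is one possible reading of the arson figure.

```python
def percent_change(start: float, end: float) -> float:
    """Relative change from start to end, expressed as a percentage."""
    if start == 0:
        raise ValueError("percent change is undefined for a zero baseline")
    return (end - start) / start * 100.0

# Hypothetical counts for illustration only; these are NOT values from the study.
small_base = percent_change(3, 19)       # small baseline, very large percentage
large_base = percent_change(1000, 520)   # large baseline, moderate percentage
print(f"{small_base:.0f}%  {large_base:.0f}%")
```

Reporting the underlying counts alongside such percentages makes it easier to judge whether a sharp change reflects a substantive trend or a handful of incidents.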
3.4.3. Crime Trends, Grouped by Victim’s Gender, Across Crime Types
Figure 4 provides a comprehensive analysis of crime victimization trends by gender, offering valuable insights into the distribution and dynamics of various crime types. Overall, males are more frequently victimized in violent crimes such as assault, shooting, and homicide, while property crimes like larceny and burglary display a more balanced distribution between genders. In contrast, rape remains predominantly a gendered crime, with female victims vastly outnumbering males.
Violent crimes consistently highlight the disproportionate victimization of males. In assault (
Figure 4b), male victims outnumber females across the timeline, though both genders experienced a gradual decline in victim counts after 2020, suggesting improvements in crime prevention or intervention efforts. Similarly, shooting incidents (
Figure 4c) reveal a marked dominance of male victimization, peaking between 2015 and 2018 before declining steadily for both genders. This pattern reflects systemic factors such as urban violence, exposure to firearm-related risks, and societal norms that place males at higher risk [
52,
54]. Homicide data (
Figure 4d) further emphasize this disparity, with male victim counts consistently surpassing those for females. The persistence of higher male victimization in these crimes aligns with studies showing the association between masculinity, risky behavior, and increased exposure to violence [
55].
Property crimes such as larceny and burglary exhibit different patterns. In larceny (
Figure 4e), victimization is relatively balanced between genders, showing fluctuations but generally maintaining a stable trend over the years. In contrast, burglary (
Figure 4f) shows a decline in victim counts for both genders after 2021, which likely reflects the impact of improved community-based interventions, enhanced security measures, and effective law enforcement strategies targeting property crimes [
56]. Auto theft (
Figure 4i), on the other hand, experienced a peak in 2022 for both genders, followed by a significant decline in 2023. This drop may be attributed to advancements in vehicle anti-theft technologies and the growing role of digital tools in crime prevention [
57].
Rape (
Figure 4g) stands out as a profoundly gendered crime. Female victimization significantly exceeds male victimization throughout the timeline, with female victim counts peaking around 2020 before gradually declining. Male victimization remains minimal in this category, underscoring the stark gender-specific nature of sexual violence. These trends highlight the critical need for targeted policies addressing sexual violence and support systems for survivors [
58].
Arson (
Figure 4j), although reported less frequently overall, shows an even distribution of victimization between genders. Both genders display a declining trend, particularly over the last two years, reflecting broader reductions in arson incidents. This decline may indicate the success of fire prevention strategies and community awareness [
59].
The overall trends suggest that violent crimes such as assault, shootings, and homicide disproportionately impact males, reflecting systemic risks and societal patterns. In contrast, property crimes like larceny and burglary exhibit more balanced gender distributions, suggesting shared vulnerabilities. The gendered nature of rape underscores the critical need for gender-specific strategies to address and prevent sexual violence. The sharp declines observed in 2023 across most crime types may point to changes in crime reporting practices, enhanced law enforcement efforts, or broader societal shifts, such as post-pandemic recovery and community resilience initiatives. However, these patterns must be interpreted cautiously, considering potential biases in data reporting and collection.
These findings emphasize the need for targeted interventions tailored to the unique vulnerabilities of each gender. Male-focused programs addressing violence and systemic risk factors, alongside female-centered initiatives for sexual violence prevention and support, are critical to reducing these disparities. Strengthening community-based programs and enhancing data collection methods to capture intersectional insights—such as the interplay of gender with race and socio-economic status—would provide a more nuanced understanding of these dynamics. Additionally, future research should employ longitudinal and mixed-method approaches to uncover the causal factors behind these trends and inform equitable and effective crime prevention strategies.
3.4.4. Crime Trends According to the Victim’s Age Group Across Crime Types
The analysis of crime victimization trends across age groups, as illustrated in
Figure 5, reveals notable patterns that provide insights into the dynamics of various crimes. Individuals aged 30–40 and 40–50 experience the highest levels of victimization for many crime types, including aggravated assault, larceny, and auto theft. These findings suggest that individuals in their prime working years are particularly vulnerable to these types of crimes, due to their active lifestyles, frequent interactions with public spaces, and higher rates of economic activity.
Violent crimes such as aggravated assault, shooting, and homicide disproportionately affect younger adults aged 20–40. Aggravated assault (
Figure 5b) demonstrates a sharp concentration of victims in this age group, tapering off among older populations. This pattern reflects the increased exposure of younger individuals to high-risk environments and conflict-prone situations, such as urban centers or workplaces with elevated stress levels [
53]. Similarly, shootings (
Figure 5f) and homicides (
Figure 5e) follow this trend, with victimization peaking for individuals in their thirties. These crimes are often linked to systemic issues like poverty, social inequality, and gang-related activities, which disproportionately impact younger populations [
60,
61]. Rape (
Figure 5h) exhibits a distinct pattern, with victimization heavily concentrated among individuals aged 20–40. The sharp decline in cases for those aged 60 and above reflects reduced exposure to social environments where such crimes are more likely to occur and age-related lifestyle changes [
57].
Property crimes like larceny, auto theft, and burglary show broader age distributions but exhibit distinct trends. Larceny (
Figure 5a) emerges as one of the most widespread crimes, affecting individuals across all age groups. Victimization peaks in the 30–40 age range but remains significant among older adults aged 60–70, likely due to the targeting of elderly individuals for theft or fraud [
55]. Auto theft (
Figure 5i) similarly peaks for those aged 30–50, reflecting higher rates of vehicle ownership and usage in these groups. The decline in auto theft victimization after age 60 may correspond to reduced vehicle ownership among older adults [
57]. Burglary (
Figure 5d) disproportionately impacts individuals aged 30–60, with notable victimization seen in the 50–70 age range. This trend suggests that the homes of older individuals may be targeted due to perceived vulnerabilities, such as lower levels of home security or physical limitations.
Arson (
Figure 5c), while a less frequent crime overall, shows a relatively even distribution of victims across age groups. Individuals aged 30–40 and 40–50 experience slightly higher rates, which is consistent with the general trend of these age groups being more affected by various crimes. This even distribution reflects the sporadic nature of arson incidents and their dependence on factors unrelated to specific age groups.
The data collectively suggest that victimization risks vary significantly across age groups, with younger adults (20–40) being disproportionately affected by violent crimes such as aggravated assault, rape, shooting, and homicide. Conversely, older adults (60 and above) face a higher risk of property-related crimes, including larceny and burglary. These trends highlight the need for age-specific crime prevention strategies. For younger populations, prevention efforts should focus on reducing their exposure to environments that foster violent crime, such as urban areas with high crime rates or conflict-prone settings. For older individuals, initiatives should emphasize protection against property crimes, including enhanced home security systems, fraud awareness programs, and community-based support networks.
By analyzing crime trends through the lens of victim demographics, seasonal variations, and policy influences, this study underscores the complex interplay of systemic, socio-economic, and environmental factors shaping crime dynamics. The findings offer critical guidance for developing targeted interventions and data-driven strategies to reduce crime.
3.4.5. Clustering Analysis of Crime Trends Across Maryland Counties
Understanding regional crime dynamics is a critical aspect of criminology and public policy. This study examines crime rate trends for robbery, murder, and rape across Maryland counties using the K-means clustering approach to uncover patterns over time and across geographic areas. The choice of the K-means technique was motivated by its effectiveness in identifying distinct groups within large datasets, facilitating the analysis of geographic and temporal patterns in crime data.
Data preprocessing: Prior to clustering, the crime rate data were standardized using StandardScaler (scikit-learn 1.6.1). This step ensured that each variable contributed equally to the distance measure used in the K-means algorithm, which is essential for accurate clustering.
Dimensionality reduction: To manage the high-dimensional nature of crime data effectively, principal component analysis (PCA) was performed exclusively on the training data before clustering. This strategic application ensures that the test data remain untouched, preventing data leakage and preserving the integrity of the evaluation process. By reducing dimensionality while maintaining the significant variance, PCA enhances the performance of the K-means algorithm and ensures more accurate cluster formation.
Determination of optimal clusters (k): The selection of the optimal number of clusters (k) was guided by both the elbow method and silhouette score analysis. The elbow method involves plotting the within-cluster sum of squares (distortion) against a range of k-values from 1 to 10. The plot indicated an elbow at k = 5, with only minimal reductions in within-cluster variance beyond this point. Additionally, the silhouette score, which assesses cluster cohesion and separation, was calculated for k-values from 2 to 10 (the score is undefined for a single cluster). Although the highest score was at k = 2, k = 5 provided a reasonable balance between well-defined clusters and the granularity necessary for practical application to regional crime analysis. This informed our choice of k = 5, allowing for detailed yet meaningful groupings within the data.
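The three steps above (standardization, PCA, and the search over k) can be sketched as follows. This is a minimal illustration on synthetic data, not the study's code: the data matrix, the 90% explained-variance threshold for PCA, and the random seeds are all assumptions, and the train-only PCA fit described above is omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: 24 "counties" x 12 yearly rates (not the study's data).
X = rng.normal(size=(24, 12)) + rng.integers(0, 3, size=(24, 1)) * 4.0

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
# Keep enough components to explain ~90% of the variance (assumed threshold).
X_red = PCA(n_components=0.90, svd_solver="full").fit_transform(X_std)

inertia, silhouette = {}, {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_red)
    inertia[k] = km.inertia_                 # within-cluster sum of squares (elbow plot)
    if k >= 2:                               # silhouette is undefined for k = 1
        silhouette[k] = silhouette_score(X_red, km.labels_)

best_k_by_silhouette = max(silhouette, key=silhouette.get)
```

Plotting `inertia` against k gives the elbow curve; `silhouette` supplies the second criterion, and the two need not agree, as was the case in this study.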
The K-means clustering analysis revealed five distinct groups of counties exhibiting similar crime trends; a summary is provided in
Table 2. Time series plots visually display these trends for each crime type within the clusters, providing a comprehensive view that supports targeted public safety strategies. These findings are detailed through visualizations and are summarized in the corresponding tables.
Figure 6a–o illustrates the time series analysis of robbery, murder, and rape rates across the clusters. Each figure provides detailed trends, aiding the interpretation of clustering outcomes.
3.4.6. Contextual Analysis of Crime Rates Across Maryland Counties: Total and Average Crime Comparisons
This section explores the variations in crime rates across Maryland counties, with a focus on specific crime types such as aggravated assault, robbery, rape, murder, motor vehicle theft, and larceny theft. The analysis uses normalized crime rates for each county, providing a comprehensive overview of crime distribution patterns. Choropleth maps are utilized to visualize these rates, highlighting regional disparities and offering insights into crime dynamics across the state. General findings shown in
Table 3 include:
Urban centers: Higher rates of assault, robbery, and motor vehicle theft are concentrated in more urbanized regions such as Baltimore City and Prince George’s County.
Rural counties: Generally, these have lower rates across most crime types, with some exceptions like motor vehicle theft in specific counties.
Variability: Notable variability in crime rates like rape is seen across different counties, reflecting the diverse regional dynamics.
Figure 7 offers a series of maps that provide a comprehensive visual representation of the variation in crime rates across Maryland counties. Each map highlights significant geographic disparities in specific crime types, offering insights into their distribution throughout the state. The first map (
Figure 7a) focuses on aggravated assault rates, revealing how this crime is more prevalent in certain counties, illustrating the uneven spread across Maryland. Following this, the second map (
Figure 7b) details the variation in robbery rates, showing those areas that are disproportionately affected by this type of crime. Similarly, the third map (
Figure 7c) portrays the disparities in rape rates across the state, highlighting regions with particularly high incidences and underscoring the varied landscape of this serious offense. The fourth map (
Figure 7d) displays the distribution of murder rates, indicating areas with higher occurrences, which emphasizes the critical nature of targeted interventions in these regions. The fifth map (
Figure 7e) shows motor vehicle theft rates, providing a clear view of where this crime is more or less common and reflecting the challenges faced by different counties. Lastly, the sixth map (
Figure 7f) examines the rates of larceny theft, further illustrating the notable differences in crime rates from one county to another. Together, these maps serve as a powerful tool for visualizing the geographic spread of crime in Maryland, reflecting the unique challenges faced by different counties and supporting the development of targeted interventions and policies to address specific regional needs.
3.5. Train–Test Split
To evaluate the model’s generalizability and robustness, multiple train–test split ratios (80–20, 75–25, and 85–15) were tested. These splits ensured that the model’s performance could be assessed under varying conditions, balancing the training and testing datasets to reflect real-world scenarios. The 80–20 split consistently yielded the most balanced results, achieving R-squared values exceeding 0.90 across most models. The split provided sufficient data for training while maintaining a robust test set for evaluation. The 75–25 split favored models like neural networks and XGBoost, which demonstrated improved generalization with a slightly larger test set—the increased test size allowed for a more rigorous assessment of these models’ predictive capabilities. Meanwhile, the 85–15 split achieved high R-squared values for certain models (e.g., random forest and gradient boosting), but the reduced test set led to slightly diminished generalization, highlighting the trade-off between training and testing data size. Overall, the results indicate that the 80–20 split offers the best balance for most models, while the 75–25 split may be preferred when prioritizing generalization for complex models like neural networks.
All train–test splits were drawn from the available historical dataset (2012–2023), with no separate future time period held out for forecasting. The model’s performance is, therefore, evaluated on held-out contemporaneous data from this timeframe rather than on any data beyond 2023. In practical terms, the current phase of the study makes predictions only within the period of the observed data; it does not extend to forecasting future crime rates. This approach ensures that evaluation is based on known outcomes. While the methodology is forward-compatible (i.e., it could be applied to predict future crime rates, given appropriate historical training data), implementing such forecasts is beyond the scope of the present study.
3.6. Principal Component Analysis (PCA) for the Crime Rate Index
To enhance the analytical rigor of the study while simplifying the crime data representation, principal component analysis (PCA) was judiciously applied post-train–test split to ensure that no data leakage occurred. The process began with a meticulous standardization of all crime rate variables using Z-score normalization, a crucial step to ensure equitable contributions to the PCA without undue influence from variables with greater magnitude.
Subsequently, PCA was implemented solely on the training dataset, reducing dimensionality while capturing the most explanatory variance. This analysis unveiled a primary principal component that encapsulated dominant crime trends across various types. The loadings from this component suggested that high-impact crimes such as robbery and aggravated assault were most strongly predictive of regional crime disparities, particularly in urban settings.
To aid in the practical application of these findings, the resultant crime rate index was normalized between 0 and 1. This normalization not only facilitated comparisons across diverse Maryland counties but also enhanced the interpretability and utility of the index for policy-making and strategic law enforcement deployment.
The index has proven to be an efficacious predictor of crime trends, effectively reducing the complexity of the model while maintaining high predictive accuracy. By integrating PCA in this manner, the study offers a robust, streamlined view of crime dynamics, supporting the development of targeted, data-driven public safety initiatives across the state. The careful application of PCA in a scenario constrained to training data, corroborated by methodological transparency, establishes a solid foundation for the predictive model, ensuring its relevance and reliability in real-world applications.
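A minimal sketch of how such an index could be built is given below, under stated assumptions: the data are synthetic, and the sign-orientation step and the use of training-set bounds for the 0–1 rescaling are illustrative choices not described in the text.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(1)
# Synthetic crime-rate matrix: 200 rows x 6 crime types sharing one latent factor.
latent = rng.normal(size=(200, 1))
rates = latent @ rng.uniform(0.5, 1.5, size=(1, 6)) + 0.3 * rng.normal(size=(200, 6))

train, test = train_test_split(rates, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(train)       # z-score statistics from training rows only
pca = PCA(n_components=1).fit(scaler.transform(train))
if pca.components_[0].sum() < 0:           # the sign of a component is arbitrary;
    pca.components_ *= -1                  # orient it so higher rates raise the index

minmax = MinMaxScaler().fit(pca.transform(scaler.transform(train)))

def crime_index(x):
    """Project rows onto the first component and rescale with training-set bounds."""
    return minmax.transform(pca.transform(scaler.transform(x))).ravel()

index_train = crime_index(train)
index_test = crime_index(test)   # may fall slightly outside [0, 1]
```

Because every fitted object sees only the training rows, the test rows are transformed with parameters they did not influence, which is the leakage safeguard described above.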
3.7. Model Development
This study employs a broad array of machine learning models to analyze the complex crime patterns found across Maryland, focusing particularly on their adaptability to non-linear and high-dimensional data. Our selection includes diverse models (random forest, gradient boosting, XGBoost, neural networks, the extra trees regressor, support vector regression (SVR), and a stacking regressor), each known for robust performance in varied analytical contexts, which is essential for deriving nuanced insights across different urban and rural settings.
Ensemble Methods: Random Forest and Extra Trees Regressors
Both the random forest method and extra trees regressors utilize multiple decision trees to enhance the model’s predictive accuracy and ensure generalization across different contexts. The random forest method reduces overfitting by constructing each tree from a random subset of data and features, thereby proving effective even in noisy data environments. It also offers valuable insights through feature importance metrics, highlighting key factors influencing crime rates. In contrast, the extra trees regressor builds on the random forest methodology by training each tree on the entire dataset and selecting the split points randomly, which not only increases randomness but also significantly reduces model variance, enhancing the stability and reliability of predictions.
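The contrast between the two ensembles can be illustrated with scikit-learn, where it shows up in the default settings. This is a sketch on synthetic data; the sample size, feature count, and `max_features="sqrt"` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Random forest: bootstrap sample per tree, best split among a random feature subset.
rf = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                           random_state=0).fit(X_tr, y_tr)
# Extra trees: whole training set per tree (bootstrap=False by default),
# with split thresholds drawn at random.
et = ExtraTreesRegressor(n_estimators=100, max_features="sqrt",
                         random_state=0).fit(X_tr, y_tr)

scores = {"random_forest": rf.score(X_te, y_te), "extra_trees": et.score(X_te, y_te)}
top_feature = int(np.argmax(rf.feature_importances_))   # index of the most important feature
```

The `feature_importances_` attribute is the source of the importance rankings mentioned above; the values are normalized to sum to one.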
Boosting Methods: Gradient Boosting and XGBoost
The gradient boosting method and XGBoost implement a sequential approach to decision trees, where each tree incrementally corrects the errors of its predecessors, focusing particularly on challenging cases to enhance the model’s overall accuracy. Gradient boosting is valued for its adaptability across various loss functions and its extensive hyperparameter tuning capabilities, making it particularly effective for intricate crime datasets. XGBoost enhances these features by incorporating advanced regularization to prevent overfitting and system-level optimizations to improve performance, making it exceptionally well-suited for handling structured crime data, along with diverse socioeconomic and demographic features, with high precision.
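The sequential error-correction idea can be sketched by hand for the squared-error loss, where each new tree is fitted to the residuals of the current ensemble. This is a didactic simplification on synthetic data, not the library implementations used in the study; the learning rate, tree depth, and number of stages are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)   # synthetic non-linear target

learning_rate, n_stages = 0.1, 50
prediction = np.full_like(y, y.mean())              # stage 0: constant model
mse_per_stage = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    residuals = y - prediction                      # negative gradient of squared error
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a shrunken correction
    mse_per_stage.append(np.mean((y - prediction) ** 2))
```

The training error in `mse_per_stage` falls as stages are added; libraries such as XGBoost add regularization terms and subsampling on top of this basic loop to keep that reduction from turning into overfitting.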
Deep Learning: Neural Networks
Neural networks excel when modeling complex and non-linear relationships within large datasets. By employing multiple layers of interconnected neurons, these models uncover intricate patterns and interactions among predictors that might elude traditional models. For crime prediction, the networks adeptly integrate temporal trends, spatial distributions, and demographic data, providing a flexible and powerful tool for revealing subtle dynamics in crime occurrences. Although they offer less interpretability compared to tree-based models, their comprehensive assimilation of varied input variables is invaluable for in-depth crime analysis.
Support Vector Regression (SVR) and the Stacking Regressor
SVR is included for its robust performance in high-dimensional spaces and its ability to model nonlinear relationships using kernel functions, capturing intricate patterns in crime data that may be overlooked by other models. The stacking regressor, which aggregates predictions from several base models like random forest, gradient boosting, and XGBoost under a final estimator, notably enhances overall prediction accuracy by blending diverse model strengths. This meta-modeling strategy is crucial for achieving superior predictive performance by effectively synthesizing various learning algorithms.
The deployment of these models aligns perfectly with our objectives, not only to achieve high predictive accuracy but also to ensure that the results are practically interpretable, a vital aspect for supporting informed public safety strategies and policy recommendations. The robustness of these models in varied settings, coupled with their ability to balance computational efficiency with interpretative clarity, makes them ideal for predictive tasks in crime analysis.
Hyperparameter Optimization and Model Integration
Our study employed rigorous hyperparameter tuning using GridSearchCV across multiple machine-learning models to optimize their performance for crime data analysis. Key models used included random forest, gradient boosting, XGBoost, and CatBoost, each tailored to address specific challenges in modeling crime patterns.
For the random forest model, we experimented with a range of n_estimators (100, 200, and 300), max_depth (None, 10, and 20), and min_samples_split (2, 5, and 10), aiming to fine-tune the model’s complexity and enhance its generalization capabilities while avoiding overfitting.
Gradient boosting was optimized by adjusting n_estimators (100, 200, and 300), learning_rate (0.01, 0.05, and 0.1), and max_depth (3, 5, and 7). This setup ensured that each successive tree that was built incrementally improved upon the previous ones, thereby enhancing the model’s accuracy and efficiency.
XGBoost underwent detailed tuning for learning_rate (0.01 and 0.1), max_depth (3, 5, and 7), and subsample (0.8, 0.9, and 1.0), leveraging its advanced regularization to minimize overfitting and maximize performance.
CatBoost was similarly fine-tuned, focusing on depth (4, 6, and 8), iterations (100, 200, and 300), and learning_rate (0.01 and 0.1) to optimize its processing of categorical data and intricate dataset interactions.
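As a sketch of how such a search is set up, the snippet below encodes the random forest grid listed above and counts the fits it implies. The five folds are an assumption based on the fivefold cross-validation described in Section 3.7.1, and the reduced `demo_grid` and synthetic data are illustrative stand-ins so that the example runs quickly.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Random forest grid as listed in the text.
rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
n_candidates = len(ParameterGrid(rf_grid))   # 3 * 3 * 3 combinations
n_fits = n_candidates * 5                    # one fit per candidate per fold (5 folds assumed)

# Reduced grid so the demonstration runs quickly; swap in rf_grid for the full search.
demo_grid = {"n_estimators": [25, 50], "max_depth": [None, 10]}
X, y = make_regression(n_samples=150, n_features=6, noise=5.0, random_state=0)
search = GridSearchCV(RandomForestRegressor(random_state=0), demo_grid,
                      cv=5, scoring="r2")
search.fit(X, y)
best_params, best_cv_r2 = search.best_params_, search.best_score_
```

The same pattern applies to the other models by replacing the estimator and the grid dictionary.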
Furthermore, we incorporated advanced ensemble techniques to enhance the model’s predictive accuracy. The voting regressor integrated outputs from various models to stabilize predictions by reducing variance. In contrast, the stacking regressor applied a meta-model to exploit the diverse strengths of base models, significantly boosting the overall model efficacy. These strategic enhancements ensured that each model and ensemble technique not only performed optimally on its own but also contributed to a robust, comprehensive predictive framework. This approach underscores our commitment to employing advanced machine learning to generate actionable insights, thereby influencing policy and enhancing public safety effectively.
3.7.1. Cross-Validation for Robust Model Assessment
A detailed fivefold cross-validation process was rigorously applied to test the effectiveness and generalizability of our models across diverse data subsets. This method not only confirmed the models’ robustness but also ensured their reliability for practical applications. The choice of fivefold cross-validation, specifically, was driven by its ability to provide a balanced assessment, reducing both variance and bias. Each fold significantly contributed to tuning the model parameters and selecting the most robust model, addressing potential data variability and imbalance. This approach maximized the use of the dataset, ensuring that each data point was utilized in both training and validation, which minimized biases in model evaluation and enhanced the findings’ applicability and generalizability.
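The fold structure can be made concrete with a short sketch. The data are synthetic and the gradient boosting model is only a placeholder; shuffling with a fixed seed is an assumption.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Each row lands in exactly one validation fold and in the training part of the other four.
times_validated = np.zeros(len(X), dtype=int)
for _, val_idx in cv.split(X):
    times_validated[val_idx] += 1

fold_r2 = cross_val_score(GradientBoostingRegressor(random_state=0), X, y,
                          cv=cv, scoring="r2")
mean_r2, std_r2 = fold_r2.mean(), fold_r2.std()
```

Reporting the spread of `fold_r2` alongside its mean gives a direct view of how sensitive a model is to the particular subset it was validated on.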
3.7.2. Performance Metrics and Feature Engineering
Models were evaluated on R-squared and mean squared error metrics, with most models achieving R-squared values above 0.90, indicating superior predictive power. Enhancements in predictive capability were driven by sophisticated feature engineering techniques, including the creation of interaction terms and recursive feature elimination (RFE) to pinpoint crucial predictors and streamline model inputs.
The comprehensive application of these models and methodologies underpins our ability to provide detailed, data-driven insights into crime prevention and policymaking across Maryland.
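For reference, the two evaluation metrics can be computed directly from their definitions. The values below are made up for illustration and are not results from the study.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average squared difference between truth and prediction."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values for illustration only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

example_mse = mse(y_true, y_pred)        # (4 + 4 + 9 + 1) / 4
example_rmse = example_mse ** 0.5        # same units as the target variable
example_r2 = r_squared(y_true, y_pred)   # 1 - 18 / 500
```

An R-squared of 0.90 therefore means that the residual sum of squares is one tenth of the variation around the mean of the target.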
3.7.3. Reproducibility
To ensure reproducibility, we documented the parameters and configurations used for each model.
Random forest: The model utilized 200 trees with a maximum depth of 20 and a minimum sample split of 5. This balance aims to optimize complexity against the risk of overfitting.
XGBoost: The model was configured to achieve optimal performance with 300 estimators, a learning rate of 0.1, a max depth of 3, and a subsampling rate of 0.8. These settings were refined through an extensive GridSearchCV process involving 405 fits.
Extra trees regressor: This was operated with 200 trees and a maximum depth of 10. This model demonstrated its efficacy by achieving an R2 score of 0.93, reflecting its capability to manage high-dimensional and complex datasets.
Neural network: The network was configured with varying layers and parameters across different scenarios. Notably, in one scenario involving advanced dropout and batch normalization settings, it reached an R2 score of 0.88, showcasing its capacity to adapt and model complex interactions effectively.
3.7.4. Scope of Predictions
It is important to clarify that all model development and validation in this study was conducted on historical data up to 2023. The predictions generated by these models are confined to the data within this period, meaning that the models are not yet used to project crime rates beyond the timeframe of the dataset. This design decision ensures that we are evaluating model accuracy against known outcomes (historical crime data), rather than attempting speculative future predictions. While the modeling framework is forward-compatible and could be adapted for true time-series forecasting (using past data to predict future crime rates), such an application lies outside the scope of the current phase. Future work will address this limitation by extending the model to forecast crime trends in upcoming years.
3.8. Model Comparisons and Performances
To robustly evaluate model performance and guard against overfitting, multiple train–test splits were employed. In addition to the conventional 80/20 split, we assessed each model under 70/30, 75/25, and 85/15 train–test splits. This approach provides insight into model robustness across varying training set sizes and data partitions. By comparing performance across these splits, we can identify models that consistently perform well (high average R2 and low RMSE) and that exhibit low variance in metrics, indicating stable generalization.
Table 4 summarizes the mean R2 and RMSE achieved by each model across the four train–test splits (70/30, 75/25, 80/20, and 85/15). The variance of each metric across the splits is included as an indicator of model stability (lower variance indicates more consistent performance across the different data splits).
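The aggregation behind such a summary can be sketched as follows. The data are synthetic, the random forest is a placeholder for any of the models, and the use of the population variance (NumPy's default) is an assumption, since the text does not state which variance estimator was used.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
test_sizes = [0.30, 0.25, 0.20, 0.15]    # the 70/30, 75/25, 80/20, and 85/15 splits

r2_scores, rmse_scores = [], []
for test_size in test_sizes:
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size,
                                              random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    r2_scores.append(r2_score(y_te, pred))
    rmse_scores.append(np.sqrt(mean_squared_error(y_te, pred)))

summary = {
    "mean_r2": np.mean(r2_scores), "var_r2": np.var(r2_scores),
    "mean_rmse": np.mean(rmse_scores), "var_rmse": np.var(rmse_scores),
}
```

With only four splits, the variance is a coarse stability indicator; it is best read comparatively across models rather than as an absolute quantity.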
As shown in
Table 5 and
Figure 8, the linear models (ordinary linear regression and its regularized variants) yielded comparatively poor performance. Their average R2 scores remain modest (only 0.45–0.58) with relatively high RMSE values (on the order of 13–17). In particular, lasso regression underperformed most noticeably, achieving an R2 of just 0.45 (markedly lower than the ridge model or OLS) and an RMSE of around 17.5, indicating substantial prediction errors. The lasso model’s aggressive feature selection (driving many coefficients to zero) likely led to underfitting; even with its best-tuned regularization parameter (α ≈ 0.1, see
Table 6), it failed to capture enough of the variance in the data. In contrast, ridge regression (with a moderate α ≈ 0.5) retained more predictive features and attained slightly higher accuracy (R2 ≈ 0.58), although it, too, fell short of the more complex models’ performance figures. Overall, the linear approaches struggled to model the complex relationships in the dataset, as evidenced by their lower R2 and higher RMSE values.
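The difference in how the two penalties treat coefficients can be seen in a small sketch using the α values quoted above. The data are synthetic, with only two informative features, so this illustrates the mechanism rather than reproducing the study's results.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 10)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)   # only two features matter

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: sets weak coefficients exactly to zero
ridge = Ridge(alpha=0.5).fit(X, y)   # L2 penalty: shrinks coefficients but keeps them non-zero

n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))
```

In this toy setting, the sparsity is harmless because the discarded features carry no signal. When predictive information is spread thinly across many correlated features, as may be the case for the crime data, the same mechanism can discard useful predictors and lead to underfitting.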
The tree-based ensemble models and other advanced regressors dramatically outperformed the linear models, achieving both a better fit and lower variability across splits. For instance, the random forest model reached an average R2 of about 0.85, with an RMSE near 8.4, a substantial improvement over any linear model. The gradient boosting and XGBoost models likewise showed strong performance (R2 ≈ 0.80–0.83, RMSE 9–10), although the CatBoost algorithm slightly edged them out (R2 ≈ 0.84). Extra trees (an ensemble of extremely randomized trees) performed comparably to the random forest model (R2 ≈ 0.82). Notably, these ensemble approaches not only achieved higher predictive accuracy but also exhibited lower variance in R2 and RMSE across the different data splits. For example, the random forest model’s R2 variance across the four splits was only about 0.003, compared to 0.012 for the lasso model. This suggests that the random forest model’s performance was consistently high and less sensitive to which subset of data was used for training, whereas the lasso model’s results fluctuated more, an indication that the linear model was less robust. Among non-tree models, support vector regression (SVR with an RBF kernel) also yielded better results than linear regression (R2 ≈ 0.75; RMSE ≈ 12.3), although it did not reach the accuracy levels of the ensemble tree models. The SVR’s performance variance was moderate, reflecting some sensitivity to data splits (likely due to the need to tune the kernel parameters for different data subsets). Finally, the stacking ensemble proved to be the top performer: by combining multiple algorithms (in our case, random forest, XGBoost, and SVR as the base learners, with a ridge regression meta-learner), the stacking regressor achieved the highest overall R2 (0.88) and the lowest RMSE (~7.0) among all the tested models. This stacked model effectively leveraged the complementary strengths of its constituents and maintained a very low variance in performance (R2 variance ≈ 0.002), indicating excellent stability across the various train–test splits.
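The stacking idea can be illustrated with a small, dependency-free sketch. It is not the study's implementation: to stay self-contained it swaps in a simple linear fit and a k-nearest-neighbours regressor as base learners in place of random forest, XGBoost, and SVR, and it keeps only the ridge meta-learner. The function names, the fold count, and the synthetic data are all assumptions. The key step is that the meta-learner is trained on out-of-fold base predictions, so it never sees predictions made on data the base learners were fitted to.

```python
import math
import random

def fit_linear(xs, ys):
    """Stand-in base learner 1: ordinary least squares line."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return lambda x: a + b * x

def fit_knn(xs, ys, n_neighbors=5):
    """Stand-in base learner 2: average of the nearest training targets."""
    def predict(x):
        nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x))[:n_neighbors]
        return sum(ys[i] for i in nearest) / len(nearest)
    return predict

def fit_ridge_2d(features, ys, alpha=1.0):
    """Ridge meta-learner with intercept on exactly two features (closed form)."""
    n = len(ys)
    m0 = sum(f[0] for f in features) / n
    m1 = sum(f[1] for f in features) / n
    my = sum(ys) / n
    s00 = sum((f[0] - m0) ** 2 for f in features) + alpha
    s11 = sum((f[1] - m1) ** 2 for f in features) + alpha
    s01 = sum((f[0] - m0) * (f[1] - m1) for f in features)
    t0 = sum((f[0] - m0) * (y - my) for f, y in zip(features, ys))
    t1 = sum((f[1] - m1) * (y - my) for f, y in zip(features, ys))
    det = s00 * s11 - s01 * s01
    w0 = (t0 * s11 - t1 * s01) / det
    w1 = (t1 * s00 - t0 * s01) / det
    return lambda f: my + w0 * (f[0] - m0) + w1 * (f[1] - m1)

def stack_fit(xs, ys, base_fitters, n_folds=5, seed=0):
    """Train two base learners and a ridge meta-learner on out-of-fold predictions."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    oof = [[0.0] * len(base_fitters) for _ in xs]
    for held_out in folds:
        held = set(held_out)
        tr_x = [xs[i] for i in idx if i not in held]
        tr_y = [ys[i] for i in idx if i not in held]
        for j, fitter in enumerate(base_fitters):
            model = fitter(tr_x, tr_y)
            for i in held_out:
                oof[i][j] = model(xs[i])  # prediction for a point the model never saw
    meta = fit_ridge_2d(oof, ys)
    bases = [fitter(xs, ys) for fitter in base_fitters]  # refit on all training data
    return lambda x: meta([base(x) for base in bases])

# Synthetic data for illustration only: a linear trend plus a non-linear wiggle.
rng = random.Random(7)
xs = [i / 10 for i in range(100)]
ys = [0.5 * x + math.sin(2 * x) + rng.gauss(0.0, 0.2) for x in xs]

stacked = stack_fit(xs, ys, [fit_linear, fit_knn])
print(f"stacked prediction at x = 5.0: {stacked(5.0):.3f}")
```

In practice a library implementation such as scikit-learn's `StackingRegressor` follows the same pattern with arbitrary base estimators; the sketch is only meant to show why complementary learners can be combined without leaking training targets into the meta-learner.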
To ensure that these complex models did not overfit, a rigorous cross-validation and hyperparameter tuning procedure was employed. For each model, we performed an extensive grid search over key hyperparameters, using k-fold cross-validation (typically k = 10) on the training data of each split. Model configurations were therefore chosen based on their average validation performance across the k folds, rather than on training performance alone, which guards against selecting an overly complex model that performs well on training data but poorly on unseen data. For example, the best random forest model was found to have a maximum tree depth of 10 and around 100 trees (estimators), a configuration that balances model complexity and generalization. Deeper or unbounded trees could memorize the training data, but cross-validation revealed that a depth of about 10 was optimal, likely because this prevents overfitting the smaller training sets. Similarly, the gradient boosting models (XGBoost, CatBoost, etc.) were tuned with a learning rate of ~0.1 and moderate tree depths (3–6), with early stopping rounds or regularization applied to curb overfitting (see Table 6 for details). For the SVR model, an RBF kernel with C ≈ 10 and γ ≈ 0.1 was found to be best. Each model’s chosen hyperparameters, along with its resulting test performance, are detailed in Table 5 and Table 6. By selecting model settings based on the cross-validation performance, we ensured that each algorithm’s capacity was appropriately constrained. This is reflected in the relatively low variance of the test scores across different splits: the models tuned in this manner maintained a stable performance, which is a strong indication that overfitting was minimized.
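The selection rule described above (pick the configuration with the best average validation score over the folds) can be sketched in a few lines. The example is an assumption-laden stand-in, not the study's pipeline: it tunes the neighbourhood size of a simple k-nearest-neighbours regressor instead of tree depth or learning rate, and the data, grid values, and function names are invented for illustration. Only the 10-fold setting mirrors the text.

```python
import math
import random

def knn_predict(train_x, train_y, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def kfold_indices(n_samples, n_folds, seed=0):
    """Shuffle the indices and deal them into n_folds roughly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]

def cv_rmse(xs, ys, n_neighbors, n_folds=10):
    """Mean validation RMSE over the folds for one hyperparameter value."""
    fold_scores = []
    for held_out in kfold_indices(len(xs), n_folds):
        held = set(held_out)
        tr_x = [xs[i] for i in range(len(xs)) if i not in held]
        tr_y = [ys[i] for i in range(len(xs)) if i not in held]
        sq_err = [(ys[i] - knn_predict(tr_x, tr_y, xs[i], n_neighbors)) ** 2 for i in held_out]
        fold_scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(fold_scores) / len(fold_scores)

def grid_search(xs, ys, grid, n_folds=10):
    """Return (best_value, scores); best minimizes the mean validation RMSE."""
    scores = {value: cv_rmse(xs, ys, value, n_folds) for value in grid}
    return min(scores, key=scores.get), scores

# Synthetic data for illustration only: a noisy sine curve.
rng = random.Random(1)
xs = [i / 10 for i in range(120)]
ys = [math.sin(x) + rng.gauss(0.0, 0.3) for x in xs]

best, scores = grid_search(xs, ys, grid=[1, 3, 5, 9, 15, 61])
for value, score in scores.items():
    print(f"n_neighbors = {value:>2}: mean validation RMSE = {score:.3f}")
print("selected n_neighbors:", best)
```

Because every candidate is scored only on held-out folds, a configuration that merely memorizes its training points gains no advantage, which is the same reasoning the text applies to unbounded tree depth.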
To further assess the generalizability and robustness of each model, we analyzed the variation in the root mean square error (RMSE) across different train–test split ratios (80–20, 75–25, 70–30, and 85–15). As shown in Figure 9, linear models such as linear regression and ridge regression exhibited relatively higher RMSE values across all splits, with noticeable increases at the 85–15 split, suggesting limited adaptability to changes in the data partition. In contrast, tree-based ensemble models, particularly the extra trees model and CatBoost, consistently achieved the lowest RMSE values with minimal fluctuation across all splits, indicating strong resistance to overfitting and excellent predictive stability. The stacking regressor and gradient boosting also demonstrated robust performance, maintaining low RMSEs with limited sensitivity to changes in data partitioning. Lasso regression, on the other hand, performed poorly across all configurations, further supporting its unsuitability for modeling the non-linear, high-dimensional structure of the dataset. These results validate the finding that our top-performing models not only offer high predictive accuracy but also sustain performance across varied data availability conditions, reinforcing their reliability for real-world crime prediction tasks.
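A split-sensitivity check of this kind might look as follows. The sketch uses the four train fractions named in the text, but everything else is assumed for illustration: the model is a plain least-squares line, the data are synthetic, and the helper names are hypothetical.

```python
import math
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def rmse_by_split(xs, ys, train_fractions, seed=0):
    """Shuffle once, then fit and score the model at each train fraction."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = {}
    for frac in train_fractions:
        cut = int(round(frac * len(idx)))
        train, test = idx[:cut], idx[cut:]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a + b * xs[i] for i in test]
        scores[frac] = rmse([ys[i] for i in test], preds)
    return scores

# Synthetic data for illustration only: a noisy linear trend.
rng = random.Random(42)
xs = [float(i) for i in range(100)]
ys = [3.0 + 0.5 * x + rng.gauss(0.0, 2.0) for x in xs]

scores = rmse_by_split(xs, ys, [0.80, 0.75, 0.70, 0.85])
for frac, score in scores.items():
    print(f"train fraction {frac:.2f}: RMSE = {score:.3f}")
spread = statistics.pvariance(list(scores.values()))
print(f"variance of RMSE across splits = {spread:.4f}")
```

A small variance across the four fractions is what the text refers to as stability; a model whose RMSE moves substantially between splits is more dependent on which records happen to fall in the training set.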
In addition to the tree-based and ensemble methods, several neural network architectures were evaluated to gauge their performance and generalizability. The neural network models for crime data analysis combined several techniques to enhance performance. We used a multilayer architecture in which each dense layer was paired with dropout regularization and batch normalization, aiming to manage model complexity and limit overfitting. Notably, the EarlyStopping and ReduceLROnPlateau callbacks played a critical role in our training strategy. EarlyStopping halted training once the model ceased showing performance improvements, thereby preventing overtraining, while ReduceLROnPlateau dynamically adjusted the learning rate in response to training progress, optimizing the model’s learning phase and leading to improved predictive accuracy and efficiency. Table 7 summarizes the results of the different neural network configurations.
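The interaction of the two callbacks can be made concrete by replaying a sequence of validation losses through their decision rules. The sketch below is a simplified model of that logic, not the Keras implementation (which also offers options such as a minimum improvement threshold and a cooldown period). The patience values, reduction factor, initial learning rate, and loss sequence are all assumptions; the study does not report its settings here.

```python
def train_with_callbacks(val_losses, lr=1e-3, stop_patience=5,
                         lr_patience=2, lr_factor=0.5, min_lr=1e-6):
    """Replay per-epoch validation losses and apply both rules.

    Returns (stopped_epoch, learning_rate_history).
    """
    best = float("inf")
    epochs_since_best = 0    # drives early stopping
    epochs_since_cut = 0     # drives learning-rate reduction
    lr_history = []
    for epoch, loss in enumerate(val_losses):
        lr_history.append(lr)            # rate used during this epoch
        if loss < best:
            best = loss
            epochs_since_best = 0
            epochs_since_cut = 0
        else:
            epochs_since_best += 1
            epochs_since_cut += 1
        if epochs_since_cut >= lr_patience:
            lr = max(lr * lr_factor, min_lr)   # plateau: shrink the step size
            epochs_since_cut = 0
        if epochs_since_best >= stop_patience:
            return epoch, lr_history           # no progress: stop training
    return len(val_losses) - 1, lr_history

# Made-up validation losses: quick progress, then a plateau.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.705, 0.73, 0.74, 0.75, 0.9]

stopped_epoch, lr_history = train_with_callbacks(val_losses)
print("stopped at epoch index:", stopped_epoch)
print("learning rate per epoch:", lr_history)
```

With a shorter patience for the learning-rate rule than for the stopping rule, the optimizer gets a few chances at a smaller step size before training is abandoned, which is the usual reason for pairing the two.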
Based on these robust performance metrics and comprehensive validation checks, our study identifies the stacking regressor as the chosen model. Before relying on it for further analysis, it is important to establish its statistical reliability and robustness. We conducted several diagnostic tests to verify its adherence to key regression assumptions. The rainbow test showed no evidence of nonlinearity (p = 0.298). Similarly, the Durbin–Watson statistic of approximately 2.07 indicated no significant autocorrelation among the residuals. Although the Breusch–Pagan test did reveal some heteroscedasticity (p = 0.005), this did not significantly detract from the model’s validity, considering its strong cross-validated performance. Additionally, the Spearman correlation analysis ruled out any extreme multicollinearity among predictors, thereby affirming the model’s stability and interpretability. Furthermore, to ensure that our models were not overly complex or specifically tailored to our dataset, we implemented strategies such as pruning techniques for tree-based models, regularization parameters in linear models, and early stopping in gradient boosting. These measures help prevent overfitting while maintaining model accuracy and generalizability. The results indicate that the stacking regressor, chosen for its high accuracy, largely meets the essential assumptions for reliable regression analysis and is, thus, well-suited for further deployment in predictive tasks.
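Of the diagnostics listed, the Durbin–Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. Values near 2 indicate no first-order autocorrelation, values toward 0 indicate positive autocorrelation, and values toward 4 indicate negative autocorrelation. The residual sequences below are toy inputs chosen to show the two extremes; they are not the study's residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Toy residual sequences for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]   # strong negative autocorrelation
persistent = [1.0, 1.0, 1.0, 1.0]      # strong positive autocorrelation

print("alternating residuals:", durbin_watson(alternating))
print("persistent residuals: ", durbin_watson(persistent))
```

A reported value of about 2.07 sits close to the neutral midpoint of this 0 to 4 range, which is the basis for the statement that the residuals show no significant autocorrelation.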
In summary, our comprehensive evaluation across multiple train–test splits underscores the comparative robustness and superior performance of advanced predictive models over their linear counterparts. The linear models, while straightforward and interpretable, consistently demonstrated lower predictive capabilities, affirming the presence of significant nonlinear complexities within our dataset that these models fail to address. Conversely, tree-based ensembles and gradient boosting methods have not only achieved marked improvements in accuracy—surpassing linear models by a margin of 0.25 to 0.40 in R2 scores—but have also maintained low variance in these metrics across various splits, highlighting their strong generalizability.
In particular, the stacking ensemble has distinguished itself as the most effective model in terms of raw accuracy. Its success is attributed not only to its high performance but also to its consistent results across different dataset partitions, which speaks to its robustness against overfitting, a potential concern for complex models. This robustness was supported by rigorous cross-validation, which identified configurations that were overfitted to specific segments of the data so that they could be excluded during tuning.
The application of these models to contemporary crime datasets has proven highly promising, setting a solid foundation for future predictive tasks. This aligns with our study’s goals of utilizing advanced analytical techniques to enhance crime forecasting and prevention strategies. As we move forward, the stacking model, in particular, will serve as a cornerstone for ongoing research and application in this field, promising not only theoretical insights but also practical benefits in terms of public safety and policy formulation.
4. Discussion
This study integrates advanced machine learning techniques with spatial and temporal crime analysis, offering a nuanced understanding of crime dynamics across Maryland counties. By examining crime rates through clustering, predictive modeling, and socioeconomic correlations, this research provides actionable insights into regional crime trends and their underlying determinants. The key findings are discussed below.
Urban and Rural Crime Dynamics: Urban centers, particularly Baltimore City and Prince George’s County, consistently exhibited higher rates of violent crimes such as aggravated assault, robbery, and murder. These findings highlight the socioeconomic challenges faced by urban areas, including concentrated poverty, unemployment, and limited access to essential resources. These dynamics align with the findings of the existing literature, which emphasizes the role of structural inequities in perpetuating urban crime [57]. Addressing these challenges requires sustained, multifaceted interventions such as enhanced policing, economic revitalization, and community support programs.
In contrast, rural counties display sporadic spikes in crime rates, which are often linked to localized factors such as economic stress, community dynamics, and limited law enforcement capacity. These findings emphasize the limitations of uniform crime prevention strategies, suggesting the need for adaptive, community-specific approaches tailored to rural areas [61].
Clustering Analysis and Regional Insights: The clustering analysis revealed distinct regional crime trends and provided valuable insights for targeted interventions:
Cluster 0 counties (e.g., Anne Arundel and Montgomery): These counties exhibited gradual declines in robbery rates but saw recent increases in rape and murder rates. This variability indicates a shifting landscape of crime that necessitates a re-evaluation of resource allocation and prevention strategies.
Cluster 2 (Baltimore City): Persistently high crime rates across all categories reflect the structural and systemic challenges faced by metropolitan regions. These findings reinforce the need for long-term, integrative policies addressing socioeconomic inequities and systemic vulnerabilities.
Other clusters: Rural counties, such as those in Cluster 4, demonstrated lower overall crime rates but showed periodic spikes in specific categories, such as motor vehicle theft. These patterns highlight the importance of regional and localized interventions.
The visualization of these clusters using choropleth maps provides policymakers with an intuitive and granular understanding of regional crime dynamics, facilitating the evidence-based prioritization of resources and interventions.
Machine Learning and Predictive Accuracy: The study demonstrated the efficacy of ensemble machine learning models, such as random forest and gradient boosting, for predicting crime rates with high accuracy (R-squared > 90%). These models effectively captured the non-linear relationships between socioeconomic factors and crime trends, offering a scalable framework for predictive crime analysis in other regions.
Furthermore, the integration of principal component analysis (PCA) enabled the development of a composite crime index, which streamlined the analysis without sacrificing critical information about individual crime types. This index proved instrumental in terms of cross-county comparisons, enhancing the interpretability of the results and providing a robust tool for identifying high-crime areas.
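One common way to build such a composite index is to standardize each crime rate and project the counties onto the first principal component. The sketch below takes that route as an assumption, since the text does not state how many components were retained or how they were weighted. The input rows are made-up rates for five hypothetical counties, and the leading eigenvector is found by power iteration to avoid external dependencies.

```python
import math

def standardize(rows):
    """Z-score each column using the population standard deviation."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def first_component(z_rows, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration.

    Starts from a uniform vector, which works when the variables are
    positively correlated (as crime rates typically are).
    """
    n, d = len(z_rows), len(z_rows[0])
    cov = [[sum(r[a] * r[b] for r in z_rows) / n for b in range(d)] for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def composite_index(rows):
    """Score each row by its projection onto the first principal component."""
    z = standardize(rows)
    v = first_component(z)
    return [sum(zi * vi for zi, vi in zip(row, v)) for row in z]

# Made-up rates per 100,000 residents for five hypothetical counties
# (columns: robbery, aggravated assault, burglary). Not the study's data.
rates = [
    [120.0, 310.0, 400.0],
    [35.0, 90.0, 210.0],
    [60.0, 150.0, 260.0],
    [15.0, 70.0, 180.0],
    [200.0, 480.0, 520.0],
]

for county, score in enumerate(composite_index(rates)):
    print(f"county {county}: composite index = {score:+.2f}")
```

Because the index is a weighted sum of standardized rates, counties can be ranked on a single axis while each crime type still contributes in proportion to how strongly it co-varies with the others.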
Enhancements in Predictive Power through Feature Engineering: Recursive feature elimination (RFE) was employed to identify the most impactful predictors from an extensive dataset encompassing socioeconomic, demographic, and crime-specific variables. This method systematically evaluates the contribution of each feature to model performance by iteratively removing the least significant predictors and re-training the model. The importance of the selected features was validated through ensemble models, ensuring robustness in feature selection; the results are shown in Figure 10.
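The elimination loop described above can be sketched as follows. The estimator used inside RFE is not specified in the text, so this example assumes a linear regression on standardized features and ranks features by the absolute value of their coefficients; the feature names and the synthetic data are hypothetical. Library implementations such as scikit-learn's `RFE` class follow the same fit, rank, drop, and refit cycle with any estimator that exposes importances.

```python
import math
import random

def standardize(rows):
    """Z-score each column so coefficient magnitudes are comparable."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols_coefficients(rows, ys):
    """Least-squares slopes for centered data via the normal equations."""
    d = len(rows[0])
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]
    xty = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(d)]
    return solve(xtx, xty)

def rfe(rows, ys, names, n_keep):
    """Refit and drop the feature with the smallest |coefficient| until n_keep remain."""
    active = list(range(len(names)))
    eliminated = []
    while len(active) > n_keep:
        sub = [[r[j] for j in active] for r in rows]
        coefs = ols_coefficients(sub, ys)
        weakest = min(range(len(active)), key=lambda k: abs(coefs[k]))
        eliminated.append(names[active.pop(weakest)])
    return [names[j] for j in active], eliminated

# Synthetic data for illustration only; feature names are hypothetical.
rng = random.Random(3)
names = ["feature_a", "feature_b", "feature_c"]
raw = [[rng.gauss(0.0, 1.0) for _ in names] for _ in range(200)]
target = [3.0 * r[0] + 1.0 * r[2] + rng.gauss(0.0, 0.1) for r in raw]  # feature_b is irrelevant

x = standardize(raw)
mean_target = sum(target) / len(target)
y = [t - mean_target for t in target]

kept, eliminated = rfe(x, y, names, n_keep=1)
print("kept:", kept)
print("eliminated (in order):", eliminated)
```

Refitting after each removal matters because the importance of the remaining features can change once a correlated predictor is dropped, which is what distinguishes RFE from a one-shot ranking.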
Key Findings from RFE Application:
Unemployment rate: This is strongly correlated with increased rates of property crimes such as burglary and larceny, particularly in urban counties.
Population below the poverty line: This factor is highlighted as a critical driver for aggravated assault and robbery rates.
Domestic and international migration trends: These trends played a significant role in predicting regional variations in crime spikes, especially in rural clusters.
Median household income: This is directly linked to the overall crime index, showcasing the disparities in socioeconomic conditions across counties.
By prioritizing these variables, the RFE application contributed to the development of a streamlined, interpretable model that minimized redundancy while preserving predictive accuracy, and it facilitated actionable insights for policymakers. For instance, the identification of migration trends as a key predictor underscores the need for localized community support programs in counties experiencing high influxes of residents. Similarly, the strong influence of economic variables emphasizes the importance of integrating socioeconomic revitalization efforts into crime prevention strategies.
5. Conclusions
The methodology employed in this research effectively integrates traditional statistical approaches with advanced machine learning algorithms to develop a robust model for predicting crime rates in Maryland counties. The inclusion of diverse socioeconomic indicators such as unemployment rates, migration patterns, and income levels enhances the model’s ability to capture the intricate relationships between these variables and crime trends. This multi-faceted approach enables a comprehensive understanding of the factors driving crime in both urban and rural settings. The findings offer several actionable insights for policymakers and practitioners.
Age-specific Crime Interventions: Individuals aged 30–50 are disproportionately affected by property crimes, while younger demographics (20–40) face heightened risks of violent crimes, such as aggravated assault and rape, as well as auto theft. These findings call for targeted prevention strategies, including economic support and job creation programs for younger populations and enhanced property protection measures for older age groups.
Gender and Racial Disparities: This study confirms significant gender and racial disparities in victimization, with males being disproportionately affected by violent crimes like shootings and aggravated assault and with minority groups experiencing higher rates of violent victimization. These findings emphasize the need for tailored interventions, such as strengthening community policing in minority neighborhoods and expanding support services for at-risk populations.
Urban vs. Rural Crime Trends: The distinct crime patterns between urban and rural areas suggest the necessity of differentiated resource allocation. Urban areas like Baltimore City require sustained, long-term interventions to address persistently high violent crime rates, while rural counties benefit more from adaptive, community-specific strategies that address localized crime drivers.
Modeling Performance and Scalability: The neural network model consistently outperformed other machine learning techniques, achieving an R-squared value of over 90%. The ensemble methods further enhanced the model’s predictive accuracy, reaching 0.95 in the 85–15 training–testing split. These results underscore the importance of employing advanced algorithms for high-dimensional, non-linear datasets and demonstrate their applicability in crime analysis across diverse contexts.
This research contributes to the growing field of data-driven crime prevention studies by combining socioeconomic insights with state-of-the-art predictive modeling. By addressing the complexities of crime dynamics in urban and rural settings, the study offers a scalable and adaptable framework for other regions. Future efforts should focus on integrating real-time data sources and expanding the model’s applicability to account for evolving socioeconomic and environmental conditions. These advancements will further enhance the ability of policymakers to develop informed, equitable, and effective crime prevention strategies.