Article

Entity Extraction of Key Elements in 110 Police Reports Based on Large Language Models

by Xintao Xing and Peng Chen *
School for Information and Cyber Security, People’s Public Security University of China, Beijing 100038, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7819; https://doi.org/10.3390/app14177819
Submission received: 8 July 2024 / Revised: 28 August 2024 / Accepted: 2 September 2024 / Published: 3 September 2024

Abstract: With the rapid advancement of Internet technology and the increasing volume of police reports, relying solely on extensive human labor and traditional natural language processing methods for key element extraction has become impractical. Applying advanced technologies such as large language models to improve the effectiveness of police report extraction has become an inevitable trend in the field of police data analysis. This study addresses the characteristics of Chinese police reports and the need to extract key elements by employing large language models specific to the public security domain for entity extraction. Several lightweight (6B/7B) open-source large language models were tested as base models. To enhance model performance, LoRA fine-tuning was employed, combined with data engineering approaches. A zero-shot data augmentation method based on ChatGPT and prompt engineering techniques tailored for police reports are proposed to further improve model performance. The key police report data from a certain city in 2019 were used as a sample for testing. Compared to the base models, prompt engineering improved the F1 score by approximately 3%, while fine-tuning led to an increase of 10–50% in the F1 score. After fine-tuning and comparing different base models, the Baichuan model demonstrated the best overall performance in extracting key elements from police reports. Using the data augmentation method to double the data size resulted in an additional 4% increase in the F1 score, achieving optimal model performance. Compared to the fine-tuned universal information extraction (UIE) large language model, the police report entity extraction model constructed in this study improved the F1 score for each element by approximately 5%, with a 42% improvement in the F1 score for the “organization” element. Finally, ChatGPT was employed to align the extracted entities, resulting in a high-quality entity extraction outcome.

1. Introduction

With the rapid development of digitalization and information technology, the construction of big data for public security in China has become the trend and focus of current public security informatization efforts [1]. Police incident data, as a core component of the public security big data system, play an irreplaceable role in enhancing the quality, efficiency, and response speed of public security work through effective analysis and mining. A police incident is an emergent event, typically a public order disturbance or a crime that endangers public safety, that requires the intervention of public security agencies [2]. By centrally analyzing police incident data and constructing a police incident graph, it is possible to uncover latent relationships between elements, thereby facilitating crime prediction, police force deployment planning, and case-linking studies. This significantly enhances the overall efficacy of public security operations and promotes the development of new police service models. Accurate and comprehensive extraction of key elements from police incidents is a prerequisite for such analysis, providing the necessary conditions and assurances for subsequent detailed study. Suppose we analyze a large volume of police data on theft cases and extract key elements (such as behaviors, items, locations, and organizations); we can then uncover latent patterns. For example, specific areas (such as Xicheng District, Beijing) may show a higher frequency of certain theft behaviors (such as lock-picking) and the theft of specific items (such as valuable electronic devices), and certain organizations (such as criminal gangs) may be more active in those areas. By identifying these patterns, law enforcement agencies can deploy police forces more effectively in high-risk areas, enhancing patrols and surveillance to prevent crime.
These patterns also provide an important basis for crime prediction, helping law enforcement agencies take preventive measures before incidents occur. For example, by predicting the crime risk in a specific area over a future period, the police can pre-arrange their forces, thereby improving the efficiency of police resource utilization.
However, the heterogeneous, multi-source nature of police incident data confronts traditional manual analysis methods with challenges such as time-consuming processes, low efficiency, and difficulty in associative analysis [3]. As the volume of police incidents continues to grow, data utilization capabilities lag behind: current police incident analysis often relies on accumulated experience or apprenticeship, depending on manual analysis and traditional technical means to extract entities from police incidents, such as dictionaries or models like LSTM and BERT [4,5,6,7,8,9,10,11]. While these traditional methods have achieved certain results, the models they use lack upgradability and generalizability, are numerous and varied, and do not meet the requirements for detailed analysis of police incidents, which demands a degree of semantic understanding. Taking the personnel organizations involved in police incidents as an example, relying solely on traditional models to extract keywords such as “teaching, giving lessons” does not fulfill the demands of the final analysis and necessitates further summarization by police personnel; the models currently used for extracting police incident elements cannot directly produce semantically abstracted organizational labels such as “teacher”.
From the perspective of police incident data, the amount of data generated each year is vast, but there is a scarcity of labeled data with tag information; moreover, due to the diversity and complexity of the content types contained in police incident data, the quality of the data is low and the dimensions are high. Building a training set for the extraction of key elements from police incidents requires domain-specific knowledge and operational experience from public security professionals, while also considering data privacy and security. In the context of the aforementioned public security operations, how to improve the effectiveness of key element extraction from police incident information is the focus of this paper.
With the development of emerging technologies, especially large language models (LLMs), which demonstrate exceptional capabilities and powerful generalization in natural language processing tasks, these models have become a significant force driving research and application trends. How to efficiently extract key information from vast and complex data using these technologies has become a hot and challenging topic in the field of public security police incident information processing [12]. Currently, there are many studies on applying large language models to information extraction, achieving results superior to traditional models. Additionally, numerous studies have focused on building specialized domain-specific large language models for particular fields, helping to address unique issues within those domains [13,14,15,16,17,18,19,20,21].
This paper first explores the application of large language model (LLM) technology to entity extraction of police incident elements. Given the sensitivity of police incident data and the limitations of current open-source models, we propose a method suited to the public security sector in China. First, the key elements to be extracted from police incidents are categorized and defined: based on the relevant literature and expert experience, they are divided into four categories, namely key actions, key items, organizations, and locations, and each category is defined and explained in detail. Subsequently, we enhance the application of LLM technology to police incident entity extraction along three main directions, namely prompt engineering, data engineering, and the selection and fine-tuning of open-source base models, while designing or improving methods specific to the public security domain. In the prompt engineering part, a prompt enhancement method is used to select optimal prompts that better leverage the capabilities of large language models. In the data engineering part, we propose a method that uses a powerful proprietary large language model to strengthen local open-source models: a zero-shot police incident data augmentation method based on ChatGPT, which expands the data without revealing private information, thereby enhancing the local model’s effectiveness in extracting element entities. After the elements are extracted, a vectorization approach with ChatGPT is proposed to align the extracted entities. Ultimately, we designed and implemented a method for extracting element entities from police incidents based on open-source large language models, tailored to the public security domain.
The structure of this paper is as follows: Section 1 serves as the introduction, Section 2 reviews relevant research on police incident information extraction and the application and progress of large language models in the field of information extraction; Section 3 details the research methods, including the specific definitions of police incident elements, prompt engineering, data construction and enhancement methods, and the selection and fine-tuning of base models; Section 4 verifies the effectiveness of these methods through experiments and analyzes the experimental results; Section 5 discusses the theoretical and practical significance; and Section 6 concludes the paper and looks forward to future work.

2. Related Work

The related work section is divided into three main parts. First, it addresses the definition of police incident elements and the current research on methods for extracting police incident information. Second, it discusses the development of existing large language models for information extraction. Finally, it examines current methods for enhancing the effectiveness of large language models in the field of information extraction.

2.1. Definition and Extraction Methods of Police Report Elements

The extraction of key elements from police report data is crucial for improving the efficiency and quality of public security work. Firstly, it is necessary to clarify the definition of police report elements, which may vary depending on research perspectives and practical needs. According to smart policing practices, police reports are divided into five elements: person, incident, location, object, and organization. Zhang Lei et al. proposed a more detailed classification, identifying six key features of 110 police reports: clue, element, behavior, scene, organization, and location [2]. The elements of criminal cases are often discussed in terms of five-element theory, seven-element theory, vertical dynamic, and horizontal static elements theory [22]. Weng Dun emphasized five basic elements related to crime: person, incident, object, time, and space, and explored the intrinsic connections and combinations of these elements [23]. Professor Ren Huihua from the Department of Investigation at Southwest University of Political Science and Law proposed that each criminal case has certain elements, mainly including crime time, crime space, crime subject, crime object, and criminal behavior [24]. Hu Xiangyang et al. mentioned that the content of investigative interrogation texts mainly includes seven elements: person, time, place, incident, situation, reason, and object. Additionally, the similarity of cases is calculated based on the similarity of five elements: time, space, tools, methods, and involved objects, with time and space elements often combined into a spatiotemporal element [22].
In the research progress of police information extraction, traditional technical methods have achieved certain results. Li Jing et al. constructed a relationship network of usernames and URLs in cybercrime case files using natural language processing technology [4]. Das et al. utilized graph clustering methods to discover synonyms or phrases and extract relationships between named entities in a crime corpus [5,6]. Wang Mengxuan, Wang Yue et al. optimized key feature extraction for police text classification and named entity recognition by combining improved convolutional recurrent neural networks and BERT-based models [7,8]. Zhang Lei et al. used LSTM and CRF technology and sensitive word recognition to handle the identification of event and behavior characteristics [2]. Chen Yongjun et al. extracted key information such as persons, time, and places from police texts using a combination of “BERT + CNN algorithm + syntactic” rules and established relationship networks [9]. Carnaz et al. used a 5W1H information extraction method and stored the data in the Neo4j graph database to classify and recognize named entities or terms in the field of public security crime [10]. Ning Xinyu et al. improved the accuracy and efficiency of case text segmentation by updating custom dictionaries and stop-word lists [11].

2.2. Development of Large Language Models for Information Extraction

The technology for extracting key elements from police reports has evolved from basic rule-driven methods to complex natural language processing techniques. The rise of generative large language models such as GPT-3 [25] and ChatGPT [26] has introduced innovative possibilities for this field. Large language models have demonstrated excellent performance across various downstream tasks [17], showcasing superior practicality and flexibility over traditional methods in handling structured information extraction tasks [27].
Currently, extensive research has applied large language model technology to information extraction tasks, including the universal information extraction (UIE) model by Lu et al. [13], the instruction-based InstructUIE framework by Wang et al. [14,15], and the unified semantic matching information extraction framework proposed by Luo et al. [16]. These works provide new perspectives and methods for police information extraction. Moreover, studies such as ChatIE [17] and the work by Tang et al. [18] have utilized LLM for zero-shot training and clinical text mining, further proving the effectiveness of large language models in specific domains. However, using internet-connected large language models raises privacy concerns. Since most large language models are accessible only via their APIs, they are unsuitable for domains with confidential data. This issue can be addressed by leveraging open-source models to build domain-specific large language models locally.
Domain-specific large language models are trained or fine-tuned on data from specific fields (e.g., biomedical, legal, or financial), aiming to absorb the expertise, terminology, and style of the target domain to enhance performance in various downstream tasks within that field. These models can be developed through training from scratch or fine-tuning pre-trained large language models, where fine-tuning is a form of transfer learning that provides a good starting point for specific domain applications [19]. For example, BioMedLM, a GPT model trained on PubMed data [20], has achieved advanced results in medical question-answering tasks, while BloombergGPT, trained on financial data [21], has excelled in multiple finance-specific tasks.
For the public security domain, constructing domain-specific large language models can address the challenges of police information extraction. However, simply using open-source large language models is insufficient; fine-tuning and other methods tailored to police information are necessary to enhance the model’s effectiveness.

2.3. Optimization Techniques for Large Language Models in Information Extraction

Supervised fine-tuning (SFT) is a technique that uses labeled data to train or adjust pre-trained large language models for specific tasks or domains. Fine-tuning has five main goals: improving model performance, enhancing data utilization, effectively selecting data, ensuring evaluation integrity, and reducing model bias and toxicity [28]. SFT can be employed to improve model performance in natural language generation, question answering, or text summarization tasks by fine-tuning on input–output pairs that contain the target task or domain [29]. In the process of instruction fine-tuning large language models, the quality of the dataset is more important than its quantity. Therefore, recent research has focused on exploring methods to select high-quality subsets from instruction datasets to reduce training costs and enhance the instruction-following capability of LLMs [30].
The effectiveness of fine-tuning large language models largely depends on the task and data, making the selection of the best fine-tuning method for downstream tasks challenging. Potential factors affecting the performance of model fine-tuning include, but are not limited to, pre-training conditions and fine-tuning conditions (such as the nature of the downstream task, the scale of fine-tuning data, and the fine-tuning method). Intuitively, pre-training determines the quality of the representations and knowledge learned in the pre-trained LLM, while fine-tuning affects the transferability of this knowledge to downstream tasks. This can be seen from the following multiplicative joint scaling law for LLM fine-tuning:
$$\hat{L}(X, D_f) = A \cdot \frac{1}{X^{\alpha}} \cdot \frac{1}{D_f^{\beta}} + E$$
Here, {A, E, α, β} are data-specific parameters to be fitted, D_f denotes the size of the fine-tuning data, and X represents other scaling factors (such as model size or pre-training data size). Experiments have shown that this joint law holds across different settings [31]. Building upon existing research, this paper draws on the multiplicative joint scaling law [31] and focuses primarily on enhancing the model through the combination of base model selection and fine-tuning methods, data engineering, and prompt engineering.
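As a purely numeric illustration of the multiplicative joint scaling law above, the following sketch evaluates the predicted loss for made-up fitted parameters {A, E, α, β}; these values are illustrative placeholders, not parameters fitted in this study.

```python
# Hedged sketch: evaluating the multiplicative joint scaling law
#   L_hat(X, D_f) = A * (1 / X**alpha) * (1 / D_f**beta) + E
# All parameter values below are invented for demonstration only.
def predicted_loss(X, Df, A=1.0, E=0.5, alpha=0.3, beta=0.2):
    return A / (X ** alpha * Df ** beta) + E

# Increasing the fine-tuning data size lowers the predicted loss,
# with diminishing returns as it approaches the irreducible term E.
small = predicted_loss(1e9, 1000)
large = predicted_loss(1e9, 4000)
```

Under this form, the benefit of more fine-tuning data is governed by β, while E bounds the loss from below.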
From the perspective of fine-tuning methods, mainstream parameter-efficient fine-tuning strategies are classified into three categories: adapter, prefix tuning, and low-rank adaptation (LoRA). Adapter techniques insert small neural network modules (adapters) between the pre-trained model layers for fine-tuning, such as AdapterP and Parallel. Prefix tuning techniques add trainable prefix tokens to the model inputs or hidden layers, fine-tuning only these tokens, as seen in P-Tuning and P-Tuning v2. LoRA techniques optimize low-rank matrix parameters to approximate weight matrix updates, enabling efficient fine-tuning and spawning derivatives like AdaLoRA and QLoRA [32]. LoRA [33] has become increasingly popular for specializing pre-trained LLMs to specific domain tasks using minimal training data. LoRA retains the pre-trained model weights and introduces trainable rank decomposition matrices at each layer of the transformer architecture, significantly reducing the number of trainable parameters and allowing for low-cost training of various LoRA models [34].
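The core idea of LoRA described above can be sketched in a few lines: the pre-trained weight W stays frozen, and only a low-rank update B·A (rank r much smaller than the weight dimensions) is trained, scaled by α/r. This is a minimal NumPy illustration of the mechanism, not the paper's training code; the dimensions and hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 4          # illustrative dimensions; r << min(d, k)
W = rng.normal(size=(d, k))  # frozen pre-trained weight, never updated
A = rng.normal(size=(r, k)) * 0.01
B = np.zeros((d, r))         # B starts at zero, so the update is initially a no-op
alpha = 8                    # LoRA scaling factor

def lora_forward(x):
    # Effective weight W + (alpha / r) * B @ A, applied without materializing it.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.normal(size=(2, k))
out = lora_forward(x)
```

Only A and B are trainable here, i.e., r·(d + k) parameters instead of d·k, which is why many LoRA variants can be trained cheaply on top of one base model.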
From the perspective of data engineering, the main data augmentation methods include rule-based EDA algorithms [35], improvements to the standard maximum likelihood estimation paradigm [36], and adversarial learning methods such as AWD, which dilutes the embeddings of strong positive words by mixing in unknown-word embeddings [37]. Research shows that zero-shot and few-shot strategies using conversational agents such as ChatGPT or Llama can improve performance and surpass classical methods. Studies have shown that the only data augmentation approach that consistently improves model performance is generating data similar to the training data (via zero-shot or few-shot prompting), rather than paraphrasing or modifying the current training set [38].
From the perspective of prompt engineering, zero-shot learning and prompt optimization techniques are continuously evolving. Zero-shot learning allows LLMs to perform untrained tasks, relying on the model’s ability to generate coherent text [39,40]. Prompt optimization includes various design methods, such as prompt optimization principles [41], prompt optimization using data gradients [42], adaptive prompt optimization EvoPrompt [43], multi-round dialogue alignment strategies [44], and PromptAgent [45]. These methods have achieved significant progress in improving model performance. Additionally, new techniques such as Chain of Thought (CoT) [46], Tree of Thoughts (ToT) [47], Chain-of-Symbol Prompting (CoS) [48], and Optimization by Prompting (OPRO) [49] have shown potential in enhancing reasoning capabilities and task adaptability [50].
Surveying current research, as society and technology continually evolve, the definition of police incident elements is in a state of ongoing development and refinement. Proposing and extracting elements that are appropriate for current conditions is essential. Currently, the extraction of key elements from police incidents is still confined to traditional methods. As analytical demands increase, the effectiveness of these extractions fails to meet requirements, and traditional methods cannot provide abstract summaries of extracted elements through semantic understanding. At the same time, large language model (LLM) technology is flourishing, achieving new heights across various traditional natural language processing tasks [21]; it has been successfully applied in general information extraction tasks, and domain-specific large language models are increasingly emerging in various fields. However, in the public security domain, there remains a gap in the practical application of large language models to police work and a lack of comprehensive processes for developing large language models tailored for sensitive areas.
To address the issues mentioned above, this paper applies large language model (LLM) technology to the domain of police incident element entity extraction to enhance the extraction effectiveness of key incident elements. Recent research indicates that smaller models, through targeted training, can match the capabilities of large, parameter-rich models [29,50], offering a technical solution that meets the specific requirements of the public security domain while reducing budgetary constraints. Tailoring to the characteristics of the public security field, we opt for local deployment of open-source large language models, and improve existing information extraction techniques. By testing and fine-tuning different base models and combining data engineering and prompt engineering techniques, we enhance the local model’s capabilities for extracting key incident elements. Finally, using ChatGPT, we align the extracted elements. The approach proposed in this paper not only iterates on existing methods of police incident element extraction but also marks the first application of large language model technology in the public security incident domain.

3. Materials and Methods

This paper aims to construct the optimal model for extracting police incident elements through a series of training strategies. The research methods mainly include four key components: prompt engineering, construction and enhancement of the dataset (data engineering), selection and fine-tuning of the base model, and final alignment of the extracted element entities. The overall process is illustrated in Figure 1:

3.1. Prompt Engineering

In the training processes conducted in this paper, as well as when interacting with ChatGPT, the design of prompts is a crucial element. The prompts used in this paper have been improved and enhanced, focusing primarily on the selection of training set prompts, data engineering, and entity alignment. Drawing on the prompt engineering methods in the related work, prompt design starts by clarifying system commands, such as setting the role to “Act as a professional information extraction expert”. This helps define the goals and role of the large language model, and adopting in-context learning (ICL) further improves prompt effectiveness. This paper also establishes structured templates, using multi-level headings such as “#Input” and “#Output Format”, under which input placeholders and output formats are specified. Current prompt engineering primarily targets large-parameter models, with relatively little research on small-parameter models; while employing prompt engineering, this paper therefore also explores applying common prompt design methods from large-parameter models to small-parameter models. During training, the paper uses a training prompt strategy consistent with the query prompts [51] and integrates prompt optimization techniques such as CoT [46], ToT [47], CoS [48], and OPRO [49]. Based on the prompt designs from the literature surveyed in Section 2.3, ten new prompts were designed as candidates for the dataset prompt; these prompts were scored on an organization-extraction test with the base model, and the prompt with the highest accuracy was ultimately selected as the official prompt for the training process. In addition, prompt design is used in data augmentation to better expand the data; finally, in the entity alignment part, prompt engineering and ICL are employed to align entities using ChatGPT.
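A structured extraction prompt in the style described here might be assembled as follows. The exact wording used in the study is not published; this is an illustrative reconstruction built around the role command and the “#Input” / “#Output Format” headings mentioned above.

```python
# Hedged sketch of a structured extraction prompt. The task description and
# element names follow the paper's four-element scheme; the phrasing is assumed.
def build_prompt(report_text: str) -> str:
    return "\n".join([
        "Act as a professional information extraction expert.",
        "#Task",
        "Extract the location, key behaviors, key items, and organizations "
        "from the 110 police report below.",
        "#Input",
        report_text,
        "#Output Format",
        'JSON list, e.g. [{"organization": "..."}]',
    ])

prompt = build_prompt("Mr. Bai reported: ...")
```

Keeping the training prompt and the query prompt identical, as the paper does, means the same `build_prompt` template would be used both to construct fine-tuning examples and at inference time.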

3.2. Dataset Construction and Enhancement

Based on the constructed dataset, this paper proposes a zero-shot data augmentation method using ChatGPT (specifically, the GPT-4 model used in this study). The process involves locally processing and desensitizing police data, extracting key element seed words and police report templates, and then inputting them into ChatGPT combined with prompt engineering. During this process, high-quality prompts are constructed by integrating relevant theories from environmental criminology and police expertise to fully utilize the generative capabilities of the large language model, achieving the rapid and efficient generation of a large amount of high-quality simulated police data.
This method mainly consists of four parts: constructing and expanding the seed word library, creating simulated scenarios, constructing police report templates, and generating and verifying police reports. The overall process is illustrated in Figure 2.

3.2.1. Dataset Construction

The data used in this study were obtained through collaboration with the Beijing Municipal Public Security Bureau, sourced from their 110 emergency response platform. We selected 3000 high-risk and medium-risk Chinese police incident reports from 2019 for focused attention and annotation by domain experts with extensive practical experience in police incident analysis and relevant operational expertise. The existing Chinese data were initially filtered and cleaned by removing low-quality and duplicate entries, and unrelated content was eliminated using regular expressions optimized for Chinese text processing. These domain experts then manually extracted elements from each police incident report, resulting in an initial high-quality Chinese language training set of approximately 2500 reports. From this set, 500 Chinese reports were selected as the test set. It is important to note that all data processing, annotation, and analysis were conducted entirely in Chinese, ensuring consistency and avoiding any language-related complexities throughout the study.
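The filtering step described above (removing low-quality and duplicate entries and stripping unrelated content with Chinese-oriented regular expressions) can be sketched as below. The actual patterns used in the study are not published; the character whitelist here is an illustrative assumption.

```python
import re

# Hedged sketch of the cleaning step: strip characters that are neither CJK,
# ASCII alphanumerics, nor common Chinese punctuation, then drop empty and
# duplicate reports. The whitelist is illustrative, not the paper's pattern.
KEEP = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9，。：；、？！“”‘’（）\s]")

def clean(reports):
    seen, out = set(), []
    for text in reports:
        text = KEEP.sub("", text).strip()
        if text and text not in seen:   # remove empty and duplicate entries
            seen.add(text)
            out.append(text)
    return out
```

In the study itself, this automated pass only pre-filters the 3000 reports; the substantive element labels still come from the domain experts' manual annotation.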
In this study, based on the definitions of police report elements and the existing format of police reports, four key elements were extracted: location, key behaviors, key items, and organizations.
  • Location: Refers to the specific geographic location where the 110 police incident occurred.
  • Key behaviors: Encompasses 1–5 critical behaviors in the police report, including but not limited to actions and methods mentioned in the report. Common behaviors include actions like cutting, deceiving, and stealing, usually composed of action verbs and related nouns. These behaviors are fundamental to understanding the nature and severity of the incident.
  • Key items: Includes important items related to the incident, such as common items in 110 police reports like cars, motorcycles, mobile phones, keys, switchblades, bricks, wallets, and iron rods. Item characteristics are crucial in police reports, especially in cases involving property loss or harm, providing essential information for analysis, statistics, and handling of the incidents.
  • Organizations: Refers to various groups or organizations involved in the incident, either as perpetrators, victims, or related parties. Characteristics of these organizations include the profession or identity of their members, which may include but are not limited to schools, military units, government departments, legal institutions, enterprises, companies, foreigners, journalists, and petition groups. Extracting organization characteristics is crucial for handling police reports, particularly in cases involving special groups, as it provides essential information for formulating response plans, case linking, and analysis.
An example from the training set is as follows:
Police Report Content:
“Mr. Bai reported: In Building X, Unit X, Floor XX of XX Jiayuan, Changping District, a woman is threatening to commit suicide. She is holding the broken bottom of a wine bottle against her neck and appears highly agitated. I am a security guard and am unaware of the reasons behind her suicidal intentions. Currently, she has not sustained any injuries”.
Extracted Elements:
  • Location: Building X, Unit X, Floor XX of XX Jiayuan, Changping District;
  • Key behaviors: Suicide;
  • Key items: Broken bottom of a wine bottle;
  • Organizations: Security guard.
Both the training set and the final output requirements are designed in JSON format. For instance, for the police incident above, the extracted organization element would be output as [{“organization”: “security guard”}]. This output then undergoes further processing to obtain the final extraction results.
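The "further processing" of the model's JSON output could look like the following sketch, which parses a completion and merges the per-element records into a single mapping; the field names and the post-processing details are illustrative assumptions, not the paper's exact pipeline.

```python
import json

# Hedged sketch: parse the model's JSON completion and collapse the list of
# single-field records into a field -> values mapping. Error handling and the
# ChatGPT-based entity alignment happen later in the pipeline.
raw = ('[{"organization": "security guard"}, '
       '{"key_items": "broken bottom of a wine bottle"}]')

def parse_extraction(completion: str) -> dict:
    merged = {}
    for record in json.loads(completion):
        for field, value in record.items():
            merged.setdefault(field, []).append(value)
    return merged

result = parse_extraction(raw)
```

Requiring JSON output makes the extraction machine-checkable: completions that fail to parse can be rejected or regenerated before entity alignment.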
High-quality instructional data are key to enhancing the specific functions of large language models [52]. There are four main methods for constructing an instruction fine-tuning dataset: (1) manually created, (2) model-generated, (3) collected and improved from existing open-source datasets, and (4) a combination of these methods [53]. Based on the existing high-quality initial training set, this study enhances the dataset. In order to increase the data size of the training set while maintaining quality, an innovative zero-shot police data augmentation method based on ChatGPT is proposed, leveraging the capabilities of large language models without compromising privacy.

3.2.2. Seed Word Library

First, the zero-shot data augmentation method based on ChatGPT involves determining key elements and inputting them into the large language model to generate police reports. Thus, a seed word library is constructed by combining the extracted police report elements, with each element word serving as a seed word. ChatGPT is then used to expand the seed library, addressing scenarios not present in the training set and improving the generalization of the local model. This allows ChatGPT to generate simulated police reports containing the seed words. The seed word library is carefully constructed to ensure data robustness and avoid model bias towards any specific data subgroup. The original seed words totaled 53, and after expansion with ChatGPT, they increased to 100. Different categories of seed words were then combined and logically tested, resulting in 200 sets of seed words.
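Combining different categories of seed words into candidate sets, as described above, can be sketched with a simple cross product; the seed words below and the logical-consistency check are illustrative placeholders for the paper's expert-curated library.

```python
from itertools import product

# Hedged sketch: cross-combine seed words from each category into candidate
# sets for report generation. Real seed words are drawn from annotated reports
# and expanded by ChatGPT; each combination is also checked for logical
# consistency before use, which is omitted here.
seeds = {
    "behavior": ["stealing", "lock-picking"],
    "item": ["mobile phone", "wallet"],
    "location": ["business district"],
}

def combine(seeds):
    keys = list(seeds)
    for values in product(*(seeds[k] for k in keys)):
        yield dict(zip(keys, values))

combos = list(combine(seeds))
```

The raw cross product grows quickly, which is presumably why the study filters the combinations down to 200 logically tested sets rather than using every pairing.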

3.2.3. Scenario Simulation

To improve the quality of the simulated police reports, a scenario simulation of police incidents is constructed. Scenario simulation integrates several key theories from environmental criminology, including rational choice theory, routine activity theory, and crime pattern theory, into the prompts. These theories collectively analyze the decision-making process of offenders in specific environments. In designing prompts for the scenario simulator, environmental elements such as the temporal and spatial context of the case and individual behaviors are incorporated, mainly divided into time and space attributes. By simulating different environmental contexts in the prompts, the model is better prepared to generate corresponding police reports.
An example scenario combining specific police incidents and related theories might be:
City Center Business District, daytime: “In a busy city center business district, surrounded by upscale shops and office buildings, with a large number of people during midday. Despite the presence of many pedestrians and shop employees (guardians), the target (such as a white-collar worker or tourist) is distracted. Describe a possible crime scenario involving xxx”.

3.2.4. Police Report Templates

Using the existing training set and expert experience, different police report templates are abstracted, with each template corresponding to a specific extraction category. Utilizing prompt engineering technology, scenario simulators, police report templates, and seed words are combined and input into ChatGPT.
In this process, ChatGPT plays two roles:
“Assistant” GPT: Interacts directly with the user to generate and optimize prompts. It automatically generates multiple high-quality prompts designed to be input into the “Worker” GPT to produce a large amount of police report data.
“Worker” GPT: Takes the prompts generated by the “Assistant” GPT and produces multiple detailed police report cases based on them.
The “Assistant” GPT, under the given background settings, incorporates element seed words and templates into the prompt design, optimizing and generating prompts for data generation. These prompts are then used by the “Worker” GPT to generate a large amount of labeled synthetic data. This method allows for the generation of simulated police report data equivalent to three times the original data volume.
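The Assistant/Worker interaction can be sketched as a two-stage generation loop; `generate_reports` and the injectable `llm` callable below are hypothetical stand-ins for the actual ChatGPT API calls, kept abstract so the control flow can be tested without network access:

```python
def generate_reports(seed_set, scenario, template, llm):
    """Two-role generation sketch: an 'Assistant' call turns seed words,
    a scenario, and a report template into data-generation prompts; a
    'Worker' call expands each prompt into simulated reports. `llm` is
    any callable mapping a prompt string to a response string (e.g. a
    ChatGPT API wrapper)."""
    assistant_prompt = (
        f"Act as a prompt generator. Scenario: {scenario}\n"
        f"Template: {template}\nSeed words: {seed_set}\n"
        "Output one data-generation prompt per line."
    )
    worker_prompts = [p for p in llm(assistant_prompt).splitlines() if p.strip()]
    return [llm(p) for p in worker_prompts]  # one batch of reports per prompt
```

Separating the two roles behind a single callable mirrors the paper's design: the Assistant optimizes prompts, and the Worker only ever sees the optimized prompts.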

3.2.5. Post-Processing

Additionally, this study not only constructs but also expands the training set using ChatGPT. By introducing entity knowledge, police-related information, location, and organization definitions are also integrated into the training data to address the domain knowledge gaps faced by large language models when predicting entity categories. Finally, to enhance the quality and ensure the logical consistency of the generated data, a series of post-processing steps are implemented. This includes constructing GPT modules to identify and eliminate low-quality or duplicate samples generated by ChatGPT. Moreover, the logical consistency of the generated text is examined to ensure the accuracy and reliability of the content.
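A minimal sketch of the duplicate-elimination step might look as follows; the `post_process` helper and its length floor are illustrative assumptions, and the GPT-based logical-consistency check described above is not reproduced here:

```python
import re

def post_process(reports, min_len: int = 20):
    """Post-processing sketch: drop near-duplicate and low-quality
    generated samples. Duplicates are detected on a normalized form
    (lowercased, punctuation and whitespace collapsed); 'low quality'
    here is only a length floor, standing in for the GPT-based
    verifier used in the paper."""
    seen, kept = set(), []
    for text in reports:
        norm = re.sub(r"[\W_]+", " ", text.lower()).strip()
        if len(text) < min_len or norm in seen:
            continue
        seen.add(norm)
        kept.append(text)
    return kept
```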

3.2.6. Summary

In this study, we utilized the “Assistant” GPT to enhance the generation of simulated police reports through prompt optimization techniques and interactive improvements. These enhancements facilitated the production of high-quality synthetic data, which significantly expanded our training dataset. For a detailed description of the prompt optimization process and the interactive improvements implemented, please refer to Appendix A.
Additionally, this paper created a public GPTs module called the “Text Content Verifier” to filter and evaluate the generated police report content. By utilizing the API, the aforementioned process can be automated, thereby expanding the training dataset to four times the size of the initial training set.

3.3. Base Model Selection and Fine-Tuning Methods

In this study, we selected lightweight open-source large language models with 6–7 billion parameters as the base models for subsequent experiments. When choosing the base models, we considered the specific needs of public security work, including confidentiality requirements, which necessitate open-source models that can be deployed locally. For ease of local deployment, we prioritized lightweight 6b/7b models. From the current mainstream open-source model bases and those commonly used in vertical domains, and with reference to reports such as the SuperCLUE benchmark large model evaluation report, we selected the six open-source models listed in Table 1 for testing.
The primary differences between the various open-source models lie in the training data used, while their underlying architectures are fundamentally consistent, all based on the transformer architecture. There are minor differences in details, such as the method of layer normalization or the inclusion of bias layers.
Through fine-tuning experiments on these base models, the best-performing model is selected after testing and comparison.
Regarding fine-tuning methods, this paper chooses the LoRA [33] technique. LoRA is a technology that significantly reduces the number of parameters required during the model fine-tuning process. Its core idea is based on a key finding: the weight differences between the pre-trained model and the fine-tuned model often exhibit a low-rank distribution, meaning these weight differences can be represented as the product of two smaller matrices. In LoRA fine-tuning, the low-rank adjustment of the original weight matrix W is achieved by training two small matrices, A and B, within the target module. LoRA adapts the model by inserting two sequential low-rank matrices to fit the residual weights. The forward computation of the adapted module is as follows:
y = Wx + ΔWx = Wx + BAx, where A ∈ ℝ^(r×k), B ∈ ℝ^(d×r), and r ≪ min(d, k).
The rank (r) of LoRA is usually much smaller than the original model’s dimensions, enabling rapid fine-tuning with only a slight increase in model weights (typically only 0.1% to 1%). This maintains low storage and memory requirements. LoRA is widely applicable, especially in dense projection layers based on the transformer architecture. Through this technique, LoRA can make the fine-tuning process more efficient and cost-effective while preserving the performance of the original model. This efficient fine-tuning method is particularly valuable for large language models that require frequent updates or adjustments for specific tasks.
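The low-rank adaptation described above can be illustrated numerically; the sketch below (with assumed dimensions d = k = 512 and rank r = 8, all values hypothetical) shows the adapted forward computation and the small fraction of parameters that LoRA adds to one projection layer:

```python
import numpy as np

d, k, r = 512, 512, 8                   # layer dims and LoRA rank, r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))         # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                    # trainable up-projection, zero-initialized

def lora_forward(x):
    """y = Wx + BAx: the low-rank product BA fits the residual weights."""
    return W @ x + B @ (A @ x)

x = rng.standard_normal(k)
extra = (A.size + B.size) / W.size      # fraction of added parameters
```

Because B is zero-initialized, the adapted module initially reproduces the frozen layer exactly, and only 2·r/k of the layer's parameters are added (about 3% at this rank; lower ranks or larger layers give the sub-1% figures cited above).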

3.4. Entity Alignment

This study proposes a two-stage entity alignment method specifically designed for aligning entities within the four key elements, processing one type of element at a time while leveraging information from the other elements. The overall process is illustrated in Figure 3.
Stage One: Preliminary Screening
Input: The initial input consists of an entity list from one specific key element among the four. These entities are derived from the extraction results of the large language model and need to be unified and aligned.
Embedding: This step is implemented by using the text-embedding API of ChatGPT. Utilizing ChatGPT’s embedding API allows for effective vectorization while preserving a certain level of semantics [59], which also facilitates the use of a unified model for subsequent detailed screening with ChatGPT. Each entity is input into ChatGPT’s API to generate a vector representation for each entity.
Candidate Entity Set: Based on the embeddings generated by ChatGPT, we calculate the similarity between entities using cosine similarity to measure vector similarity. Through continuous testing, we select entity pairs with a similarity greater than 90% to form the candidate entity set. This method effectively filters out all key elements that need alignment, while significantly reducing the number of entities requiring detailed comparison, thereby enhancing the efficiency of subsequent processing.
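The cosine-similarity screening can be sketched as follows; `candidate_pairs` is a hypothetical helper operating on precomputed embedding vectors (e.g. returned by a text-embedding API):

```python
import numpy as np

def candidate_pairs(embeddings: np.ndarray, threshold: float = 0.9):
    """Stage-one screening sketch: normalize entity embeddings, compute
    pairwise cosine similarity, and keep pairs above the threshold as
    the candidate set for detailed screening."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    return [
        (i, j)
        for i in range(len(sim))
        for j in range(i + 1, len(sim))
        if sim[i, j] > threshold
    ]
```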
Stage Two: Detailed Screening
Context Introduction: For each candidate entity pair, relevant information from the other three key elements is introduced as context to aid the large language model in judgment. This step aims to provide a broader context for each candidate entity pair, ensuring the alignment process considers the entities’ positions and relationships within the overall system.
Prompt Engineering: Based on the combined context, carefully designed prompts are created. These prompts are intended to guide ChatGPT to more accurately understand and handle the entity alignment task, fully utilizing its language comprehension capabilities.
ChatGPT Entity Alignment: The designed prompts and combined context are input into ChatGPT. Utilizing ChatGPT’s strong semantic understanding and reasoning abilities, entities are aligned more precisely. This step can handle complex semantic relationships and subtle differences.
Final Output: The aligned entity results are processed by ChatGPT. To further ensure the accuracy and reliability of the results, a manual review step is introduced. Professionals review the automatically aligned results to verify the correctness of the matches and make corrections if necessary. The result list includes matched entity pairs after deep semantic analysis and context consideration.
The final output is a thoroughly analyzed and aligned entity list. This method makes full use of ChatGPT’s semantic understanding capabilities, not only for generating embeddings but also for the final entity alignment, fully leveraging the power of large language models. It considers the overall relationship of the four key elements without isolating individual entities. The two-stage design ensures both efficiency (preliminary screening) and accuracy (detailed screening), with manual review as the final safeguard to ensure reliability and practicality. This method is particularly suitable for complex entity alignment tasks that require deep semantic understanding and context consideration, effectively handling large-scale, multi-source, heterogeneous entity data.

3.5. Experimental Environment and Evaluation Metrics

The configuration of the experimental environment is shown in Table 2.
In this study, precision (P), recall (R), and F1 score are used to evaluate the results.
To validate the robustness of our results, we performed statistical testing, including confidence intervals and significance testing. All metrics are reported as decimals and computed based on 3 independent trials. We calculated 95% confidence intervals for each metric to provide a measure of precision for our estimates.
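The evaluation metrics and confidence intervals can be computed as in the following sketch; `prf1` and `mean_ci95` are illustrative helpers, and with only 3 trials a t-distribution critical value (4.303 at 95%) would be more exact than the normal approximation used here:

```python
import statistics

def prf1(predicted, gold):
    """Precision, recall, and F1 for one trial of entity extraction,
    comparing predicted entities against a gold set."""
    tp = len(set(predicted) & set(gold))
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def mean_ci95(values):
    """Mean with a 95% CI half-width over independent trials
    (normal approximation, z = 1.96)."""
    m = statistics.mean(values)
    half = 1.96 * statistics.stdev(values) / len(values) ** 0.5
    return m, half
```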

4. Results and Analysis

4.1. Baseline

The current state-of-the-art (SOTA) model for information extraction, UIE [13], was selected as the baseline model. This model was fine-tuned using the same training set, and the results after fine-tuning are shown in Table 3. Since current entity extraction models can only label entities based on the content appearing in the text and cannot perform semantic understanding and summarization, the standards were lowered for the baseline experiment. The test set was designed based only on the content that appeared in the text.
The results indicate that in universal information extraction (UIE), the extraction of location, key items, and key actions is relatively better compared to organizations. This is because the first three categories are more general extraction tasks, for which the base model has been specifically trained, hence the better performance. However, the extraction of organization elements, which are particularly specific to police incidents, is less effective. This is primarily due to a lack of relevant pre-training and a deficiency in semantic understanding capabilities.

4.2. Prompt Selection

In our research, we extensively tested different prompts on various models, such as Baichuan, Qwen, and ChatGLM, to identify the most effective one for generating our training set. These tests were guided by the earlier discussed optimization techniques. The best-performing prompt showed a notable improvement in precision and recall rates, significantly enhancing the quality of the generated data. Detailed results from these tests, including the specific prompts and their performance metrics, are provided in Appendix B.
Improving prompts with prompt engineering techniques has a measurable effect on the base model, but incorporating complex reasoning chains can lead to disorganized outputs that include the model's intermediate thought process, reducing effectiveness. This is presumably due to the relatively weak general capabilities of 6/7b models. This paper designed and optimized ten prompts and selected the most effective ones for the 6/7b base models, demonstrating the efficacy of prompt engineering. Existing research has applied prompt engineering to proprietary models with good results; on 6/7b models it still yields improvements, but less pronounced ones. This indicates that the effectiveness of prompt engineering is somewhat correlated with model parameter size.

4.3. Base Model Fine-Tuning

After constructing the original training set, the base models were tested on the extraction of the four key elements, with the results reported in Tables 4–8; the best results in each table are highlighted in bold.
As shown in Table 4, for the extraction of location elements, the best-performing base model was Qwen. After fine-tuning, the best-performing models were Baichuan and Yi.
As shown in Table 5, for the extraction of key behavior elements, the best-performing base model was Baichuan. After fine-tuning, the best-performing model remained Baichuan.
As shown in Table 6, for the extraction of key item elements, the best-performing base model was ChatGLM. After fine-tuning, the best-performing model was Qwen.
As shown in Table 7, for the extraction of organization elements, the best-performing base model was ChatGLM. After fine-tuning, the best-performing model remained ChatGLM.
As shown in Table 8, considering the overall extraction performance for the aforementioned elements, the best-performing base model was ChatGLM, while after fine-tuning, the best-performing model was Baichuan.
The initial performance of the different base models on the four elements varies. Owing to the lack of police incident data, such as definitions of police incident types and elements, in the pre-training of current open-source large language models, the initial models perform poorly on the three element types other than the highly general location element, with both precision and recall below 60% overall. Using the LoRA fine-tuning method and initial training sets combined with police incident-related knowledge, all six base models improved across the four element categories after fine-tuning, demonstrating the effectiveness of LoRA fine-tuning in enhancing the models' understanding of police incident content.
The performance of the six base models varies, likely due to the different pre-training datasets and training strategies used. After fine-tuning, Baichuan shows the best overall performance; apart from Llama, the other open-source models are not significantly different from Baichuan, with precision and recall differences within 5%. The Llama model, which mainly targets English, performs poorly in Chinese, but after fine-tuning both its precision and recall improved by about 50%, a significant enhancement.
Leveraging large language models for entity extraction effectively utilizes the semantic understanding capabilities of the base models. Taking the extraction of organizational elements as an example, the results can abstract various organization names, even if these names do not appear completely in the police incident text. Moreover, after fine-tuning, the models still retain their general capabilities, providing a foundation for further research and applications.
In summary, the Baichuan model, which demonstrated the best performance after fine-tuning, was selected for further data augmentation experiments.

4.4. Extraction Experiment Based on Data Augmentation

Subsequently, using the zero-shot police data augmentation method based on ChatGPT designed in this study, approximately 6000 simulated police reports were generated. The distribution of the simulated data was observed through dimensionality reduction.
The t-SNE dimensionality reduction visualization of both the generated data and the original data is shown in Figure 4, where triangles represent the generated data and circles represent the original data. This distribution change can explain the observed performance gap between models fine-tuned on synthetic data and those fine-tuned on original data.
The training data were expanded to up to four times the size of the original dataset, with experiments conducted using generated datasets of 1000, 2000, 3000, 4000, 5000, and 6000 entries; together with the original-only baseline, these correspond to 1×, 1.5×, 2×, 2.5×, 3×, 3.5×, and 4× of the original data size. The results are shown in Table 9, with the best results bolded.
The comprehensive performance trends as observed in the curve chart are shown in Figure 5. To further elucidate the learning process and demonstrate the model’s convergence, we have included learning curves depicting the evolution of loss during training in Appendix C. These curves provide valuable insights into the stabilization of the model’s performance with respect to the number of training samples, complementing the analysis presented in Figure 5.
It can be observed from the overall upward trend in the data augmentation curve (Figure 5) that the method for generating data is effective and improves model training. However, there is a noticeable decline in performance when the dataset is initially increased by 1.5 to 2 times the size. This may be due to the vector differences between the generated and original data distributions, which can disrupt the model’s understanding of the original data, leading to a slight decrease in performance. As the dataset is further augmented to 2 to 3 times the original size, the model gradually adapts to the new data distribution, and the performance improves and stabilizes. Beyond this ratio, increasing the data augmentation further does not significantly enhance the model’s performance due to the limited complexity of the generated data compared to real-world scenarios.
For the models using the original prompt and the designed prompt, the ablation study results before and after fine-tuning are presented in Table 10, with the best results bolded. The original prompt and the designed prompt are those designed in the prompt engineering section of this paper (Section 3.1); specific examples of these prompts can be found in Appendix B.
The ablation study results collectively indicate that prompt engineering and fine-tuning significantly impact the extraction outcomes. LoRA fine-tuning has the most substantial effect on improving the base models, followed by prompt design. Using the original prompts on fine-tuned models also enhances their performance, but selecting prompts consistent with the training set optimally leverages the model’s capabilities, thereby validating the effectiveness of the experimental strategies employed in this study.

4.5. Entity Alignment

Using the entity alignment method proposed in this paper with the help of ChatGPT, the entities were first vectorized by calling OpenAI’s embedding API. Elements with a similarity greater than 90% were extracted. Leveraging the large language model and using other elements as context, further alignment was performed.
After the first stage of screening, most of the entities were filtered out. The remaining entities were grouped by similarity for the second stage of screening, which effectively completed the entity alignment task. Finally, the results were verified manually.
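Grouping the surviving candidate pairs for the second stage can be done with a union–find pass, as in this hypothetical sketch, so that the detailed LLM check runs once per group rather than once per pair:

```python
def group_entities(entities, pairs):
    """Merge candidate-aligned entity indices into groups with
    union-find; `pairs` are the (i, j) index pairs that survived the
    stage-one similarity screening."""
    parent = list(range(len(entities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(i)] = find(j)

    groups = {}
    for idx, name in enumerate(entities):
        groups.setdefault(find(idx), []).append(name)
    return sorted(groups.values(), key=len, reverse=True)
```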
The results of the second step are presented below:
’’’ # You will play the role of an Entity Alignment Expert. For each record, I will provide an organizational element from a police incident and other relevant elements from that incident for reference. Your task is to determine whether these organizational elements belong to the same entity and to separately output the elements that belong to the same entity, ensuring clarity and coherence.
## Record Content:
Record 1:
Organizational Element: [Civil Police]
Reference Elements: [Celebration Event]
Record 2:
Organizational Element: [Police]
Reference Elements: [Lost Phone, Gratitude, Phone]
Record 3:
Organizational Element: [Traffic Police]
Reference Elements: [Breakdown, No Power, Collision, Drunk]
Record 4:
Organizational Element: [Public Security]
Reference Elements: [Investment, Incident]
Record 5:
Organizational Element: [Public Security Bureau]
Reference Elements: [Verification, Phone, QQ Number]
## Output Format:
Same Entity Group 1: [Organizational Element 1, Organizational Element 2, …], Reason: xxx
Same Entity Group 2: [Organizational Element 1, Organizational Element 2, …], Reason: xxx
...
## Please align the entities according to the records above and output the results in the format specified.’’’
ChatGPT Output:
’’’ Entity Alignment Results:
Same Entity Group 1: [Civil Police, Police], Reason: Civil Police and Police are interchangeable terms in Mandarin, commonly used to refer to the same type of law enforcement officers.
Same Entity Group 2: [Traffic Police], Reason: Traffic Police specifically refers to traffic officers, which differ in function from general police.
Same Entity Group 3: [Public Security, Public Security Bureau], Reason: Both Public Security and Public Security Bureau refer to public safety agencies, but Public Security Bureau is a more specific institutional name.’’’
This method allows for entity alignment by modifying the prompts in a systematic manner. It fully leverages ChatGPT’s semantic understanding and contextual capabilities without compromising privacy. This approach effectively aligns the extracted key elements.

5. Theoretical and Practical Implications

This study applies large language models to the extraction of police incident elements, which is of significant importance for research on named entity recognition and practical operations in public security.

5.1. Advances in Entity Extraction

This study proposes a new research approach for the field of entity extraction by demonstrating the effectiveness of combining fine-tuned open-source large language models with domain-specific data engineering and prompt optimization. This method shows how general-purpose language models can adapt to specific domains, achieving high performance in entity extraction tasks without requiring large domain-specific datasets. The success of this approach in extracting complex, context-dependent entities (such as organizations) from unstructured police incident reports extends the boundaries of what is possible in automatic entity recognition.

5.2. Innovative Aspects of the Research

The originality of our work lies in several key areas:
Domain-specific adaptation: We developed a new method to adapt open-source large language models to the sensitive public security domain, addressing privacy concerns while leveraging advanced large language model technology.
Zero-shot data augmentation: Our method of using ChatGPT for zero-shot data augmentation in the field of police incidents is innovative, generating high-quality, domain-specific training data without compromising sensitive information.
Multifaceted optimization: The combination of base model selection, LoRA fine-tuning, data engineering, and prompt optimization represents a comprehensive approach to maximizing model performance for specific tasks.
Entity alignment: The two-stage entity alignment process leverages ChatGPT’s semantic understanding capabilities, providing a new method for refining and standardizing extracted entities in complex domains.

5.3. Benefits for Law Enforcement Bodies

The process presented in this article offers several potential benefits for policing and law enforcement agencies:
Efficiency: Automated extraction of key elements from police reports can significantly reduce manual processing time, allowing personnel to focus on analysis and decision-making.
Consistency: The standardized approach to entity extraction ensures uniform processing of reports across different cases and officers.
Privacy-preserving innovation: Our approach demonstrates how law enforcement agencies can leverage advanced NLP technologies while maintaining control over sensitive data.
Scalability: The method’s ability to handle large volumes of reports efficiently makes it suitable for deployment in agencies of various sizes.

5.4. Maturity for Realistic Testing

Our research has shown promising results in a controlled environment, but further testing and validation are needed before deploying it in real-world law enforcement settings. First, the dataset needs to be expanded, using larger and more diverse real police reports from multiple jurisdictions to validate the model’s robustness. Second, the system needs integration testing to ensure seamless incorporation into existing law enforcement database systems and workflows. User acceptance testing is also crucial, as feedback from law enforcement personnel on the system’s usability and its accuracy in extracting information in real scenarios is vital. Additionally, privacy and security audits must be conducted to ensure the system complies with data protection regulations and law enforcement security standards. Finally, comparative trials are necessary, comparing our system with the manual or semi-automated methods currently used by police departments to evaluate its efficiency and accuracy.

6. Conclusions

This study applies large language model technology to the automated extraction of police incident elements, significantly improving efficiency and saving labor compared to manual processing methods. It overcomes the inefficiencies and lack of semantic understanding in traditional information extraction methods when dealing with large-scale, multi-source heterogeneous police incident data. By combining fine-tuning with base model selection, data engineering, and prompt engineering, different open-source base models were tested. The proposed ChatGPT-based zero-shot data augmentation method, along with prompt design and selection methods, resulted in certain improvements in the effectiveness of police incident element extraction. By choosing the optimal strategy for each part, the most effective large language model for police incident element extraction was obtained. Finally, entity alignment of the extracted police incident elements using ChatGPT was proposed. Experimental results show that our method has significant improvements in police incident element extraction compared to the current state-of-the-art (SOTA) general information extraction models.
In conducting this research, we encountered several major challenges. The primary challenge arose from restrictions related to data privacy and sensitivity. Given that police incident data in China are classified as sensitive information, we had to develop and train our models while strictly adhering to protocols for protecting personal information and sensitive data. This is also why we opted for open-source models for local deployment. In the realm of data engineering, we faced potential model bias issues caused by non-uniform data distribution. To mitigate this, we prioritized ensuring diversity and representativeness of police incidents and extracted elements. Despite these challenges, our research demonstrates the potential of applying large language models to police incident information extraction.
However, this study has certain limitations: There are numerous open-source large language models available, but comprehensively testing all of them was beyond the scope of this research. Additionally, large language models are prone to generating “hallucinations”, which can affect the reliability of outputs. Future work will focus on enhancing the adaptability and scalability of the model, developing its application to multilingual police incident data, improving multimodal capabilities and interactivity, optimizing interaction design, strengthening data security and compliance, and exploring cross-domain applications.

Author Contributions

Conceptualization, X.X.; methodology, X.X.; validation, X.X.; formal analysis, X.X.; data curation, X.X.; writing—original draft preparation, X.X.; writing—review and editing, P.C. and X.X.; visualization, X.X.; supervision, P.C.; project administration, P.C.; funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Funding for Discipline Innovation and Talent Introduction Bases in Higher Education Institutions (B20087).

Data Availability Statement

Some or all data, models, or codes that support the findings of this study are available from the corresponding author upon reasonable request. The data are not publicly available due to privacy.

Acknowledgments

We thank the Beijing Municipal Public Security Bureau for authorizing us to use their data for this research. We also thank anonymous reviewers for their comments on improving this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The final optimized prompt utilized in this study is structured as follows:
’’’# Please act as a Prompt Generator:
## Requirements:
### Use this Prompt template as a standard for generating other Prompts: “Act as a public security domain incident generator, using organization: [Organizational Element], key actions: [Action Element], key items: [Item Element] to generate 50 sentences within the specified [environment]. These sentences should mimic the [Incident Template] format but are not limited to this template, and should use a variety of sentence structures. The sentences generated should include and only include the information from the [Seed Elements] for organization, key actions, and key items, avoiding any additional information or explanation.”
# Variables:
1. Combine the seed word file I uploaded to assign different seed words for each Prompt.
The seed words include three categories: “organization”, “key actions”, “key items”, each corresponding to different elements.
2. [Environment] is: In a busy downtown business district surrounded by upscale shops and office buildings, with a huge flow of people at noon. Despite the presence of many pedestrians and store employees (supervisors), targets (such as white-collar workers or tourists) are distracted; describe a potential crime scene of xxx.
## Template:
[Incident Template] is: [Reporter] reports: At [location], [describe the specific situation of the event]. [Time-related information], [possible reasons or motives]. [Reporter’s actions or status].
# Output:
Generate 10 Prompts, each Prompt should be specific enough to guide the generator to produce 20 simulated police reports based on the above conditions.’’’
In each round of data generation, different seed words, scenarios, and police report templates were added to produce a large volume of simulated police report data.
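Slot filling for this template can be sketched as a simple bracketed-placeholder substitution; `fill_prompt` is a hypothetical helper, and unknown slots are left intact for manual inspection:

```python
def fill_prompt(template: str, slots: dict):
    """Fill the bracketed slots of a prompt template
    ([Organizational Element], [environment], ...) for one
    generation round."""
    out = template
    for name, value in slots.items():
        out = out.replace(f"[{name}]", value)
    return out
```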

Appendix B

For prompt selection, different prompts were tested on the Baichuan, Qwen, and ChatGLM models to choose the best-performing one for the training set based on their average scores. The first prompt, set as the original prompt, is as follows:
’’’ Please extract the organizations mentioned in the following content, and only output the extracted organizations. If none are found, output ‘none’.
Extraction text: “ “ ’’’
Subsequently, ten prompts were constructed for testing and selection. These prompts incorporated the optimization techniques mentioned earlier. The final chosen prompt combined structured prompt design techniques and further improvements based on the initial prompt. The average precision and recall rates for the original prompt were 16% and 21%, respectively. The optimal prompt achieved 20% precision and 39% recall. The final chosen prompt is as follows:
’’’ #Role: Organization Extraction Expert
##Profile:
-description: Skilled in extracting and outputting organizations involved in police incident texts.
##Background:
Extract organizational entities from the text based on the definition of an organization. An organization is defined as the perpetrator or victim in an incident, including groups or bodies with specific identity or occupational attributes.
##Example:
Including but not limited to: companies, students, minors, security guards, psychiatric patients, depression sufferers, protective umbrellas, criminal forces, veterans, government departments, legal institutions, business entities, foreigners, journalists, petition-related groups, etc. After identification, output directly, e.g.,: [{‘organization’:‘petition-related group’}].
##Goals:
Identify organizations involved in police incident texts and output them in the specified format.
##Constraints:
The organization should be one involved in the incident described in the text, not necessarily mentioned by name.
Only provide answers in the following JSON format: [{‘organization’:‘organization name’}], outputting the extracted organizations and no other unrelated text.
##Workflow:
1. Input: Read and understand the given text.
2. Think: Identify and extract entities that fit the definition of an organization from the text.
3. Output: Strictly provide answers in the following JSON format: [{‘organization’:‘organization name’}]. If no organizations are extracted, output [{‘organization’:‘none’}].
##Input: “ “ ’’’
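A minimal post-processing sketch for the constrained output format above (our assumption — the paper does not publish its parsing or scoring code). Note that the single-quoted style shown in the prompt, [{‘organization’:‘name’}], is valid Python literal syntax rather than strict JSON, so `ast.literal_eval` is a natural parser; the entity-level precision/recall helper mirrors how candidate prompts were scored.

```python
import ast

def parse_organizations(raw: str) -> list[str]:
    """Parse model output of the form [{'organization':'name'}, ...]."""
    try:
        records = ast.literal_eval(raw.strip())
    except (ValueError, SyntaxError):
        return []  # malformed output is treated as no extraction
    orgs = [r.get("organization") for r in records if isinstance(r, dict)]
    return [o for o in orgs if o and o != "none"]

def precision_recall(predicted: list[str], gold: list[str]) -> tuple[float, float]:
    """Entity-level precision and recall over unique extracted entities."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    return precision, recall
```

For example, `parse_organizations("[{'organization':'petition-related group'}]")` yields `['petition-related group']`, while the sentinel `[{'organization':'none'}]` and any malformed output both yield an empty list.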

Appendix C

We selected the key behavior elements, which showed the most pronounced data-augmentation effect, and constructed their learning curves as a supplement to Figure 4.
Figure A1. Original training loss per step by training session.

References

  1. Zhang, Z.D. Strategic thinking on the construction of public security big data. J. People’s Public Secur. Univ. China Soc. Sci. Ed. 2014, 30, 17–23. [Google Scholar]
  2. Zhang, L.; Wang, P.; He, F. Application of natural language processing in intelligent analysis of police situations. Police Technol. 2021, 10, 39–43. [Google Scholar]
  3. Gu, X.P.; Zhang, H.; Wang, J.W.; Yang, B. Research and design of police big data. Electron. World 2020, 16, 208–209. [Google Scholar]
  4. Li, J.; Luo, W.H.; Lin, H.F. Application of natural language processing technology in network case analysis system. Comput. Eng. Appl. 2012, 48, 216–220. [Google Scholar]
  5. Das, P.; Das, A.K. Graph-based clustering of extracted paraphrases for labelling crime reports. Knowl.-Based Syst. 2019, 179, 55–76. [Google Scholar] [CrossRef]
  6. Das, P.; Das, A.K. Graph-based crime reports clustering using relations extracted from named entities. In Computational Intelligence in Data Mining: Proceedings of the International Conference on ICCIDM 2018; Behera, H., Nayak, J., Naik, B., Eds.; Springer: Berlin/Heidelberg, Germany, 2020; pp. 327–337. [Google Scholar]
  7. Wang, M.X.; Zhang, S.; Wang, Y.; Lei, T.; Du, W. Research and application of improved CRNN model in police text classification. J. Appl. Sci. 2020, 38, 388–400. [Google Scholar]
  8. Wang, Y.; Wang, M.X.; Zhang, S.; Du, W. Named entity recognition of police text based on BERT. Comput. Appl. 2020, 40, 535–540. [Google Scholar]
  9. Chen, Y.J.; Xia, Y.F.; Gao, Y.H.; Guo, R.; Tang, Y. Application of police text data analysis based on NLP technology. Police Technol. 2021, 2, 39–42. [Google Scholar]
  10. Carnaz, G.; Nogueira, V.B.; Antunes, M. A graph database representation of Portuguese criminal-related documents. Informatics 2021, 8, 37. [Google Scholar] [CrossRef]
  11. Ning, X.Y.; Sun, G.D.; Jin, L.; Ding, W.J.; Liang, R.H. Interactive visual analysis method for police data. J. Comput.-Aided Des. Graph. 2023, 35, 1064–1076. [Google Scholar]
  12. Deng, Q.Y.; Xie, S.X.; Zeng, D.J.; Zheng, F.; Cheng, C.; Peng, L.H. An event extraction method for the public security police field. J. Chin. Inf. Process. 2022, 36, 93–101. [Google Scholar]
  13. Lu, Y.J.; Liu, Q.; Dai, D.; Xiao, X.Y.; Lin, H.Y.; Han, X.P.; Sun, L.; Wu, H. Unified Structure Generation for Universal Information Extraction. arXiv 2022, arXiv:2203.12277. [Google Scholar]
  14. Xiao, X.; Wang, Y.; Xu, N.; Wang, Y.; Yang, H.; Wang, M.; Luo, Y.; Wang, L.; Mao, W.; Zeng, D. YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal Information Extraction. arXiv 2024, arXiv:2312.15548. [Google Scholar]
  15. Wang, X.; Zhou, W.; Zu, C.; Xia, H.; Chen, T.; Zhang, Y.; Zheng, R.; Ye, J.; Zhang, Q.; Gui, T.; et al. InstructUIE: Multi-task Instruction Tuning for Unified Information Extraction. arXiv 2023, arXiv:2304.08085. [Google Scholar]
  16. Lou, J.; Lu, Y.; Dai, D.; Jia, W.; Lin, H.; Han, X.; Sun, L.; Wu, H. Universal Information Extraction as Unified Semantic Matching. arXiv 2023, arXiv:2301.03282. [Google Scholar] [CrossRef]
  17. Wei, X.; Cui, X.; Cheng, N.; Wang, X.; Zhang, X.; Huang, S.; Xie, P.; Xu, J.; Chen, Y.; Zhang, M.; et al. Zero-Shot Information Extraction via Chatting with ChatGPT. arXiv 2023, arXiv:2302.10205. [Google Scholar]
  18. Tang, R.; Han, X.; Jiang, X.; Hu, X. Does synthetic data generation of LLMs help clinical text mining? arXiv 2023, arXiv:2303.04360. [Google Scholar]
  19. Ling, C.; Zhao, X.; Lu, J.; Deng, C.; Zheng, C.; Wang, J.; Chowdhury, T.; Li, Y.; Cui, H.; Zhang, X.; et al. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey. arXiv 2023, arXiv:2305.18703. [Google Scholar]
  20. Bolton, E.; Venigalla, A.; Yasunaga, M.; Hall, D.; Xiong, B.; Lee, T.; Daneshjou, R.; Frankle, J.; Liang, P.; Carbin, M.; et al. BioMedLM: A 2.7B parameter language model trained on biomedical text. arXiv 2023, arXiv:2403.18421. [Google Scholar]
  21. Wu, S.; Irsoy, O.; Lu, S.; Dabravolski, V.; Dredze, M.; Gehrmann, S.; Kambadur, P.; Rosenberg, D.; Mann, G. BloombergGPT: A Large Language Model for Finance. arXiv 2023, arXiv:2303.17564. [Google Scholar]
  22. Hu, X.Y.; Zhang, W. Investigation and interrogation text data mining and analysis based on big data. J. People’s Public Secur. Univ. China Soc. Sci. Ed. 2019, 35, 35–43. [Google Scholar]
  23. Weng, D. On the basic structure of criminal cases and investigative practice. J. Police Univ. 1991, 6, 2–5. [Google Scholar]
  24. Ren, H.H. Criminal Case Investigation; Law Press: Beijing, China, 2000; p. 19. [Google Scholar]
  25. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020; Volume 159, pp. 1–25. [Google Scholar]
  26. OpenAI. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  27. Chen, W.; Zhao, L.; Luo, P.; Xu, T.; Zheng, Y.; Chen, E. Heproto: A hierarchical enhancing protonet based on multi-task learning for few-shot named entity recognition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 2023, Birmingham, UK, 21–25 October 2023; pp. 296–305. [Google Scholar]
  28. Albalak, A.; Elazar, Y.; Xie, S.M.; Longpre, S.; Lambert, N.; Wang, X.; Muennighoff, N.; Hou, B.; Pan, L.; Jeong, H.; et al. A Survey on Data Selection for Language Models. arXiv 2024, arXiv:2402.16827. [Google Scholar]
  29. Wang, Z.; Lu, Z.; Jin, B.; Deng, H. MediaGPT: A large language model for Chinese media. arXiv 2023, arXiv:2307.10930. [Google Scholar]
  30. Wang, J.; Zhang, B.; Du, Q.; Zhang, J.; Chu, D. A Survey on Data Selection for LLM Instruction Tuning. arXiv 2024, arXiv:2402.05123. [Google Scholar]
  31. Zhang, B.; Liu, Z.; Cherry, C.; Firat, O. When Scaling Meets LLM Finetuning: The Effect of Data, Model and Finetuning Method. arXiv 2024, arXiv:2402.17193. [Google Scholar]
  32. Qian, L.; Hu, M.D.; Chang, Z.J. A review of research progress in question-answering technology based on large language models. Data Anal. Knowl. Discov. 2023, 10, 1–17. [Google Scholar]
  33. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  34. Chen, L.; Ye, Z.; Wu, Y.; Zhuo, D.; Ceze, L.; Krishnamurthy, A. Punica: Multi-Tenant LoRA Serving. arXiv 2023, arXiv:2310.18547. [Google Scholar]
  35. Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar] [CrossRef]
  36. Yu, P.; Zhang, R.; Zhao, Y.; Zhang, Y.; Li, C.; Chen, C. SDA: Improving Text Generation with Self Data Augmentation. arXiv 2021, arXiv:2101.03236. [Google Scholar]
  37. Chen, J.; Zhang, R.; Luo, Z.; Hu, C.; Mao, Y. Adversarial word dilution as text data augmentation in low-resource regime. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’23/IAAI’23/EAAI’23), Washington, DC, USA, 7–14 February 2023; p. 1417. [Google Scholar] [CrossRef]
  38. Piedboeuf, F.; Langlais, P. Data Augmentation is Dead, Long Live Data Augmentation. arXiv 2024, arXiv:2402.14895. [Google Scholar]
  39. Liu, P.; Yuan, W.; Fu, J.; Jiang, Z.; Hayashi, H.; Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  40. Zhong, Q.; Ding, L.; Liu, J.; Du, B.; Tao, D. Can ChatGPT Understand Too? A Comparative Study on ChatGPT and Fine-tuned BERT. arXiv 2023, arXiv:2302.10198. [Google Scholar]
  41. Bsharat, S.M.; Myrzakhan, A.; Shen, Z. Principled instructions are all you need for questioning LLaMA-1/2, GPT-3.5/4. arXiv 2023, arXiv:2312.16171. [Google Scholar]
  42. Pryzant, R.; Iter, D.; Li, J.; Lee, Y.T.; Zhu, C.; Zeng, M. Automatic Prompt Optimization with “Gradient Descent” and Beam Search. arXiv 2023, arXiv:2305.03495. [Google Scholar]
  43. Guo, Q.; Wang, R.; Guo, J.; Li, B.; Song, K.; Tan, X.; Liu, G.; Bian, J.; Yang, Y. Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers. arXiv 2023, arXiv:2309.08532. [Google Scholar]
  44. Li, C.; Liu, X.; Wang, Y.; Li, D.; Lan, Y.; Shen, C. Dialogue for Prompting: A Policy-Gradient-Based Discrete Prompt Optimization for Few-shot Learning. arXiv 2023, arXiv:2308.07272. [Google Scholar]
  45. Wang, X.; Li, C.; Wang, Z.; Bai, F.; Luo, H.; Zhang, J.; Jojic, N.; Xing, E.P.; Hu, Z. PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization. arXiv 2023, arXiv:2310.16427. [Google Scholar]
  46. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv 2023, arXiv:2201.11903. [Google Scholar]
  47. Yao, S.; Yu, D.; Zhao, J.; Shafran, I.; Griffiths, T.L.; Cao, Y.; Narasimhan, K. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv 2023, arXiv:2305.10601. [Google Scholar]
  48. Hu, H.; Lu, H.; Zhang, H.; Song, Y.-Z.; Lam, W.; Zhang, Y. Chain-of-Symbol Prompting Elicits Planning in Large Language Models. arXiv 2023, arXiv:2305.10276. [Google Scholar]
  49. Yang, C.; Wang, X.; Lu, Y.; Liu, H.; Le, Q.V.; Zhou, D.; Chen, X. Large Language Models as Optimizers. arXiv 2023, arXiv:2309.03409. [Google Scholar]
  50. Wang, H.; Prakash, N.; Hoang, N.K.; Hee, M.S.; Naseem, U.; Lee, R.K.-W. Prompting Large Language Models for Topic Modeling. arXiv 2023, arXiv:2312.09693. [Google Scholar]
  51. Gui, H.; Ye, H.; Yuan, L.; Zhang, N.; Sun, M.; Liang, L.; Chen, H. IEPile: Unearthing Large-Scale Schema-Based Information Extraction Corpus. arXiv 2024, arXiv:2402.14710. [Google Scholar]
  52. Du, Z.; Qian, Y.; Liu, X.; Ding, M.; Qiu, J.; Yang, Z.; Tang, J. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 320–335. [Google Scholar]
  53. Liu, Y.; Cao, J.; Liu, C.; Ding, K.; Jin, L. Datasets for Large Language Models: A Comprehensive Survey. arXiv 2024, arXiv:2402.18041. [Google Scholar]
  54. Baichuan. Baichuan 2: Open Large-Scale Language Models. arXiv 2023, arXiv:2309.10305. [Google Scholar]
  55. Young, A.; Chen, B.; Li, C.; Huang, C.; Zhang, G.; Zhang, G.; Li, H.; Zhu, J.; Chen, J.; Chang, J.; et al. Yi: Open Foundation Models by 01.AI. arXiv 2024, arXiv:2403.04652. [Google Scholar]
  56. Touvron, H.; Martin, L.; Stone, K. Llama 2: Open foundation and fine-tuned chat models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  57. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
  58. Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar]
  59. Kusupati, A.; Bhatt, G.; Rege, A.; Wallingford, M.; Sinha, A.; Ramanujan, V.; Howard-Snyder, W.; Chen, K.; Kakade, S.; Jain, P.; et al. Matryoshka Representation Learning. Adv. Neural Inf. Process. Syst. 2024, 35, 30233–30249. [Google Scholar]
Figure 1. Overall process diagram.
Figure 2. Overall process of zero-shot data augmentation based on ChatGPT.
Figure 3. ChatGPT-based entity alignment.
Figure 4. t-SNE dimensionality reduction distribution.
Figure 5. Comprehensive effectiveness of augmentation multipliers.
Table 1. Open-source foundation models.
| Model | Parameters (Billion) | R&D Company | Releases |
| --- | --- | --- | --- |
| ChatGLM [52] | 6 | Zhipu AI, China | 3 |
| Baichuan [54] | 7 | Baichuan-ai, China | 2 |
| Yi [55] | 6 | 01.AI, China | 1 |
| Llama [56] | 7 | Meta, USA | 2 |
| Mistral [57] | 7 | Mistral AI, France | 1 |
| Qwen [58] | 7 | Alibaba, China | 1.5 |
Table 2. Experimental environment configuration.
| Name | Configuration |
| --- | --- |
| Operating system | Linux-5.4.0-144-generic |
| Programming language | Python 3.10.0 |
| CUDA version | pytorch:2.0.0-cuda11.8 |
| CPU | Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz |
| GPU | NVIDIA A100-PCIE-40 GB |
| RAM | 256 GB |
Table 3. UIE Results.
| Element | P | R | F1 |
| --- | --- | --- | --- |
| Location | 0.940 ± 0.012 | 0.790 ± 0.010 | 0.859 ± 0.002 |
| Key items | 0.817 ± 0.010 | 0.683 ± 0.008 | 0.741 ± 0.006 |
| Key behaviors | 0.937 ± 0.009 | 0.853 ± 0.009 | 0.890 ± 0.001 |
| Organizations | 0.577 ± 0.028 | 0.443 ± 0.026 | 0.503 ± 0.007 |
Table 4. Location element extraction results.
| Model | Original P | Original R | Original F1 | Fine-Tuned P | Fine-Tuned R | Fine-Tuned F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | 0.737 | 0.837 | 0.783 | 0.837 ± 0.002 | 0.837 ± 0.002 | 0.837 ± 0.002 |
| Qwen | 0.722 | 0.884 | 0.795 | 0.878 ± 0.011 | 0.878 ± 0.011 | 0.878 ± 0.011 |
| Yi | 0.582 | 0.605 | 0.593 | 0.905 ± 0.008 | 0.905 ± 0.008 | 0.905 ± 0.008 |
| Llama | 0.357 | 0.509 | 0.419 | 0.891 ± 0.010 | 0.891 ± 0.010 | 0.891 ± 0.010 |
| Baichuan | 0.649 | 0.653 | 0.651 | 0.905 ± 0.008 | 0.905 ± 0.008 | 0.905 ± 0.008 |
| Mistral | 0.517 | 0.728 | 0.605 | 0.857 ± 0.014 | 0.857 ± 0.014 | 0.857 ± 0.014 |
Table 5. Key action element extraction results.
| Model | Original P | Original R | Original F1 | Fine-Tuned P | Fine-Tuned R | Fine-Tuned F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | 0.615 | 0.444 | 0.516 | 0.886 ± 0.001 | 0.920 ± 0.004 | 0.903 ± 0.003 |
| Qwen | 0.672 | 0.573 | 0.618 | 0.898 ± 0.003 | 0.945 ± 0.002 | 0.921 ± 0.001 |
| Yi | 0.619 | 0.646 | 0.632 | 0.854 ± 0.004 | 0.914 ± 0.005 | 0.883 ± 0.004 |
| Llama | 0.367 | 0.204 | 0.263 | 0.903 ± 0.002 | 0.842 ± 0.003 | 0.871 ± 0.002 |
| Baichuan | 0.624 | 0.646 | 0.635 | 0.908 ± 0.001 | 0.937 ± 0.001 | 0.922 ± 0.003 |
| Mistral | 0.450 | 0.379 | 0.411 | 0.913 ± 0.003 | 0.928 ± 0.004 | 0.921 ± 0.002 |
Table 6. Key item element extraction results.
| Model | Original P | Original R | Original F1 | Fine-Tuned P | Fine-Tuned R | Fine-Tuned F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | 0.662 | 0.655 | 0.659 | 0.854 ± 0.003 | 0.832 ± 0.004 | 0.843 ± 0.003 |
| Qwen | 0.595 | 0.662 | 0.627 | 0.887 ± 0.001 | 0.857 ± 0.002 | 0.869 ± 0.001 |
| Yi | 0.464 | 0.590 | 0.520 | 0.857 ± 0.003 | 0.823 ± 0.004 | 0.840 ± 0.004 |
| Llama | 0.129 | 0.116 | 0.122 | 0.837 ± 0.005 | 0.788 ± 0.006 | 0.812 ± 0.007 |
| Baichuan | 0.254 | 0.509 | 0.339 | 0.874 ± 0.002 | 0.853 ± 0.002 | 0.864 ± 0.002 |
| Mistral | 0.413 | 0.468 | 0.438 | 0.869 ± 0.004 | 0.853 ± 0.004 | 0.861 ± 0.004 |
Table 7. Organization element extraction results.
| Model | Original P | Original R | Original F1 | Fine-Tuned P | Fine-Tuned R | Fine-Tuned F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | 0.353 | 0.401 | 0.376 | 0.918 ± 0.002 | 0.860 ± 0.003 | 0.888 ± 0.002 |
| Qwen | 0.171 | 0.252 | 0.204 | 0.890 ± 0.003 | 0.842 ± 0.004 | 0.866 ± 0.003 |
| Yi | 0.231 | 0.257 | 0.243 | 0.898 ± 0.003 | 0.833 ± 0.004 | 0.864 ± 0.003 |
| Llama | 0.146 | 0.365 | 0.209 | 0.508 ± 0.020 | 0.455 ± 0.017 | 0.480 ± 0.020 |
| Baichuan | 0.211 | 0.401 | 0.277 | 0.900 ± 0.002 | 0.851 ± 0.003 | 0.875 ± 0.002 |
| Mistral | 0.276 | 0.401 | 0.327 | 0.894 ± 0.008 | 0.838 ± 0.012 | 0.865 ± 0.011 |
Table 8. Micro-integrated results.
| Model | Original P | Original R | Original F1 | Fine-Tuned P | Fine-Tuned R | Fine-Tuned F1 |
| --- | --- | --- | --- | --- | --- | --- |
| ChatGLM | 0.592 | 0.584 | 0.583 | 0.874 | 0.862 | 0.868 |
| Qwen | 0.540 | 0.593 | 0.561 | 0.887 | 0.880 | 0.883 |
| Yi | 0.474 | 0.525 | 0.497 | 0.879 | 0.869 | 0.873 |
| Llama | 0.250 | 0.298 | 0.253 | 0.785 | 0.744 | 0.764 |
| Baichuan | 0.434 | 0.552 | 0.475 | 0.897 | 0.887 | 0.891 |
| Mistral | 0.414 | 0.494 | 0.445 | 0.883 | 0.869 | 0.876 |
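Each F1 value in the tables is the harmonic mean of the corresponding precision and recall; as a quick consistency check, the fine-tuned ChatGLM micro-averaged row reproduces its tabulated F1:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Fine-tuned ChatGLM micro-averaged P and R from Table 8
p, r = 0.874, 0.862
print(round(f1_score(p, r), 3))  # 0.868, matching the tabulated F1
```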
Table 9. Augmentation results comparison.
| Element | Initial P | Initial R | 1.5× P | 1.5× R | 2× P | 2× R | 2.5× P | 2.5× R | 3× P | 3× R | 3.5× P | 3.5× R |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Location | 0.905 | 0.905 | 0.925 | 0.925 | 0.932 | 0.932 | 0.939 | 0.939 | 0.939 | 0.939 | 0.946 | 0.946 |
| Key behaviors | 0.908 | 0.937 | 0.898 | 0.922 | 0.911 | 0.931 | 0.932 | 0.956 | 0.912 | 0.943 | 0.917 | 0.947 |
| Key items | 0.874 | 0.853 | 0.907 | 0.894 | 0.917 | 0.901 | 0.924 | 0.908 | 0.918 | 0.911 | 0.910 | 0.901 |
| Organizations | 0.900 | 0.851 | 0.899 | 0.838 | 0.903 | 0.886 | 0.927 | 0.914 | 0.926 | 0.921 | 0.925 | 0.919 |
| Micro | 0.897 | 0.887 | 0.907 | 0.895 | 0.916 | 0.912 | 0.930 | 0.929 | 0.924 | 0.929 | 0.924 | 0.928 |
Table 10. Ablation experiment results.
| Experiment Type | Original Prompt P | Original Prompt R | Designed Prompt P | Designed Prompt R |
| --- | --- | --- | --- | --- |
| Original LLM | 0.20800 | 0.30600 | 0.21140 | 0.40090 |
| Fine-tuned LLM | 0.41837 | 0.36937 | 0.92594 | 0.92105 |

Share and Cite

MDPI and ACS Style

Xing, X.; Chen, P. Entity Extraction of Key Elements in 110 Police Reports Based on Large Language Models. Appl. Sci. 2024, 14, 7819. https://doi.org/10.3390/app14177819
