Article

Military Equipment Entity Extraction Based on Large Language Model

Laboratory of Data Science and Information Studies, Beijing Information Science and Technology University, Beijing 100101, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(19), 9063; https://doi.org/10.3390/app14199063
Submission received: 2 August 2024 / Revised: 15 September 2024 / Accepted: 6 October 2024 / Published: 8 October 2024

Abstract

The technology of military equipment entity extraction, a crucial component in constructing military knowledge bases, holds significant research value and theoretical importance for guiding the development and improvement of equipment support forces. In the military domain, equipment entities exhibit nesting, where one entity is contained within another, and abbreviations or codes are frequently used to represent these entities. To address this complexity, this paper proposes an entity extraction method named CoTNER. Initially, a large-scale language model performs chain-of-thought data augmentation on the original dataset, providing additional semantic and contextual information. Subsequently, a small-scale language model is fine-tuned on the augmented dataset to adapt it to the military equipment entity extraction task and to enhance its ability to learn complex rules specific to the domain. Additionally, a high-quality data filtering strategy based on instruction-following difficulty scoring is proposed to address the catastrophic forgetting that may occur during fine-tuning of large language models. The experimental results demonstrate that the proposed military equipment entity extraction method outperforms mainstream traditional deep learning methods, validating the effectiveness of CoTNER.

1. Introduction

In the military domain, equipment entities refer to key components associated with military assets, encompassing specific categories of military weapons and equipment such as ships, aircraft, and other related systems. As modern military technology becomes increasingly informatized and intelligent, the entity names of various types of equipment grow more complex. Relying on manual selection and editing of equipment entity names, entity types, attributes, and other knowledge can no longer meet the needs of relevant personnel for acquiring and analyzing equipment knowledge. Entity extraction, also known as named entity recognition, is a technique for automatically extracting structured information from massive amounts of unstructured text and provides a powerful tool for addressing these challenges. Traditional deep learning models using sequence labeling methods perform well on flat data; however, their generalization ability is hindered in complex data scenarios (e.g., nested entities, discontinuous entities). The advent of large language models (LLMs), with their excellent capabilities in text understanding, generation, and generalization, has significantly advanced related technologies in natural language processing [1,2,3]. However, in the entity extraction task, the performance of LLMs based on the in-context learning paradigm is limited on texts from the complex military equipment domain. For instance, a text might simultaneously contain entities such as “tank” and “armored vehicle”, which are difficult to distinguish with only a few prompt examples. Additionally, nested entities are present in intelligence texts within the military equipment domain. For example, in the text “using a solid rocket engine helps it become a precise, reliable, and effective ballistic missile system”, the phrase “ballistic missile system” should be annotated as a “system” entity, but the model might mistakenly annotate the substring “ballistic missile” as a “missile” entity.
Currently, LLMs can be categorized based on their parameter size into large-scale LLMs with hundreds of billions of parameters (e.g., GPT-3 175B [4]) and small-scale LLMs with tens of billions of parameters (e.g., Llama-2 7B [5]). Large-scale LLMs, with their vast number of parameters, can capture more complex and nuanced semantic information, thus performing better on complex tasks. However, they require more computational resources for training and inference, leading to higher deployment and usage costs in practical applications. In contrast, small-scale LLMs, though less performant, incur lower costs during fine-tuning, making them suitable for applications in the military domain where resources are limited and rapid response is required for equipment entity extraction tasks.
To enhance the performance of LLMs in the entity extraction task within the military equipment domain, this paper proposes a method that leverages the synergy between large-scale and small-scale LLMs. The main contributions of this paper are as follows:
  • Data augmentation with chain of thought: Using GPT-3.5, the Chain of Thought (CoT) [6] method is employed to augment the original data, adding explanations for the entity label results for each piece of data.
  • Fine-tuning with instructions: The augmented data are then used to fine-tune Llama-3 8B with instructions, thereby improving the model’s logical reasoning capabilities and its ability to learn complex rules in the military equipment domain.
  • High-quality data filtering: A high-quality data filtering strategy is proposed to address the issue of catastrophic forgetting. This strategy introduces an Instruction-Following Difficulty (IFD) score, which measures the complexity of entities in the text. Only data with IFD scores above a predefined threshold are selected for model fine-tuning, maximizing the retention of knowledge acquired by the base Llama-3 model during pre-training.
Experiments were conducted on four public entity extraction datasets and a self-constructed military equipment domain dataset. The experimental results show that the proposed CoTNER method outperforms traditional deep learning methods and existing LLM-based methods in terms of performance.

2. Related Works

Entity extraction is a fundamental task in the field of natural language processing, aiming to extract entities with specific meanings from unstructured text. Based on the complexity and types of the extraction objects, entity extraction can be categorized into simple entity extraction and complex entity extraction. Simple entity extraction primarily targets flat entity types, which usually have relatively fixed formats and clear contexts. Complex entity extraction, on the other hand, focuses on extracting entities that are complex, diverse, and highly context-dependent, including nested and discontinuous entities.
The methods for entity extraction have undergone three main stages of development. The first stage involves domain-specific entity extraction, which focuses on fixed expressions within a domain by formulating domain rules based on data characteristics. Collins and Singer [7] proposed a representative method called DL-CoTrain, which first manually constructs decision lists, then performs unsupervised learning on the dataset to build further rule sets, and finally classifies these rule sets. The effectiveness of this method is positively correlated with the quality and quantity of the manually constructed rule sets. Consequently, such methods are highly sensitive to changes in new data and transfer poorly.
With the development of machine learning, simple entity extraction entered the second stage, where entity extraction was treated as a sequence labeling problem. This involved building models and manually extracting features to label each word in the text, thereby achieving entity extraction. Mikheev et al. [8] were the first to apply the maximum entropy model to the entity extraction task, combining rule-based syntax with the maximum entropy model. This method incurred significant overhead due to the need for explicit normalization calculations. Subsequently, Bikel et al. [9] proposed a standard Hidden Markov Model (HMM), which directly modeled transition and emission probabilities, reducing computational overhead by statistically modeling co-occurrence probabilities. However, these methods normalized only locally, which could lead to local optima. The Conditional Random Field (CRF) model, which statistically considers global probabilities and normalizes the entire data distribution, was introduced into entity extraction by McCallum and Li [10], achieving good results. These methods have some transferability but require manual feature selection, placing demands on the feature selection capabilities of relevant personnel.
With the advancement of deep learning technology, entity extraction has entered the third stage, based on deep neural networks. Deep learning-based entity extraction typically employs models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Bidirectional Long Short-Term Memory (BiLSTM) networks, and Transformers, often combined with CRF for joint training to improve extraction accuracy. Collobert et al. [11] were the first to use CNNs to embed each input word and employed CRF to decode each embedding into specific entities. Unlike fully connected neural networks that require fixed input lengths, RNNs consist of recurrent units that can handle variable-length data, making them more suitable for sequential inputs like text data. However, RNNs face issues of gradient vanishing or exploding during backpropagation when dealing with long sequences, which severely affect model training. BiLSTM introduces gating mechanisms to control information flow, allowing the forgetting of redundant information, making it more suitable for long sequence scenarios. Ma et al. [12] used BiLSTM to enhance memory of effective information and employed CRF for globally optimal prediction of label sequences. These methods perform well on flat text but are limited in performance when dealing with nested or discontinuous complex entities. Li et al. [13] reformulated the entity extraction task as a Machine Reading Comprehension (MRC) task, addressing both simple and nested entity recognition tasks uniformly, and further improved MRC model performance using dice loss. Compared to machine learning methods, deep learning-based methods significantly improve the ability to automatically learn language features and capture deep semantics, without requiring manual design of complex features. However, traditional deep learning methods have limited contextual understanding capabilities in entity extraction tasks, and their performance in extracting entities from complex texts heavily depends on the quality and scale of domain-specific training data.
With the development of Large Language Models (LLMs), significant performance improvements have been achieved across various natural language processing tasks [14,15,16]. Compared to traditional deep learning methods, LLMs are pre-trained on large-scale corpora, providing rich contextual understanding and language generation capabilities. They excel at capturing lexical and syntactic information and can be fine-tuned with minimal task-specific data to quickly adapt to specific tasks. However, entity extraction is fundamentally a sequence labeling task, while LLMs are text generation models, resulting in a mismatch of task objectives. Wang et al. [17] addressed this gap by using special tokens surrounding entities in LLM prompts to generate tagged sequences, thereby bridging the gap between sequence labeling and generation. To further enhance LLM performance under data scarcity, Ashok and Lipton [18] prompted the LLM to generate a list of potential entities together with explanations demonstrating their compatibility with the provided entity type definitions, thereby improving performance in few-shot entity extraction. While effective within specific domains, this method generalizes poorly across multiple domains. Zhou et al. [19] utilized ChatGPT to automatically generate large volumes of data, enabling fine-tuning of a base LLM to extract deep semantic knowledge and identify entities across diverse domains. The above methods leverage large-scale LLMs with their extensive parameters and rich training data, demonstrating outstanding performance in capturing complex contextual relationships and better generalization across different domains. However, these models require substantial computational resources and storage space and have slower inference speeds, making them less suitable for military scenarios requiring rapid response. In contrast, small-scale LLMs fine-tuned via supervised learning require fewer computational resources, enabling quick deployment and integration into various applications. Yet, constrained by their parameter size, their performance on complex entity extraction remains limited. Thus, ensuring high performance and efficiency of LLMs in military equipment entity extraction tasks, characterized by complex entity names and diverse terminology, poses a challenging problem.

3. Method

The framework of the proposed CoTNER method is illustrated in Figure 1. The method consists of three main components: the data augmentation with chain of thought module, the high-quality data filtering module, and the fine-tuning with instructions module. Initially, GPT-3.5 is employed to enhance the original data using CoT. Subsequently, the enhanced data are sampled and used to fine-tune the base Llama3 model, resulting in an empirical Llama3 model. Finally, the empirical Llama3 model assesses the instruction-following difficulty of the original data, selecting high-quality data for fine-tuning the base Llama3 model. This process culminates in a domain-specific large model, the military equipment Llama3 model, equipped with knowledge tailored for entity extraction tasks in the military equipment domain.

3.1. Data Augmentation with Chain of Thought

Addressing the complexity and diversity of equipment entity names in military equipment text, this paper employs chain of thought prompts to guide the model in breaking down complex equipment entity extraction problems into step-by-step subproblems and solving them sequentially. The core steps involve constructing prompt templates for the data that need to be annotated and using GPT-3.5 to generate annotated data with chain of thought explanations.
Large-scale LLMs can capture more complex semantic and contextual information; in this paper, we utilize the GPT-3.5 model with 175 billion parameters. By incorporating chain-of-thought-style explanations based on the input text and entity labels into the prompts, interpretative explanations for the extracted results are provided. Below are some examples of annotated data generated by GPT-3.5 with CoT-style explanations:
Text: The Russian Navy's Admiral Gorshkov frigate has arrived near the coast of the United States.
Answer:
[
  {
    "entity": "Admiral Gorshkov frigate",
    "category": "Ship",
    "explanation": "The Admiral Gorshkov frigate is a type of naval vessel belonging to the Russian Navy, thus classified as a ship."
  },
  {
    "entity": "United States",
    "category": "Country",
    "explanation": "The United States is a nation, thus classified as a country."
  }
]
Text: On the 15th, India once again successfully test-fired an Agni-V ballistic missile from Wheeler Island in its eastern region.
Answer:
[
  {
    "entity": "India",
    "category": "Country",
    "explanation": "India is a sovereign country, hence it falls under the Country category."
  },
  {
    "entity": "Agni-V",
    "category": "Missile",
    "explanation": "Agni-V is a type of ballistic missile, hence it falls under the Missile category."
  },
  {
    "entity": "Wheeler Island",
    "category": "Location",
    "explanation": "Wheeler Island is a location situated in the eastern part of India, hence it falls under the Location category."
  }
]
In practice, GPT-3.5 is initially provided with several complete examples as described above, allowing it to generate the explanation parts for other data based on these examples, as illustrated in Figure 2. This approach yields an enhanced dataset for military equipment. Subsequently, this enhanced dataset serves as the original data for further data filtering.
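To make this augmentation step concrete, the following minimal sketch shows one way to request CoT-style explanations from an OpenAI-compatible chat API. It is an illustration rather than the authors' implementation: the model name, prompt wording, few-shot messages, and the helper augment_with_cot are assumptions.

import json
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

SYSTEM_PROMPT = (
    "You annotate military equipment entities. Given a text and its gold entity "
    "labels, add a chain-of-thought style 'explanation' for each entity and return "
    "a JSON list of {entity, category, explanation} objects."
)

FEW_SHOT = [
    {"role": "user", "content": (
        "Text: The Russian Navy's Admiral Gorshkov frigate has arrived near the coast of the United States.\n"
        "Entities: Admiral Gorshkov frigate (Ship); United States (Country)")},
    {"role": "assistant", "content": json.dumps([
        {"entity": "Admiral Gorshkov frigate", "category": "Ship",
         "explanation": "The Admiral Gorshkov frigate is a naval vessel of the Russian Navy, thus classified as a ship."},
        {"entity": "United States", "category": "Country",
         "explanation": "The United States is a nation, thus classified as a country."}], ensure_ascii=False)},
]

def augment_with_cot(text, gold_entities):
    """Ask the large-scale LLM to add CoT explanations to one annotated sample."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                {"role": "user", "content": f"Text: {text}\nEntities: {gold_entities}"}]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo", messages=messages, temperature=0.0)
    return json.loads(response.choices[0].message.content)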

3.2. High-Quality Data Filtering

If the enhanced data are directly used to fine-tune the base Llama3 model, catastrophic forgetting may occur, wherein the model, after extensive fine-tuning, forgets a large amount of knowledge learned during pre-training while acquiring new domain knowledge. This significantly diminishes the model's text comprehension ability. To address this, we propose a high-quality data filtering strategy that selects a small portion of the more complex data from the entire dataset for fine-tuning. High-quality data cover a broader range of equipment entity types and provide better representation. Such data enable the model to learn more comprehensively and reduce over-reliance on specific data distributions. Furthermore, high-quality data contain denser and more accurate information, which helps the model retain critical features during fine-tuning and minimizes interference from noise or irrelevant information. The strategy comprises three core phases. First, a small representative subset of the training data is selected to train the base Llama3 model, resulting in an empirical Llama3 model with basic instruction-following capabilities. Second, we introduce a scoring system to evaluate the instruction-following difficulty of each sample: using the empirical Llama3 model, we calculate instruction-following difficulty (IFD) scores for all the original data. Finally, these scores are sorted in descending order, and the top 10% of the data are selected for fine-tuning the base Llama3 model. A higher IFD score indicates poorer instruction alignment for that sample when the model is trained with it, implying higher data quality.

3.2.1. Training of Empirical Llama3 Model

This phase aims to train the base Llama3 model with a small amount of representative data to equip it with basic instruction-following capabilities. Initially, the base Llama3 model is used to vectorize the training set data. Specifically, suppose the dataset $D_0$ contains $n$ samples $x = (\mathit{Instruction}, \mathit{Text}, \mathit{Answer})$; then, define the string $\mathit{Question} = \mathrm{concat}(\mathit{Instruction}, \mathit{Text})$ as the complete instruction. Each character in $\mathit{Question}$ ($Q$) and $\mathit{Answer}$ ($A$) is represented as $x_i^Q$ and $x_i^A$, respectively. Let $\mathrm{LLM}_\theta$ denote the LLM used and $\theta$ denote its weights. By obtaining the last hidden state of the LLM, the instruction embedding for each sample $j$ is derived:
$$[h_{j,1}^Q, \ldots, h_{j,m}^Q] = \mathrm{LLM}_\theta(x_{j,1}^Q, \ldots, x_{j,m}^Q) \quad (1)$$
$$h_j^Q = \frac{1}{m} \sum_{i=1}^{m} h_{j,i}^Q \quad (2)$$
where $x_{j,i}^Q$ represents the $i$-th character of the $\mathit{Question}$ string for sample $j$, and $h_{j,i}^Q$ denotes its corresponding vector representation.
To ensure diversity in the training data for the base Llama3 model, the vectorized instruction samples are clustered using the KMeans algorithm to generate k clusters. Here, k depends on the number of equipment entity categories in the military domain dataset. From each cluster, 10 samples are randomly selected. These selected samples are then used to train the empirical Llama3 model, endowing it with basic instruction-following capabilities.
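A minimal sketch of this sampling step is given below, assuming the mean-pooled instruction embeddings from Equations (1) and (2) are already available as a NumPy array; the function name and the fixed random seed are illustrative choices, not the authors' code.

import numpy as np
from sklearn.cluster import KMeans

def select_seed_samples(instruction_embeddings, k, per_cluster=10, seed=42):
    """Cluster instruction embeddings into k groups and sample a few items per cluster."""
    kmeans = KMeans(n_clusters=k, random_state=seed, n_init=10)
    labels = kmeans.fit_predict(instruction_embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for cluster_id in range(k):
        members = np.where(labels == cluster_id)[0]
        take = min(per_cluster, len(members))            # 10 samples per cluster, as in the text
        selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.array(selected)                            # indices into the original dataset D_0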

3.2.2. Instruction-Following Difficulty Score

The complexity of instruction samples indicates the complexity of entities within the samples (e.g., containing nested and discontinuous entities). Such samples are crucial for the model to learn. To evaluate the complexity of each instruction sample, we introduce the instruction-following difficulty score. The IFD is determined by comparing the loss of the model’s generated answers with and without the provided instructions as context.
In the context where instructions are provided, the loss for the sample pair (Q, A) is calculated by continuously predicting the next character given the instruction Q and subsequent characters:
$$L_\theta(A \mid Q) = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i^A \mid Q, x_1^A, x_2^A, \ldots, x_{i-1}^A; \theta) \quad (3)$$
where $N$ represents the number of characters in the correct answer $A$. This average cross-entropy loss can be expressed as the conditional answer score $S_\theta(A \mid Q) = L_\theta(A \mid Q)$. This metric evaluates the model's ability to generate the correct answer based on the provided instructions, measuring the degree of consistency between the answer generated by the model according to the instruction and the correct answer.
However, a higher $S_\theta(A \mid Q)$ score does not necessarily imply a more complex instruction. In some cases, it is due to the inherent characteristics of the answer $A$, such as a complex format. Therefore, to more accurately assess the IFD of a given sample, the model directly continues the answer, yielding the direct answer score $S_\theta(A)$:
$$S_\theta(A) = -\frac{1}{N} \sum_{i=1}^{N} \log P(x_i^A \mid x_1^A, \ldots, x_{i-1}^A; \theta) \quad (4)$$
This score measures the large language model’s ability to generate the answer independently without the guidance of its corresponding instruction context. A higher direct answer score indicates that the answer itself has a complex format, which affects the assessment of the instruction’s complexity.
To eliminate the influence of the answer itself, it is necessary to further balance the inherent characteristics of the answer against the complexity of the instruction. This is done by calculating the ratio of $S_\theta(A \mid Q)$ to $S_\theta(A)$ to obtain the Instruction-Following Difficulty score $\mathrm{IFD}_\theta(Q, A)$ for a given $(Q, A)$ pair, as shown in Equation (5):
$$\mathrm{IFD}_\theta(Q, A) = \frac{S_\theta(A \mid Q)}{S_\theta(A)} \quad (5)$$
Using the IFD score to filter the data mitigates the impact of the inherent characteristics of answer $A$ and directly measures the influence of the given instruction content on the model's answer generation. A higher IFD score indicates that the model aligns the answer with the given instruction less well, which implies a more difficult instruction, making the sample more beneficial for model fine-tuning. Under normal circumstances, since $S_\theta(A \mid Q)$ is calculated with the instruction as context, predicting the next character should be easier, so $S_\theta(A \mid Q)$ is always less than $S_\theta(A)$. If the IFD score is greater than 1, $S_\theta(A \mid Q)$ exceeds $S_\theta(A)$, meaning the given instruction did not provide effective contextual information for answer prediction. To further filter out high-quality instruction samples, a threshold of 1 is set. Finally, among the samples with IFD scores greater than the threshold, the top 10% of the sample data are selected as training data for the military equipment Llama3 model.
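The sketch below illustrates how the conditional answer score, direct answer score, and IFD ratio of Equations (3)-(5) could be computed with Hugging Face transformers. It is an assumption-laden illustration: the checkpoint identifier is a plausible Llama-3 name, and subword tokenization stands in for the character-level formulation used above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"   # assumed checkpoint identifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

def answer_loss(prefix, answer):
    """Average cross-entropy of the answer tokens, optionally conditioned on a prefix."""
    answer_ids = tokenizer(answer, return_tensors="pt", add_special_tokens=False).input_ids
    if prefix:
        prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
        input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
        labels = input_ids.clone()
        labels[:, : prefix_ids.shape[1]] = -100          # score only the answer tokens
    else:
        input_ids = answer_ids
        labels = input_ids.clone()
    with torch.no_grad():
        output = model(input_ids, labels=labels)
    return output.loss.item()

def ifd_score(question, answer):
    s_a_given_q = answer_loss(question, answer)          # S_theta(A|Q), Equation (3)
    s_a = answer_loss("", answer)                        # S_theta(A),   Equation (4)
    return s_a_given_q / s_a                             # Equation (5)

Scores computed this way can then be thresholded and ranked to select the training subset described above.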

3.3. Model Instruction Fine-Tuning

To fine-tune the base Llama3 model with the high-quality data selected in the previous section, adapt it to entity extraction tasks in the military equipment domain, and ensure that the model can decompose complex questions during inference, the instruction part of the prompts from Section 3.1 needs to be modified.
Figure 3 illustrates an example of prompt construction for military equipment entity extraction using the Llama3 model. It consists of three main parts: instruction, text, and answer.
The instruction section provides a task description, which can be further divided into three components: the first sentence specifies the model’s task of extracting entities from text; the second sentence confines the entity category, crucial for military equipment entity extraction tasks, as the natural language meaning of “entity” may encompass categories beyond military equipment; the third sentence instructs the LLM to decompose the problem and solve it step by step, providing an explanation of the extraction results.
The text is a segment of text for which the model needs to perform military equipment entity extraction tasks.
The answer designs a template structure for the output of a large language model, enabling it to simulate the reasoning process to determine whether a word should be classified as an entity and which entity type it belongs to. Specifically, this structure requires providing an explanation for each extracted entity and its entity type.
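A hedged sketch of such a prompt is shown below; the exact wording used in the paper is given in Figure 3, so the template text, placeholder names, and section markers here are illustrative assumptions only.

ENTITY_CATEGORIES = ("Ship, Aircraft, Missile, Shells, Radar, System, Submarine, "
                     "Torpedo, Engine, Person, Location, Country")

PROMPT_TEMPLATE = """Instruction:
Extract all named entities from the following text.
Only use these entity categories: {categories}.
Decompose the problem, reason step by step, and give an explanation for each extracted entity.

Text:
{text}

Answer:
{answer}"""

def build_training_prompt(text, answer_json):
    # answer_json is the CoT-augmented JSON answer produced in Section 3.1
    return PROMPT_TEMPLATE.format(categories=ENTITY_CATEGORIES, text=text, answer=answer_json)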
In the fine-tuning phase, this study employs LoRA, a mainstream parameter-efficient tuning technique. Experimental results reported for LoRA indicate that pre-trained models have a low intrinsic dimension, so fine-tuning by learning low-rank parameter updates achieves results comparable to full-parameter fine-tuning [20].
After constructing the training data, LoRA is applied to fine-tune the base Llama3 model, resulting in a military equipment Llama3 model capable of entity extraction in military contexts.
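A minimal LoRA configuration sketch is given below, reusing the hyperparameters from Table 4; the target modules, epoch count, and trainer wiring are assumptions rather than the authors' exact setup.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=8,                                   # LoRA rank (Table 4)
    lora_alpha=32,                         # LoRA alpha (Table 4)
    lora_dropout=0.1,                      # LoRA dropout (Table 4)
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()          # only the low-rank adapters are trainable

training_args = TrainingArguments(
    output_dir="./military-equipment-llama3",
    learning_rate=1e-3,                     # Table 4
    per_device_train_batch_size=4,          # Table 4
    gradient_accumulation_steps=4,          # Table 4
    num_train_epochs=3,                     # assumed; not reported in the paper
)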

4. Experimental Results

To validate the generalization and effectiveness of the entity extraction method proposed in this paper, experiments were conducted using the same approach on both a publicly available dataset in the general domain and a self-constructed dataset in the military equipment domain.

4.1. Datasets

In the context of publicly available datasets, this study selected widely used datasets including CoNLL03, OntoNotes5.0, ACE2004, and ACE2005. Here is a brief description of each dataset:
  • CoNLL03 [21]: An English dataset that includes entities categorized into four types: locations, organizations, person names, and miscellaneous.
  • OntoNotes5.0 [22]: An English dataset comprising text from various sources. It includes entities classified into eighteen categories, which encompass eleven text categories (such as person names, organizations, etc.) and seven numeric categories (such as dates, percentages, etc.).
  • ACE2004 [23] and ACE2005 [24]: Each of the two datasets contains seven entity categories, with annotations indicating their locations in the text and corresponding head words.
The military equipment dataset originates from unstructured text data on ships and aircraft from a global military website, manually annotated to construct a dataset of 3000 high-quality military equipment records. This dataset encompasses twelve entity categories, comprising nine equipment categories and three others. The correspondence between entity categories is shown in Table 1.
The dataset is partitioned into training, validation, and test sets in a 7:2:1 ratio. Following international conventions, submarine and ship categories are treated as parallel categories. Statistical information about this experimental dataset is provided in Table 2.

4.2. Evaluation Metrics

This study uses precision (P), recall (R), and F1 score as evaluation metrics to assess the performance of models and baseline methods. The calculation formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
where TP represents the number of correctly identified military equipment entities, FP represents the number of incorrectly identified military equipment entities, and FN represents the number of military equipment entities that were not identified.
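As a minimal illustration (not the authors' evaluation script), entity-level precision, recall, and F1 can be computed from sets of (entity_text, category) pairs as follows:

def evaluate(pred_entities, gold_entities):
    """Entity-level P/R/F1 from sets of (entity_text, category) pairs."""
    tp = len(pred_entities & gold_entities)      # correctly identified entities
    fp = len(pred_entities - gold_entities)      # spurious predictions
    fn = len(gold_entities - pred_entities)      # missed entities
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1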

4.3. Experimental Setup

The experimental environment settings used during the model experiments are shown in Table 3, and the hyperparameter settings are shown in Table 4.

4.4. Comparative Experimental Analysis

In the comparative experiments on public datasets, traditional deep learning models such as BiLSTM+CRF [12] and BERT-MRC [13] were chosen as benchmark models due to their effective performance in entity extraction tasks. Additionally, several LLM-based entity extraction models were selected as baselines to evaluate the entity extraction effectiveness of the proposed method, including two few-shot learning models GPTNER [17] and PromptNER [18], and one supervised fine-tuning model UniNER [19]. Table 5 presents the entity extraction performance of different models on the public datasets.
The experimental results indicate that among the traditional deep learning models mentioned, BERT-MRC exhibits the best performance in entity extraction. Due to the Transformer architecture’s global attention mechanism, BERT effectively captures long-distance dependencies. Moreover, BERT-MRC addresses entity nesting issues in entity extraction tasks by transforming the task into a machine reading comprehension format. Among all LLM-based models, UniNER demonstrates the best performance. Its strength lies in leveraging the large-scale language model ChatGPT for data annotation, enabling the extraction of deep semantic knowledge that enhances entity recognition accuracy. However, UniNER, using the smaller-scale base model Llama2 for fine-tuning, shows relatively weaker logical inference capabilities. Its performance in recognizing complex entities like nested structures still requires improvement. Additionally, fine-tuning large language models with extensive data may lead to catastrophic forgetting issues, thereby reducing entity extraction performance.
The entity extraction method proposed in this paper shows significant improvements over the best-performing baseline model, UniNER. Specifically, the F1 scores on the OntoNotes 5.0, ACE 2004, and ACE2005 datasets increased by 2.14, 0.81, and 0.43, respectively. These results demonstrate that incorporating chain of thought reasoning into training data using large-scale LLMs effectively enhances the model’s logical reasoning capabilities. Furthermore, fine-tuning smaller LLMs with the augmented data further adapts the model to the entity extraction task, thereby improving entity extraction performance.
In the comparison experiments on the military equipment dataset, the same six models were selected to validate the effectiveness of the CoTNER method in entity extraction tasks within the military equipment domain. Table 6 shows the entity extraction performance of all models on the self-built military equipment dataset.
The experimental results indicate that, with only 3000 pieces of military equipment domain data, models based on large language models (LLMs) significantly outperform those based on traditional deep learning methods in entity extraction performance. This is because traditional deep learning methods start from scratch and rely on domain-specific datasets, making it challenging to achieve good performance with limited data. In contrast, LLMs are pre-trained on extensive general domain corpora, incorporating rich semantic and syntactic knowledge to capture complex contextual relationships. Thus, even with limited domain-specific data, they can still provide robust foundational representations.
Among the three LLM-based models, the proposed CoTNER method achieves the best results, with scores of 88.93, 91.25, and 90.07 on the three evaluation metrics. CoTNER enhances model knowledge by annotating training data with GPT-3.5 and introduces chains of thought into the training data to strengthen the logical reasoning ability of Llama3. Finally, through its data filtering strategy, CoTNER selects a small amount of high-quality training data, helping the model retain previously acquired pre-trained knowledge while learning the new task, thereby improving effectiveness and stability.

4.5. Ablation Experiments

To further validate the contributions and impacts of different modules within the CoTNER method on entity extraction tasks in the military equipment domain, this study conducted ablation experiments. Specifically, it explored the effects of the fine-tuning with instructions module, the data augmentation with the chain of thought module, and the high-quality data filtering module on the model’s performance.
Table 7 presents the results of the model on the ordinary and nested entity extraction test sets. The models in the “Fully supervised” row utilize the instruction fine-tuning learning paradigm, whereas those in the “Few-shot” row do not employ the instruction fine-tuning paradigm. The experimental results are analyzed as follows:
(1) For the few-shot learning paradigm, Llama3 + CoT achieves performance of 85.77 and 69.36 on the CoNLL2003 and ACE2005 datasets, respectively. In contrast, when using the fully supervised learning paradigm, Llama3 + CoT results improve to 92.88 and 87.79 on the CoNLL2003 and ACE2005 datasets, respectively. This indicates that instruction fine-tuning is crucial for entity extraction tasks, particularly evident on the more complex nested ACE2005 dataset. Through fine-tuning on annotated data, the model effectively learns the diversity of complex entities, context relationships, and features, including nested and discontinuous entities, thereby achieving more precise identification of complex entity boundaries.
(2) After integrating the data augmentation with the chain of thought module into the base Llama3 model, Llama3 + CoT shows an increase in F1 score from 89.32 to 92.88 on the CoNLL2003 dataset and from 82.43 to 87.79 on the ACE2005 dataset. The experimental results demonstrate significant performance improvement by incorporating the data augmentation with the chain of thought module in prompts and training data, with more pronounced enhancements observed on nested datasets.
(3) Based on the application of the fine-tuning with instructions module and the data augmentation with the chain of thought module, the model achieved F1 scores of 92.88 and 87.79 on the CoNLL2003 and ACE2005 datasets, respectively, without the high-quality data filtering module. However, after incorporating the high-quality data filtering module, the F1 scores increased to 93.12 and 88.03, respectively. These results demonstrate that the addition of the high-quality data filtering module can further enhance the model’s performance.
Additionally, subsets containing the top 5%, 10%, 15%, and 20% of training data based on IFD scores were constructed, and the base Llama3 model was fine-tuned using these subsets. Performance changes were analyzed and compared. Figure 4 illustrates the variation in model performance scores with increasing data volume, calculated as the ratio of the F1 scores obtained on the subset to those obtained on the entire dataset. Experimental results indicate that using the top 10% of high-quality data based on IFD scores allows the proposed model to surpass the performance achieved when trained on the entire dataset, demonstrating that the high-quality data filtering strategy effectively mitigates the issue of catastrophic forgetting.

5. Conclusions

The task of entity extraction in the military equipment domain is challenged by the complexity and diversity of equipment entity names. This paper proposes the CoTNER entity extraction method tailored for the military equipment domain, leveraging the superior semantic understanding of large-scale LLMs and the lower computational costs of small-scale LLMs. GPT-3.5 (175B) is used for data augmentation via chain of thought to enrich the original data with additional semantic information and enhance the model's logical reasoning capabilities. Subsequently, the enhanced data are used to fine-tune the Llama3-8B model through instruction fine-tuning, yielding a military equipment Llama3 model with domain-specific knowledge for entity extraction tasks. Additionally, the paper introduces a high-quality data filtering strategy to mitigate catastrophic forgetting during LLM fine-tuning, and experiments validate the effectiveness of the approach.

Author Contributions

Conceptualization, X.L. (Xuhong Liu) and Z.Y.; methodology, X.L. (Xuhong Liu) and Z.Y.; software, X.L. (Xuhong Liu) and Z.Y.; validation, X.L. (Xuhong Liu), Z.Y. and L.M.; formal analysis, X.L. (Xuhong Liu); investigation, Z.Y.; resources, T.Y.; data curation, L.M.; writing—original draft preparation, X.L. (Xuhong Liu) and Z.Y.; writing—review and editing, X.L. (Xuhong Liu) and Z.Y.; visualization, T.Y.; supervision, X.L. (Xiulei Liu); project administration, Z.Y. and L.M.; funding acquisition, Z.Y. and X.L. (Xiulei Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by National Key Research and Development Program (Grant No. 2021YFB2600600) and the Research on the Detection of Government-Hired Internet Trolls in Foreign Social Networks (Grant No. KM202411232002).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data for this study include two parts. The first part is publicly available: the CoNLL03 dataset can be found at https://www.clips.uantwerpen.be/conll2003/ner/ (accessed on 3 May 2024), the OntoNotes5.0 dataset at https://catalog.ldc.upenn.edu/LDC2013T19 (accessed on 9 May 2024), the ACE2004 dataset at https://catalog.ldc.upenn.edu/LDC2005T09 (accessed on 4 May 2024), and the ACE2005 dataset at https://catalog.ldc.upenn.edu/LDC2006T06 (accessed on 4 May 2024). The second part is a self-constructed dataset specific to the military domain, which was generated and recorded by ourselves. These data do not originate from a publicly available repository. The authors confirm that the data supporting the findings of this study are available within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chowdhery, A.; Narang, S.; Devlin, J.; Bosma, M.; Mishra, G.; Roberts, A.; Barham, P.; Chung, H.W.; Sutton, C.; Gehrmann, S.; et al. PaLM: Scaling Language Modeling with Pathways. J. Mach. Learn. Res. 2023, 24, 1–113. [Google Scholar]
  2. Moslem, Y.; Haque, R.; Kelleher, J.D.; Way, A. Adaptive Machine Translation with Large Language Models. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; Nurminen, M., Brenner, J., Koponen, M., Latomaa, S., Mikhailov, M., Schierl, F., Ranasinghe, T., Vanmassenhove, E., Vidal, S.A., Aranberri, N., et al., Eds.; European Association for Machine Translation: Tampere, Finland, 2023; pp. 227–237. [Google Scholar]
  3. Li, J.; Wang, J.; Zhang, Z.; Zhao, H. Self-Prompting Large Language Models for Zero-Shot Open-Domain QA. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Duh, K., Gomez, H., Bethard, S., Eds.; Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 296–310. [Google Scholar]
  4. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models Are Few-Shot Learners. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1877–1901. [Google Scholar]
  5. Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288. [Google Scholar]
  6. Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Ichter, B.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
  7. Collins, M.; Singer, Y. Unsupervised Models for Named Entity Classification. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, 21–22 June 1999. [Google Scholar]
  8. Mikheev, A.; Moens, M.; Grover, C. Named Entity Recognition without Gazetteers. In Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, Bergen, Norway, 8–12 June 1999. [Google Scholar] [CrossRef]
  9. Bikel, D.M.; Miller, S.; Schwartz, R.; Weischedel, R. Nymble: A High-Performance Learning Name-Finder. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, DC, USA, 31 March–3 April 1997; Association for Computational Linguistics: Washington, DC, USA, 1997; pp. 194–201. [Google Scholar]
  10. Li, W.; McCallum, A. Rapid Development of Hindi Named Entity Recognition Using Conditional Random Fields and Feature Induction. ACM Trans. Asian Lang. Inf. Process. 2003, 2, 290–294. [Google Scholar] [CrossRef]
  11. Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural Language Processing (Almost) from Scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537. [Google Scholar]
  12. Ma, X.; Hovy, E. End-to-End Sequence Labeling via Bi-Directional LSTM-CNNs-CRF. arXiv 2016, arXiv:1603.01354. [Google Scholar]
  13. Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; Li, J. A Unified MRC Framework for Named Entity Recognition. arXiv 2019, arXiv:1910.11476. [Google Scholar]
  14. Yuan, S.; Yang, D.; Liang, J.; Li, Z.; Liu, J.; Huang, J.; Xiao, Y. Generative Entity Typing with Curriculum Learning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Abu Dhabi, United Arab Emirates, 2022; pp. 3061–3073. [Google Scholar]
  15. Wan, Z.; Cheng, F.; Mao, Z.; Liu, Q.; Song, H.; Li, J.; Kurohashi, S. GPT-RE: In-Context Learning for Relation Extraction Using Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Singapore, 2023; pp. 3534–3547. [Google Scholar]
  16. Wang, X.; Li, S.; Ji, H. Code4Struct: Code Generation for Few-Shot Event Structure Prediction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Toronto, ON, Canada, 2023; pp. 3640–3663. [Google Scholar]
  17. Wang, S.; Sun, X.; Li, X.; Ouyang, R.; Wu, F.; Zhang, T.; Li, J.; Wang, G. GPT-NER: Named Entity Recognition via Large Language Models. arXiv 2023, arXiv:2304.10428. [Google Scholar]
  18. Ashok, D.; Lipton, Z.C. PromptNER: Prompting For Named Entity Recognition. arXiv 2023, arXiv:2305.15444. [Google Scholar]
  19. Zhou, W.; Zhang, S.; Gu, Y.; Chen, M.; Poon, H. UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition. arXiv 2023, arXiv:2308.03279. [Google Scholar]
  20. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  21. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Edmonton, AB, Canada, 31 May—1 June 2003; pp. 142–147. [Google Scholar]
  22. Pradhan, S.; Moschitti, A.; Xue, N.; Ng, H.T.; Björkelund, A.; Uryupina, O.; Zhang, Y.; Zhong, Z. Towards Robust Linguistic Analysis Using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, Sofia, Bulgaria, 8–9 August 2013; Hockenmaier, J., Riedel, S., Eds.; Association for Computational Linguistics: Sofia, Bulgaria, 2013; pp. 143–152. [Google Scholar]
  23. Doddington, G.; Mitchell, A.; Przybocki, M.; Ramshaw, L.; Strassel, S.; Weischedel, R. The Automatic Content Extraction (ACE) Program—Tasks, Data, and Evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), Lisbon, Portugal, 26–28 May 2004; Lino, M.T., Xavier, M.F., Ferreira, F., Costa, R., Silva, R., Eds.; European Language Resources Association (ELRA): Lisbon, Portugal, 2004. [Google Scholar]
  24. Walker, C.; Strassel, S.; Medero, J.; Maeda, K. ACE 2005 Multilingual Training Corpus LDC2006T06; Linguistic Data Consortium: Philadelphia, PA, USA, 2006. [Google Scholar]
Figure 1. A framework diagram of the CoTNER method.
Figure 2. The original dataset enhanced with chain of thought data augmentation using GPT-3.5.
Figure 3. An example of prompt construction.
Figure 4. Trend of model performance with varying amounts of filtered data.
Table 1. Entity category relationship correspondence.
Category | Entity Types
Equipment | Ship, Aircraft, Missile, Shells, Radar, System, Submarine, Torpedo, Engine
Others | Person, Location, Country
Table 2. Statistics of military equipment dataset.
Metric | Value
Number of Sentences | 3000
Average Sentence Length | 56.41
Average Number of Named Entities per Sentence | 2.4
Proportion of Sentences Containing Complex Named Entities | 16.1%
Table 3. Experimental environment setting.
Configuration | Specification
Operating System | CentOS 7.9
Python | 3.8
PyTorch | 1.13
RAM | 32 GB
GPU | 2 × RTX 3090, 24 GB each
Table 4. Experimental hyperparameter settings.
Hyperparameter | Value
Learning Rate | 0.001
Per-Device Train Batch Size | 4
Gradient Accumulation Steps | 4
LoRA Rank (r) | 8
LoRA Alpha | 32
LoRA Dropout | 0.1
ZeRO Stage | 2
Training Set Ratio | 70%
Validation Set Ratio | 20%
Test Set Ratio | 10%
Table 5. A comparison of F1 scores on public datasets.
Model | CoNLL2003 | OntoNotes5.0 | ACE2004 | ACE2005
BiLSTM+CRF | 91.03 | 86.28 | - | -
BERT-MRC | 93.04 | 91.11 | 85.98 | 86.88
GPTNER | 90.91 | 82.2 | 74.2 | 73.59
PromptNER | 83.48 | - | - | -
UniNER | - | 89.1 | 87.5 | 87.6
Ours | 93.12 | 91.24 | 88.31 | 88.03
Table 6. A comparison of experimental results on military equipment datasets.
Model | Precision | Recall | F1
BiLSTM+CRF | 79.77 | 81.77 | 80.76
BERT-MRC | 82.58 | 83.42 | 83.02
GPTNER | 83.29 | 85.11 | 84.20
PromptNER | 84.36 | 83.44 | 83.91
UniNER | 86.21 | 85.08 | 85.64
CoTNER | 88.93 | 91.25 | 90.07
Table 7. A comparison of F1 scores in ablation study results.
Paradigm | Method | CoNLL2003 | OntoNotes5.0 | ACE2004 | ACE2005
Fully supervised | Llama3 | 89.32 | 90.24 | 82.27 | 82.43
Fully supervised | Llama3 + CoT | 92.88 | 89.85 | 87.86 | 87.79
Fully supervised | Llama3 + CoT + Data filtering | 93.12 | 91.24 | 88.31 | 88.03
Few-shot | Llama3 + CoT | 85.77 | 81.01 | 70.16 | 69.36
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
