Article

Research on Co-Interactive Model Based on Knowledge Graph for Intent Detection and Slot Filling

1 School of Management Engineering, Shandong Jianzhu University, Jinan 250101, China
2 School of Science, Shandong Jianzhu University, Jinan 250101, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(2), 547; https://doi.org/10.3390/app15020547
Submission received: 7 December 2024 / Revised: 2 January 2025 / Accepted: 6 January 2025 / Published: 8 January 2025
(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications—2nd Edition)

Abstract

Intent detection and slot filling tasks share common semantic features and are interdependent. The abundance of professional terminology in specific domains, which poses difficulties for entity recognition, subsequently impacts the performance of intent detection. To address this issue, this paper proposes a co-interactive model based on a knowledge graph (CIMKG) for intent detection and slot filling. The CIMKG model comprises three key components: (1) a knowledge graph-based shared encoder module that injects domain-specific expertise to enhance its semantic representation and solve the problem of entity recognition difficulties caused by professional terminology and then encodes short utterances; (2) a co-interactive module that explicitly establishes the relationship between intent detection and slot filling to address the inter-dependency of these processes; (3) two decoders that decode the intent detection and slot filling. The proposed CIMKG model has been validated using question–answer corpora from both the medical and architectural safety fields. The experimental results demonstrate that the proposed CIMKG model outperforms benchmark models.

1. Introduction

Spoken language understanding [1], which serves as the core task in task-oriented dialogue systems [2,3], aims to transform users’ natural language inputs into structured semantic representations [4]. The effectiveness of spoken language understanding directly affects the overall performance of a dialogue system [1]. Intent detection and slot filling, two fundamental subtasks in spoken language understanding, are crucial for comprehending and processing user requests [5]. Intent detection is generally regarded as a sentence-level text classification task that aims to understand the underlying intent of a user’s request and categorize the given utterance into predefined intent categories [6,7]. Slot filling can be viewed as a character-level sequence classification task that aims to extract specific semantic concepts from the utterance [6,7]. The accurate identification of intents and slots can enhance spoken language understanding, which in turn improves the overall performance of the dialogue system.
Intent detection and slot filling share a common foundation in language understanding and are highly dependent on each other. For the utterance “What causes high HbA1c levels in patients?”, the intent category is “Request-Etiology”, and the slot is labelled “B-Check”. An example of intent detection and slot filling for this utterance is shown in Figure 1.
In this example, when the dialogue system recognizes the user’s intent category as “Request-Etiology”, the detected intent guides the system to focus on the “HbA1c”, which is then classified into the named entity category of the “Medical_Examination”. Simultaneously, after the system recognizes “HbA1c” as a specific medical examination, it further detects the user’s intent to inquire about the specific reasons for any abnormalities in the examination results. This example demonstrates that intent detection can facilitate the determination of the slot type, and the information provided by the slot can, in turn, help the dialogue system more accurately detect the user’s intent.
Owing to the strong correlation between intent detection and slot filling, some joint models based on multi-task learning frameworks have been widely proposed to simultaneously tackle both intent detection and slot filling [8]. However, certain existing joint models [9,10,11] exclusively employ intent information in a unidirectional fashion to optimize slot filling while overlooking the potential guiding influence that slot information could exert on intent detection [12]. In response to the shortcomings of existing joint models, Qin et al. [12] introduced a co-interactive transformer specifically designed for joint intent detection and slot filling. This transformer utilizes the bidirectional information flow between intent detection and slot filling to model the relationship between the two tasks.
However, in short utterances lacking domain-specific context, intent detection becomes ambiguous, and domain-specific entities, specific abbreviations, and easily confused entities become difficult to recognize. For example, given a user request such as “What causes high HbA1c levels in patients?”, a general-domain model may not understand the meaning of “HbA1c”, which can easily result in errors in entity recognition. In fact, “HbA1c” is a type of glycated hemoglobin in the medical field that reflects the average blood glucose level over the past 2 to 3 months. In the utterance “What preparations are needed before undergoing a BMP test?”, “BMP” refers to the basic metabolic panel, which belongs to the category of “Medical_Examination”. However, in the utterance “What is the role of BMP in bone repair?”, “BMP” refers to a bone morphogenetic protein, which belongs to the category of “Drug”. Clearly, the meaning of “BMP” differs between the two utterances. Without supplementary medical knowledge, such domain-specific entities and abbreviations are therefore prone to cause errors in entity recognition. When intent detection and slot filling are modelled jointly, incorrect entity recognition results directly affect intent detection, which in turn degrades the overall performance of both tasks. Moreover, in specialized domains, current joint models for intent detection and slot filling [9,10,11] merely associate the relevant information between intents and slots. They do not offer effective solutions to the issues that may arise in the joint process, such as the recognition of domain-specific entities, specific abbreviations, and easily confused entities.
In order to effectively utilize domain-specific knowledge in the joint modelling of intent detection and slot filling, we propose a co-interactive model based on a knowledge graph (CIMKG) for intent detection and slot filling. To enhance the overall performance of intent detection and slot filling tasks in specific domains, the model constructs a representation of the relationship between these two tasks by integrating triples sourced from the knowledge graph into utterances while taking into account the interdependencies that exist between these two tasks.

2. Related Work

2.1. Pipeline Models

Currently, studies on intent detection and slot filling include pipeline and joint models [6]. Pipeline models refer to the separate implementation of intent detection and slot filling without establishing an association between the two tasks [1].
Intent detection, as a text classification task, has evolved from training shallow machine learning models on labelled data, such as the SVM [13], improved KNN model, and MLKNN [14], to training deep learning models including RNN [15], IndRNN-attention [16], and Hybrid ELMo [17]. For the slot filling task, traditional approaches rely on methods such as conditional random fields (CRFs) for predicting slot labels. In recent years, neural networks and their extensions [18,19,20,21,22,23] have demonstrated superior performance in slot filling.
However, pipeline methods model intent detection and slot filling tasks separately, ignoring the correlation between the two tasks, which can lead to error propagation problems [1]. To take advantage of the strong correlation between intent detection and slot filling, joint models have been proposed [24].

2.2. Joint Models

Joint models are methods that simultaneously address intent detection and slot filling tasks. By sharing semantic features and model parameters, the joint models exploit the correlations between the two tasks to enhance overall performance. Joint models can be further classified into implicit, unidirectional interrelated, and bidirectional interrelated models [1,24].
Implicit joint models are modelling approaches that implicitly learn the correlation between intent detection and slot filling through shared parameters [7,12]. Liu et al. [25] proposed a neural network model based on the attention mechanism that uses a bidirectional recurrent neural network (BiRNN) as a shared encoder to simultaneously learn intent detection and slot filling tasks. Chen et al. [6] proposed Joint BERT, which concurrently considers the mutual influence between intent detection and slot filling. By combining the loss functions of the two tasks during the fine-tuning phase, Joint BERT not only saves training time but also improves the performance of intent detection. However, Joint BERT's optimization of the joint loss function suffers from a lack of interpretability [24].
Unidirectional interrelated models refer to models that use intent information to explicitly guide slot filling or use slot information to explicitly guide intent detection [7]. Goo et al. [9] proposed a slot-gating mechanism to establish a connection between intent detection and slot filling tasks. By integrating the output of intent detection as additional information into the decision-making process of slot filling, the model’s understanding of text semantics during entity recognition is enhanced, leading to improved performance on the slot filling task [11,26]. Qin et al. [11] proposed a Stack Propagation framework that uses intent information to guide the slot filling task to better capture semantic knowledge. Dao et al. [26] proposed JointIDSF for Vietnamese based on Joint BERT [6], which introduces an intent-slot attention layer that explicitly incorporates intent information into slot filling through “soft” intent label embedding. However, the above models only consider the information transfer from intent detection to slot filling and do not consider the information transfer in the opposite direction, failing to make full use of the bidirectional interactive information between the intent and slot, resulting in the limited performance of the model for intent detection and slot filling.
In contrast, bidirectional interrelated models, by considering information transfer in both directions, can better capture the interdependencies between two tasks, thus improving the overall performance of intent detection and slot filling. Cao et al. [27] proposed a Character–Word Information Interaction Framework (CWIIF) specifically for natural language understanding in the Chinese medical dialogue domain. Qin et al. [12] proposed a co-interactive module for joint intent detection and slot filling to address the limitations of a single information flow. The core of this model is the interaction between intent representations and slot representations within a co-interactive attention layer, resulting in intent-aware slot representations and slot-aware intent representations. The co-interactive module establishes a bidirectional connection between the two tasks, thereby enhancing the performance of intent detection and slot filling.
Joint models fully exploit the strong correlations between intent detection and slot filling, which offers superior performance compared to pipeline methods. However, in specific domains, a lack of domain-specific knowledge often leads to errors in entity recognition, particularly for domain entities or abbreviations. Coupled with the influence of joint models, these errors in entity recognition can directly affect intent detection, subsequently causing a decline in the overall performance of both intent detection and slot filling.

2.3. Knowledge Graph-Based Models

Combining knowledge graphs with intent detection and slot filling allows models to learn domain-specific knowledge in advance and to rely on this prior knowledge to improve training effectiveness. Li et al. [28] addressed the difficulty of multimodal named entity recognition in specific domains by proposing the Knowledge Graph Multimodal Named Entity Recognition (KGMNER) model, which improves the identification of domain-specific entities, abbreviations, and easily confused entities in short texts with missing context. Liu et al. [29] proposed the K-BERT model to address the problem of pretrained language models (such as BERT) underperforming in knowledge-driven tasks in specific domains. K-BERT enhances the performance of pretrained language models in specific domains by injecting triples from a knowledge graph. Additionally, it employs a soft position and visible matrix to overcome problems related to knowledge noise. Although the above models incorporate domain knowledge from the knowledge graph, they primarily model the knowledge graph separately from intent detection and slot filling, ignoring the interplay between the intent and slot.
After comprehensively considering the research findings from the aforementioned literature, we adopted the approach from reference [29] to align and inject entities from the utterance with those in the knowledge graph for intent detection and slot filling in the specific domain. On this basis, we employed the methodology from reference [12] to jointly model intent detection and slot filling. Specifically, we effectively leveraged domain-specific professional knowledge to address the challenge of accurately identifying specific abbreviations or easily confused entities by incorporating triples from the knowledge graph into utterances. Simultaneously, we capture the bidirectional information flow between intent and slot information to enhance the overall performance of intent detection and slot filling in a specific domain by modelling the relationship between the two tasks through a joint model.

3. CIMKG

The CIMKG consists of three main components: a knowledge graph-based shared encoder module; a co-interactive module that explicitly establishes a bidirectional connection between intent detection and slot filling; and two decoders specifically designed for the two tasks. The overall framework of the CIMKG is shown in Figure 2. In Figure 2, the red lowercase letters (a–f) denote the readable sequence, whereas the black numerals (0–9) represent the actual sequence.
CIMKG employs K-BERT as its encoding layer. K-BERT can process a sentence tree containing triples imported from a knowledge graph, and its soft position and visible matrix techniques ensure that the injected knowledge is integrated effectively. The co-interactive module from reference [12] is then incorporated into CIMKG to enable the bidirectional flow of intent and slot information, allowing intent information to guide slot filling while slot information simultaneously guides intent detection. The model was initialized with BERT pretrained word embeddings and optimized using the Adam algorithm. It was evaluated on standard datasets, and the parameters or architecture design were adjusted based on the evaluation results to further optimize performance.

3.1. Knowledge Graph-Based Shared Encoder Module

The knowledge graph-based shared encoder module mainly consists of four key components: the knowledge layer, embedding layer, seeing layer, and mask transformer. First, the knowledge layer imports ontology labels related to the input utterances and transforms them into a knowledge-rich sentence tree. Second, this sentence tree is processed in parallel, where it is converted into character-level embedding vectors at the embedding layer to capture deep semantic relationships, and transformed into a matrix representation at the seeing layer to control the visibility range of each token. Finally, the mask transformer integrates the embedding representations and visibility information, effectively capturing long-distance dependencies and complex patterns. The workflow of the shared encoder based on knowledge graphs is shown in Figure 3.
The primary function of this module is to integrate semantic information, such as entities and relationships, from the domain knowledge graph into the encoding process, enabling the model to grasp the meaning expressed by short texts more accurately when processing utterances in specific domains.

3.1.1. Knowledge Layer

In the knowledge layer, entities from the domain-specific knowledge graph are linked to entities in the input utterance to construct a sentence tree that is rich in semantic information. This process primarily consists of two steps: knowledge query (K-query) and knowledge injection (K-injection). For an input sentence $s = \{w_1, w_2, \ldots, w_n\}$ and a given domain-specific knowledge graph $K$, the output sentence tree is represented as $T = \{w_1, w_2, \ldots, w_i\{(r_{i1}, w_{i1}), \ldots, (r_{ik}, w_{ik})\}, \ldots, w_n\}$ $(1 \le i \le n)$ after undergoing knowledge query and knowledge injection. Here, $n$ denotes the number of tokens in the query, $w_i$ represents an entity, $w_{ik}$ represents an ontology label, and $r_{ik}$ represents the relationship between the intent and slot.
The knowledge query process retrieves from the domain-specific knowledge graph $K$ the knowledge triplets that are related to all named entities in utterance $s$. This process can be formally described as shown in (1):
$$E = \mathrm{K\_Query}(s, K) \tag{1}$$
where $E = \{(w_i, r_{i1}, w_{i1}), \ldots, (w_i, r_{ik}, w_{ik})\}$ is a collection of corresponding triples related to the entities in utterance $s$.
In the knowledge-injection process, the acquired knowledge triplets $E$ are injected into the corresponding positions of the utterance $s$ to obtain a sentence tree $T$. This process can be described as shown in (2):
$$T = \mathrm{K\_Inject}(s, E) \tag{2}$$
Using “What causes high HbA1c levels in patients?” as an example, the sentence tree structure formed after importing knowledge graph triplets is shown in Figure 4.
Here, [CLS] is a classification token for intent detection, the red numbers represent the readable order, and the black numbers represent the actual order. The knowledge sentence after the injection of the knowledge graph is as follows.
What causes high HbA1c is glycated hemoglobin for glucose management levels in patients?
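The two steps of the knowledge layer can be illustrated with a minimal sketch. It is not the authors' implementation: the toy graph `TOY_KG`, the helper names `k_query`, `k_inject`, and `flatten`, and the whole-word tokenization are assumptions made for illustration, and the two triples are chosen only so that the output reproduces the knowledge sentence above.

```python
# Toy knowledge graph: entity -> list of (relation, ontology label) pairs.
# These entries are illustrative and are not taken from MedicalKG.
TOY_KG = {
    "HbA1c": [("is", "glycated hemoglobin"), ("for", "glucose management")],
}


def k_query(tokens, kg):
    """K-query: collect the triples (entity, relation, label) for every token found in the graph."""
    return [(tok, rel, label) for tok in tokens for rel, label in kg.get(tok, [])]


def k_inject(tokens, triples):
    """K-injection: attach each triple to its entity, giving a list of (token, branches) pairs."""
    branches = {}
    for entity, rel, label in triples:
        branches.setdefault(entity, []).append((rel, label))
    return [(tok, branches.get(tok, [])) for tok in tokens]


def flatten(tree):
    """Serialize the sentence tree: each trunk token is followed by its branches."""
    out = []
    for tok, tok_branches in tree:
        out.append(tok)
        for rel, label in tok_branches:
            out.extend([rel, label])
    return " ".join(out)


tokens = ["What", "causes", "high", "HbA1c", "levels", "in", "patients", "?"]
triples = k_query(tokens, TOY_KG)
tree = k_inject(tokens, triples)
print(flatten(tree))
```

Keeping the branches attached to their entity, rather than flattening immediately, preserves the information that the later layers need to decide which tokens may see each other.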

3.1.2. Embedding Layer

The embedding layer converts the sentence tree into the embedding representation required by the mask transformer. Although the injection of ontology labels enriches the semantics of the sentence tree, it simultaneously alters the structure of the input utterance, rendering it unreadable. The positional embeddings in the original BERT primarily provide the model with the ability to recognize the positions of individual tokens within the input sequence, ensuring that the model can correctly understand the sequential information of the tokens in the utterance. However, these positional embeddings do not address the problem of sentence tree unreadability. To address the problem of sentence structure changes caused by the introduction of external knowledge, the K-BERT model employs a soft position and a visible matrix.
Soft position embedding works by assigning each token a dynamically adjusted position index, which needs to be set according to the actual semantics and is reflected through the self-attention scores in the mask transformer to indicate the relative positional relationships between tokens, rather than their absolute physical positions. Considering the sentence tree in Figure 3 as an example, the vector representations obtained after the text embedding layer are shown in Figure 5. In Figure 5, the “+” sign is used to denote the summation of different types of embedding vectors and A is used as an identifier to mark sentence segments.
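The difference between the readable (soft) order and the actual (hard) order can be sketched as follows. This is a minimal illustration that follows the K-BERT convention of continuing a branch from the index of its entity; the tree layout, the function name `soft_positions`, and the treatment of multi-word labels as single tokens are assumptions rather than details given in the text.

```python
def soft_positions(tree):
    """Return (tokens, soft_index, hard_index) for a sentence tree.

    tree is a list of (trunk_token, branches); each branch is a list of tokens
    hanging off that trunk token.
    """
    tokens, soft = [], []
    trunk_pos = 0
    for tok, branches in tree:
        tokens.append(tok)
        soft.append(trunk_pos)
        for branch in branches:
            # A branch continues counting from its entity's soft position.
            for offset, branch_tok in enumerate(branch, start=1):
                tokens.append(branch_tok)
                soft.append(trunk_pos + offset)
        trunk_pos += 1
    hard = list(range(len(tokens)))  # position in the flattened input
    return tokens, soft, hard


tree = [
    ("[CLS]", []), ("What", []), ("causes", []), ("high", []),
    ("HbA1c", [["is", "glycated hemoglobin"], ["for", "glucose management"]]),
    ("levels", []), ("in", []), ("patients", []), ("?", []),
]
tokens, soft, hard = soft_positions(tree)
for tok, s, h in zip(tokens, soft, hard):
    print(f"{tok:22s} soft={s:2d} hard={h:2d}")
```

In this sketch “levels” keeps soft position 5, directly after “HbA1c” at 4, even though it sits at hard position 9 in the flattened sequence, so the original word order of the utterance remains readable.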

3.1.3. Seeing Layer

Injecting the triplets from the knowledge graph into the original utterance may lead to distortions in the semantics of the utterance without proper constraints. The seeing layer processes the structured information within the sentence tree and generates a visible matrix, which is used to control the visibility scope of each token in the sentence tree. Taking the sentence tree in Figure 4 as an example, “glycated hemoglobin” is only used to explain “HbA1c” and is not related to “levels”. Therefore, “glycated hemoglobin” is visible to “HbA1c” but invisible to “levels”. The entity visibility matrix corresponding to the sentence tree in Figure 4 is shown in Figure 6.
The entity visible matrix M can be defined as shown in (3):
$$M_{ij} = \begin{cases} 0, & w_i \ominus w_j \\ -\infty, & w_i \oslash w_j \end{cases} \tag{3}$$
where $w_i \ominus w_j$ indicates that $w_i$ is related to $w_j$ and $w_i \oslash w_j$ indicates that $w_i$ is unrelated to $w_j$. In this entity visible matrix, the element in row 6 and column 4 is 0, indicating that “glycated hemoglobin” is related to “HbA1c” in the sentence tree. The element in row 6 and column 9 is $-\infty$, indicating that “glycated hemoglobin” is not related to “levels” in the sentence tree.
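A visible matrix of this kind can be built directly from the sentence tree, as the following sketch shows. It assumes the usual K-BERT rule (trunk tokens see each other, and a branch token sees only its own branch and the entity it hangs from); the tree layout and the name `visible_matrix` are hypothetical, and the two triples are the illustrative ones implied by the knowledge sentence.

```python
NEG_INF = float("-inf")


def visible_matrix(tree):
    """Build the visible matrix M for a sentence tree.

    tree is a list of (trunk_token, branches); each branch is a list of tokens.
    M[i][j] is 0 when token i may attend to token j and -inf otherwise.
    """
    tokens, branch_id, anchor = [], [], []
    next_branch = 1
    for tok, branches in tree:
        entity_idx = len(tokens)
        tokens.append(tok)
        branch_id.append(0)      # 0 marks the trunk (the original utterance)
        anchor.append(None)
        for branch in branches:
            for branch_tok in branch:
                tokens.append(branch_tok)
                branch_id.append(next_branch)
                anchor.append(entity_idx)  # entity this branch explains
            next_branch += 1

    def related(i, j):
        if branch_id[i] == branch_id[j]:
            return True          # both on the trunk, or both in the same branch
        return anchor[i] == j or anchor[j] == i

    n = len(tokens)
    matrix = [[0.0 if related(i, j) else NEG_INF for j in range(n)] for i in range(n)]
    return tokens, matrix


tree = [
    ("[CLS]", []), ("What", []), ("causes", []), ("high", []),
    ("HbA1c", [["is", "glycated hemoglobin"], ["for", "glucose management"]]),
    ("levels", []), ("in", []), ("patients", []), ("?", []),
]
tokens, M = visible_matrix(tree)
print(tokens[6], "->", tokens[4], ":", M[6][4])
print(tokens[6], "->", tokens[9], ":", M[6][9])
```

With this indexing, “glycated hemoglobin” sits at position 6, “HbA1c” at 4, and “levels” at 9, which matches the rows and columns discussed in the text.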

3.1.4. Mask Transformer Layer

The mask transformer is primarily responsible for encoding the embedding representations of the sentence tree and the entity visible matrix M . In contrast, the transformer layer in the standard BERT model can only encode the embedding representations and cannot process the visible matrix. Therefore, by modifying the relevance calculation function within the self-attention mechanism of the original transformer layer, it is possible to restrict the visible range of each token based on the entity visible matrix M , thereby preventing semantic information from being altered by the injection of external knowledge. Specifically, the relevance scores are combined with the entity–relationship visible matrix M , and the final relevance degrees are calculated using the softmax function.
The original self-attention parameters are kept unchanged, as shown in (4).
$$Q^{i+1}, K^{i+1}, V^{i+1} = h^{i} W_q,\; h^{i} W_k,\; h^{i} W_v \tag{4}$$
where $W_q$, $W_k$, $W_v$ are trainable parameters; $h^{i}$ represents the hidden layer output of the $i$-th layer; $Q^{i}$ denotes the query vector of the $i$-th layer; $K^{i}$ represents the key vector of the $i$-th layer; and $V^{i}$ is the value vector of the $i$-th layer.
Subsequently, the softmax function was modified by injecting the entity–relationship visible matrix $M$ as shown in (5):
$$S^{i+1} = \mathrm{softmax}\left(\frac{Q^{i+1} {K^{i+1}}^{\top} + M}{\sqrt{d_k}}\right) \tag{5}$$
where $M$ represents the entity visible matrix, $S^{i}$ denotes the relevance score of the $i$-th layer, and $d_k$ denotes a predefined scaling factor in the model. Finally, the hidden layer output for the next layer is updated as shown in (6).
$$h^{i+1} = S^{i+1} V^{i+1} \tag{6}$$
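The effect of adding the visible matrix inside the softmax can be checked numerically. The sketch below implements a single-head version of (4)–(6) with NumPy; the dimensions, the random weights, and the function name `mask_self_attention` are assumptions for illustration, and multi-head splitting, residual connections, and layer normalization are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def mask_self_attention(h, W_q, W_k, W_v, M):
    """One single-head mask-self-attention step following Eqs. (4)-(6)."""
    Q, K, V = h @ W_q, h @ W_k, h @ W_v           # Eq. (4)
    d_k = Q.shape[-1]
    S = softmax((Q @ K.T + M) / np.sqrt(d_k))     # Eq. (5)
    return S @ V, S                               # Eq. (6)


rng = np.random.default_rng(0)
n, d = 3, 4                                       # toy sizes, not the paper's
h = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))

# Tokens 0 and 2 are mutually invisible; every token sees itself and token 1.
M = np.array([[0.0, 0.0, -np.inf],
              [0.0, 0.0, 0.0],
              [-np.inf, 0.0, 0.0]])

h_next, S = mask_self_attention(h, W_q, W_k, W_v, M)
print(np.round(S, 3))
```

Because $\exp(-\infty) = 0$, masked pairs receive exactly zero attention weight while each row of $S$ still sums to one, so injected knowledge cannot influence tokens it is unrelated to.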

3.2. Co-Interactive Module

The co-interactive module aims to establish a bidirectional connection between intent detection and slot filling and consists of three key components: an intent and slot label attention layer, a co-interactive attention layer, and an extended feed-forward network layer [12].
First, the intent and slot label attention layers are used to obtain slot and intent representations. Second, the co-interactive attention layer replaces traditional self-attention, explicitly establishing a relationship between intent and slots. Finally, the extended feed-forward network layer implicitly combines intent and slot information. This process captures and integrates intent and slot information through both explicit and implicit methods, thereby establishing a bidirectional connection between intent and slot information.

3.2.1. Intent and Slot Label Attention Layer

The intent and slot label attention layer uses a label attention mechanism to compute the label attention for the hidden state of each token in the input sequence, thereby obtaining explicit representations of the intent and slots.
Specifically, unlike the hidden states $H$ of the input sequence used in [12], the hidden states $H$ that are obtained after the knowledge graph-based shared encoding layer not only include the token ids and seg-ids for each token in the input sequence but also include visible mask information. The hidden states $H \in \mathbb{R}^{n \times d}$ ($n$ represents the number of tokens in the input sequence, $d$ represents the hidden layer dimension) are used as queries, whereas the label embedding matrix $W^{V} \in \mathbb{R}^{d \times |V^{label}|}$ ($V \in \{I, S\}$, where $I$ represents intent, $S$ represents slot, and $|V^{label}|$ represents the number of labels for either intent or slot) serves as the key and value. The hidden state $H$ is dotted with the label embedding matrix $W^{V}$, and the result is normalized using the softmax function to obtain the attention weight matrix $A$ as shown in (7):
$$A = \mathrm{softmax}(H W^{V}) \tag{7}$$
By performing a linear combination of the attention weight matrix $A$ and label embedding matrix $W^{V}$, the original hidden state $H$ is updated to obtain the final intent or slot representation $H_V$ as shown in (8):
$$H_V = H + A W^{V} \tag{8}$$
In particular, the label embedding matrix $W^{V}$ is represented by the parameters of the fully connected slot filling decoder layer and intent detection decoder layer. This means that the intent detection embedding matrix $W^{I} \in \mathbb{R}^{d \times |I^{label}|}$ and the slot filling embedding matrix $W^{S} \in \mathbb{R}^{d \times |S^{label}|}$ are actually the weight matrices of the decoder layers ($|S^{label}|$ and $|I^{label}|$ represent the number of slot and intent labels, respectively). Through the above calculation, the intent representation $H_I \in \mathbb{R}^{n \times d}$ and slot representation $H_S \in \mathbb{R}^{n \times d}$, which capture semantic information, are obtained.
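A minimal NumPy sketch of the label attention in (7) and (8) is given below. The sizes and random values are placeholders, and the function name `label_attention` is hypothetical. One assumption is made explicit: with the label matrix stored as a $d \times$ (number of labels) array, the product in (8) only has matching shapes if the matrix is transposed, so the sketch uses the transpose there.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def label_attention(H, W_label):
    """Label attention: H is (n, d), W_label is (d, num_labels)."""
    A = softmax(H @ W_label, axis=-1)   # Eq. (7): one distribution over labels per token
    # Eq. (8); the transpose is an assumption needed for the shapes to agree.
    H_label = H + A @ W_label.T
    return H_label, A


rng = np.random.default_rng(0)
n, d = 5, 8                             # toy sizes, not the paper's settings
num_intents, num_slots = 3, 4

H = rng.normal(size=(n, d))             # stands in for the shared encoder output
W_I = rng.normal(size=(d, num_intents)) # intent decoder weights reused as label embeddings
W_S = rng.normal(size=(d, num_slots))   # slot decoder weights reused as label embeddings

H_I, A_I = label_attention(H, W_I)
H_S, A_S = label_attention(H, W_S)
print(H_I.shape, A_I.shape, H_S.shape, A_S.shape)
```

Each row of the attention matrix is a soft assignment of one token to the labels, so the update adds a label-weighted summary to every token's hidden state.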

3.2.2. Co-Interactive Attention Layer

The co-interactive attention layer explicitly establishes a bidirectional relationship between intent detection and slot filling tasks. The co-interactive attention layer enables intent detection and slot filling to interact with each other, so that the slot representation H S is complemented by the intent information and the intent representation H I is enhanced by the slot information.
Specifically, the intent representation $H_I$ and the slot representation $H_S$, obtained from the intent and slot label attention layers, are mapped through different linear transformations to form query ($Q_I$, $Q_S$), key ($K_I$, $K_S$), and value ($V_I$, $V_S$) matrices. Using $Q_I$ as the query, $K_S$ as the key, and $V_S$ as the value, the dot product of the query vector $Q_I$ and the transpose vector $K_S^{\top}$ of the slot key vector $K_S$ is computed, scaled, and then passed through the softmax function. This is then multiplied by the slot value vector $V_S$ to obtain the intent attention vector $C_I$ containing slot information as shown in (9).
$$C_I = \mathrm{softmax}\left(\frac{Q_I K_S^{\top}}{\sqrt{d_k}}\right) V_S \tag{9}$$
The updated intent representation $H'_I \in \mathbb{R}^{n \times d}$, which incorporates slot information, is obtained by adding $C_I$ to the original intent representation $H_I$ and then applying layer normalization as shown in (10).
$$H'_I = \mathrm{LN}(H_I + C_I) \tag{10}$$
where LN denotes the layer normalization function.
By aligning the intent with its corresponding slot information, a representation of the intent that contains the corresponding slot information is obtained. The degree of association between the intent and slot is measured using the attention weights, and the intent representation is adjusted to obtain the most relevant intent representation of the slot information.
Similarly, using $Q_S$ as the query, $K_I$ as the key, and $V_I$ as the value, the slot representation $H'_S \in \mathbb{R}^{n \times d}$ containing intent information can be obtained.
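The two attention directions can be sketched with one shared helper, as below. This is an illustration under stated assumptions rather than the authors' code: the helper names `cross_attend` and `co_interactive_attention`, the random projection weights, and the toy sizes are hypothetical, the attention is single-head, and the layer normalization omits the learnable gain and bias.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature axis, without learnable gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def cross_attend(H_query, H_context, W_q, W_k, W_v):
    """Update H_query with information attended from H_context, as in Eqs. (9)-(10)."""
    Q = H_query @ W_q
    K = H_context @ W_k
    V = H_context @ W_v
    C = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V   # Eq. (9)
    return layer_norm(H_query + C)                     # Eq. (10)


def co_interactive_attention(H_I, H_S, params):
    """Run both directions from the same inputs: slot-aware intent and intent-aware slot."""
    H_I_new = cross_attend(H_I, H_S, *params["intent"])  # Q_I, K_S, V_S
    H_S_new = cross_attend(H_S, H_I, *params["slot"])    # Q_S, K_I, V_I
    return H_I_new, H_S_new


rng = np.random.default_rng(0)
n, d = 5, 8                                              # toy sizes
H_I = rng.normal(size=(n, d))
H_S = rng.normal(size=(n, d))
params = {
    "intent": [rng.normal(size=(d, d)) for _ in range(3)],
    "slot": [rng.normal(size=(d, d)) for _ in range(3)],
}
H_I_new, H_S_new = co_interactive_attention(H_I, H_S, params)
print(H_I_new.shape, H_S_new.shape)
```

Both directions read the representations produced by the label attention layer, so neither update depends on the other having been computed first.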

3.2.3. Extended Feed-Forward Network Layer

The feed-forward network layer is extended to implicitly fuse the intent representation $H'_I$ with the slot representation $H'_S$, forming a new representation $H_{IS}$ that contains both intent and slot information as shown in (11):
$$H_{IS} = H'_I \oplus H'_S \tag{11}$$
where $H_{IS} = (h_{IS}^{1}, h_{IS}^{2}, \ldots, h_{IS}^{n})$ and $\oplus$ represents concatenation.
To better understand the meaning expressed by each token, the features of adjacent tokens are combined to form a new feature vector $h_{(f,t)}$ [6] containing contextual information, which is represented as shown in (12):
$$h_{(f,t)}^{t} = h_{IS}^{t-1} \oplus h_{IS}^{t} \oplus h_{IS}^{t+1} \tag{12}$$
Finally, the FFN layer fuses the intent information with the slot information to obtain the results shown in (13)–(15):
$$\mathrm{FFN}(H_{(f,t)}) = \max(0, H_{(f,t)} W_1 + b_1) W_2 + b_2 \tag{13}$$
$$\hat{H}_I = \mathrm{LN}(H'_I + \mathrm{FFN}(H_{(f,t)})) \tag{14}$$
$$\hat{H}_S = \mathrm{LN}(H'_S + \mathrm{FFN}(H_{(f,t)})) \tag{15}$$
where $H_{(f,t)} = (h_{(f,t)}^{1}, h_{(f,t)}^{2}, \ldots, h_{(f,t)}^{t})$; $\hat{H}_I$ and $\hat{H}_S$ are the intent and slot representations, respectively, which are obtained through the fusion of information at the FFN layer and subsequent layer normalization.
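The fusion in (11)–(15) can be traced through shapes with a short NumPy sketch. The text does not say how the first and last tokens are handled, so zero padding at the sequence boundaries is assumed here; the helper names `context_window` and `extended_ffn`, the hidden width `d_ff`, and the random weights are likewise illustrative.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Layer normalization over the feature axis, without learnable gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)


def context_window(H_IS):
    """Eq. (12): concatenate each token with its left and right neighbours.

    Zero padding at both ends of the sequence is an assumption.
    """
    padded = np.pad(H_IS, ((1, 1), (0, 0)), mode="constant")
    return np.concatenate([padded[:-2], padded[1:-1], padded[2:]], axis=-1)


def extended_ffn(H_I, H_S, W1, b1, W2, b2):
    """Eqs. (11)-(15): fuse the intent and slot representations with one shared FFN."""
    H_IS = np.concatenate([H_I, H_S], axis=-1)           # Eq. (11): (n, 2d)
    H_ft = context_window(H_IS)                          # Eq. (12): (n, 6d)
    ffn = np.maximum(0.0, H_ft @ W1 + b1) @ W2 + b2      # Eq. (13): (n, d)
    return layer_norm(H_I + ffn), layer_norm(H_S + ffn)  # Eqs. (14) and (15)


rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 16                                    # toy sizes
H_I = rng.normal(size=(n, d))
H_S = rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(6 * d, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

H_I_hat, H_S_hat = extended_ffn(H_I, H_S, W1, b1, W2, b2)
print(H_I_hat.shape, H_S_hat.shape)
```

The same FFN output is added to both representations, which is what makes this step an implicit fusion: the intent and slot streams share one set of weights over their concatenated features.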

3.3. Decoder for Slot Filling and Intent Detection

To ensure sufficient information interaction between intent detection and slot filling, reference [12] employed a multi-layer stacked co-interactive attention network. After $K$ layers of stacking, the updated intent representation $\hat{H}_I^{(K)}$ and slot representation $\hat{H}_S^{(K)}$ are obtained as shown in (16) and (17):
$$\hat{H}_I^{(K)} = \left(\hat{h}_{(I,1)}^{(K)}, \hat{h}_{(I,2)}^{(K)}, \ldots, \hat{h}_{(I,n)}^{(K)}\right) \tag{16}$$
$$\hat{H}_S^{(K)} = \left(\hat{h}_{(S,1)}^{(K)}, \hat{h}_{(S,2)}^{(K)}, \ldots, \hat{h}_{(S,n)}^{(K)}\right) \tag{17}$$
The max-pooling operation [30] is applied to $\hat{H}_I^{(K)}$ to obtain the representation $c$ of the entire sentence, which is passed as an input to a fully connected layer. The probability distribution of intent detection $\hat{y}^{I}$ is computed using the softmax function, and the index corresponding to the maximum value in the distribution is selected as the intent label $o^{I}$ as shown in (18) and (19):
$$\hat{y}^{I} = \mathrm{softmax}(W^{I} c + b_S) \tag{18}$$
$$o^{I} = \arg\max(\hat{y}^{I}) \tag{19}$$
where $W^{I}$ and $b_S$ are trainable model parameters.
A linear transformation is applied to $\hat{H}_S^{(K)}$, and a standard CRF layer [31] is used to model the dependencies between labels as shown in (20) and (21):
$$O_S = W^{S} \hat{H}_S^{(K)} + b_S \tag{20}$$
$$P(\hat{y} \mid O_S) = \frac{\sum_{i=1} \exp f(y_{i-1}, y_i, O_S)}{\sum_{y'} \sum_{i=1} \exp f(y'_{i-1}, y'_i, O_S)} \tag{21}$$
where $f(y_{i-1}, y_i, O_S)$ represents the transition score from $y_{i-1}$ to $y_i$, and $\hat{y}$ represents the predicted label sequence.
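The two decoders can be sketched as follows. The intent branch follows (18) and (19) directly. For the slot branch, the text only gives the CRF probability, so the sketch assumes that the best label sequence is found with standard Viterbi decoding over the emission scores $O_S$ and a label transition matrix. All names (`decode_intent`, `decode_slots`, `viterbi`, `b_I`), sizes, and weights are hypothetical, and the weights are applied as right-multiplications so that the shapes agree.

```python
import numpy as np


def decode_intent(H_I, W_I, b_I):
    """Eqs. (18)-(19): max-pool over tokens, project, softmax, argmax."""
    c = H_I.max(axis=0)                      # sentence vector, shape (d,)
    logits = c @ W_I + b_I                   # shape (num_intents,)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    return int(probs.argmax()), probs


def viterbi(emissions, transitions):
    """Highest-scoring label path; transitions[p, q] scores moving from label p to q."""
    n, _ = emissions.shape
    score = emissions[0].copy()
    back = np.zeros(emissions.shape, dtype=int)
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t][None, :]  # (previous, current)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def decode_slots(H_S, W_S, b_S, transitions):
    """Eq. (20) followed by Viterbi decoding (the decoding step is an assumption)."""
    emissions = H_S @ W_S + b_S              # O_S, shape (n, num_slots)
    return viterbi(emissions, transitions)


rng = np.random.default_rng(0)
n, d, num_intents, num_slots = 5, 8, 3, 4    # toy sizes
H_I_hat = rng.normal(size=(n, d))
H_S_hat = rng.normal(size=(n, d))
intent, probs = decode_intent(H_I_hat, rng.normal(size=(d, num_intents)), np.zeros(num_intents))
slots = decode_slots(H_S_hat, rng.normal(size=(d, num_slots)), np.zeros(num_slots),
                     rng.normal(size=(num_slots, num_slots)))
print(intent, slots)
```

The transition matrix is what lets the slot decoder reject label sequences that are individually plausible but jointly inconsistent, for example an inside tag that does not follow a matching begin tag.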

4. Experimental Results and Analysis

4.1. Experimental Data

This paper conducts experimental analysis using two datasets from the medical field and the construction safety field.
The IMCS21 dataset [32] is a benchmark dataset designed for automatic medical consultation systems, consisting of a total of 4116 physician–patient dialogue case samples. It includes 2472 dialogues in the training set, 833 dialogues in the validation set, and 811 dialogues in the test set. The IMCS21 dataset covers 10 pediatric diseases and encompasses five types of named entities and 16 dialogue intents. The complexity of the IMCS21 dataset lies primarily in its normalization of entities such as symptoms to address synonyms or different expressions. More than 1900 common symptoms have been standardized into 444 canonical names, facilitating the model’s ability to understand similar expressions and reducing ambiguities. To ensure data quality, some samples with incomplete information or too few dialogue turns were excluded from the original data in this paper. MedicalKG is a knowledge graph in the medical domain constructed by Liu et al. [29]. It contains four types of named entities, namely symptoms, diseases, parts, and treatments, and comprises 13,864 triples.
The construction safety dataset used in this paper is self-constructed. The construction safety accident knowledge graph comprises 1156 entities, 2200 relations, and 45,772 knowledge triples, organized into six categories of entities and five types of relationships. Based on the professional knowledge stored in this knowledge graph and the practical application scenarios provided by practitioners, 2648 utterances related to construction safety were constructed under the guidance of experts. This utterance dataset covers five categories of intents.

4.2. Model Training

The experiment adopts the Adam optimizer to tune the parameters of the CIMKG model. The hyperparameter settings are shown in Table 1.
In Table 1, “batch size” is the number of samples processed in a single training iteration. “K-BERT encoder hidden units” is the number of hidden units in the knowledge graph-based shared encoder. “co-interactive module hidden units” is the number of hidden units in the co-interactive module, and “co-interactive module number” is the number of stacked co-interactive modules. “attention_dropout” is the dropout rate applied to the attention layer parameters, and “learning_rate” is the learning rate of the model.
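For readers who wish to reproduce the setup, the settings in Table 1 can be collected in a single configuration object. The sketch below is one possible arrangement; the class name and field names are assumptions introduced for this example, and only the numeric values come from Table 1.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters from Table 1 (field names are illustrative)."""
    batch_size: int = 16                     # samples per training iteration
    encoder_hidden_units: int = 128          # "K-BERT encoder hidden units"
    co_interactive_hidden_units: int = 128   # "co-interactive module hidden units"
    co_interactive_module_number: int = 2    # number of stacked co-interactive modules
    attention_dropout: float = 0.2           # dropout rate in the attention layers
    learning_rate: float = 5e-5              # learning rate used with Adam

config = TrainingConfig()
for name, value in asdict(config).items():
    print(f"{name}: {value}")
```

Freezing the dataclass prevents the settings from being changed accidentally during a run, which keeps logged hyperparameters consistent with the values actually used.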
The CIMKG model was trained using the hyperparameters listed in Table 1. The changes in the loss during the training process are shown in Figure 7.
The “Loss vs. Step” panel on the left of Figure 7 shows the loss at each training iteration. The loss is very high at the start of training, decreases rapidly as training progresses, and then levels off, which indicates that training has stabilized. The “Average Loss per Epoch” panel on the right shows the mean loss for each training epoch. The average loss drops sharply within the first few epochs and then decreases slowly and smoothly. The stable, low loss at the end of training suggests that the model fitted the training data effectively.
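The relationship between the two panels can be stated in a few lines of code: the per-epoch curve is the mean of the per-step losses within each epoch. The sketch below assumes a fixed number of steps per epoch; the function name and the loss values are hypothetical and are not the values plotted in Figure 7.

```python
def average_loss_per_epoch(step_losses, steps_per_epoch):
    """Group per-step losses into epochs and return the mean loss of each epoch.

    A final, shorter group is averaged over the steps it actually contains.
    """
    if steps_per_epoch <= 0:
        raise ValueError("steps_per_epoch must be positive")
    averages = []
    for start in range(0, len(step_losses), steps_per_epoch):
        chunk = step_losses[start:start + steps_per_epoch]
        averages.append(sum(chunk) / len(chunk))
    return averages

# Hypothetical per-step losses for 3 epochs of 2 steps each.
step_losses = [4.0, 2.0, 1.0, 0.6, 0.5, 0.3]
epoch_losses = average_loss_per_epoch(step_losses, steps_per_epoch=2)
print(epoch_losses)
```

Averaging smooths out the step-to-step noise, which is why the per-epoch curve appears more regular than the per-step curve.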

4.3. Experimental Result Analysis

To evaluate the CIMKG model quantitatively, the $F_1$ score is used to assess slot filling performance and accuracy is used to assess intent detection. In addition, overall accuracy is used to measure sentence-level semantic frame parsing: it is the percentage of utterances, out of all samples, for which both the intent label and the slot labels are correctly identified.
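The three metrics can be sketched as follows. The example assumes BIO-style slot tags and a span-level, micro-averaged $F_1$ in which a predicted slot counts as correct only if its type and boundaries both match; the paper does not state these details, so they are assumptions, as are the function names and the toy intent and slot labels.

```python
def extract_spans(tags):
    """Collect (slot_type, start, end) spans from a BIO tag sequence."""
    spans, start, slot = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and slot == tag[2:]
        if slot is not None and not continues:
            spans.add((slot, start, i))
            start, slot = None, None
        if tag.startswith("B-"):
            start, slot = i, tag[2:]
    return spans

def evaluate(gold_intents, pred_intents, gold_tags, pred_tags):
    """Return slot F1, intent accuracy, and overall (sentence-level) accuracy."""
    n = len(gold_intents)
    tp = n_pred = n_gold = intent_hits = frame_hits = 0
    for gi, pi, gt, pt in zip(gold_intents, pred_intents, gold_tags, pred_tags):
        gold_spans, pred_spans = extract_spans(gt), extract_spans(pt)
        tp += len(gold_spans & pred_spans)
        n_pred += len(pred_spans)
        n_gold += len(gold_spans)
        intent_ok = gi == pi
        intent_hits += intent_ok
        # An utterance counts only if the intent and every slot tag are correct.
        frame_hits += intent_ok and list(gt) == list(pt)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"slot_f1": f1, "intent_acc": intent_hits / n, "overall_acc": frame_hits / n}

# Toy predictions (hypothetical labels) for three utterances.
gold_intents = ["ask_symptom", "ask_treatment", "ask_cause"]
pred_intents = ["ask_symptom", "ask_treatment", "ask_symptom"]
gold_tags = [["B-symptom", "I-symptom", "O"], ["O", "B-disease", "O"], ["B-disease", "O"]]
pred_tags = [["B-symptom", "I-symptom", "O"], ["O", "O", "O"], ["B-disease", "O"]]
scores = evaluate(gold_intents, pred_intents, gold_tags, pred_tags)
print(scores)
```

In this toy example, the second utterance fails on a slot and the third fails on the intent, so only one of the three utterances counts toward overall accuracy even though each individual metric is higher. This is why overall accuracy is the strictest of the three measures.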
Additionally, to fully demonstrate the effectiveness of the CIMKG model, Joint BERT and BERT-DCA-Net [12] were introduced as baseline models to compare their performance with that of the CIMKG model in domain-specific intent detection and slot filling. The experimental results are listed in Table 2.
As shown in Table 2, the proposed CIMKG model achieves the highest slot $F_1$ score and the highest overall accuracy in both domains, as well as the highest intent accuracy on the medical data. On the construction safety data, intent accuracy is close to saturation for all models (99.26 for CIMKG and BERT-DCA-Net versus 99.30 for Joint BERT). These results indicate that the CIMKG model gains a deeper understanding of the semantic information in domain-specific utterances, which improves the overall performance of the joint task of intent detection and slot filling.
A comparison between Joint BERT and CIMKG shows that CIMKG achieves better overall performance on short utterances in specialized domains. There are two main reasons. First, CIMKG strengthens the utterance representation by injecting domain-specific knowledge from a knowledge graph. The Slot ($F_1$) results in Table 2 show that CIMKG outperforms Joint BERT in both the medical and construction safety domains, which indicates that injecting knowledge graph triples into utterances helps to identify specialized or easily confused terminology. Second, CIMKG adopts a co-interactive approach in which intent detection and slot filling exchange information and guide each other in both directions, whereas Joint BERT relies solely on parameter sharing to learn the correlation between the two tasks implicitly. While slots are being identified, the slot information guides intent recognition, and the intent information in turn influences slot filling. The benefit of explicit interaction is also visible in the comparison between Joint BERT and BERT-DCA-Net in Table 2. Compared with BERT-DCA-Net, which also performs joint intent detection and slot filling in a co-interactive manner, CIMKG additionally incorporates a knowledge graph, which improves performance in specialized domains. As a result, CIMKG outperforms BERT-DCA-Net on Slot ($F_1$) and Overall (Accuracy) in both domains and on Intent (Accuracy) in the medical domain, while the two models obtain the same Intent (Accuracy) of 99.26 on the construction safety data.
To further validate the performance of the proposed CIMKG, we conducted ablation experiments, and the results are shown in Table 3.
Removing the intent attention layer from CIMKG yields a model named “without intent attention layer”, in which the original hidden states $H$ replace the hidden states $H_I$ that the intent attention layer would have produced. Similarly, the model “without slot attention layer” is obtained by removing only the slot attention layer. As shown in Table 3, both models score lower than CIMKG on slot $F_1$ and overall accuracy in both domains and on intent accuracy in the medical domain, which indicates that explicit intent and slot representations are crucial for the co-interactive layer between the two tasks. To verify the role of bidirectional information flow, only one direction of the information flow was then retained, giving the models “with intent-to-slot” and “with slot-to-intent”. As shown in Table 3, CIMKG performs better with bidirectional information flow than with either single direction. In all ablations, intent accuracy on the construction safety data remains at 99.26. Overall, the ablation results indicate that modelling the interaction between intent detection and slot filling enhances the performance of both tasks, which further demonstrates the strong correlation between them.

5. Conclusions and Future Work

This study proposes CIMKG, a model that jointly performs intent detection and slot filling with the support of a knowledge graph. CIMKG targets the underperformance of general pretrained language models on intent detection and slot filling in specific domains. Specifically, it employs a knowledge graph-based encoder module to tackle the knowledge noise that may arise when domain-specific expertise is introduced, and it uses a co-interactive module to enable a bidirectional flow of intent and slot information. This exploits the interrelationship between intents and slots and thereby enhances the overall performance of intent detection and slot filling in specific domains. Experiments were conducted using knowledge graphs and related user question–answer data in the medical and construction safety domains. The results show that CIMKG mitigates the entity recognition and intent detection errors caused by insufficient prior knowledge in both domains and improves intent detection and slot filling performance relative to the baseline models. However, the IMCS21 dataset contains 16 categories of intents, which makes the intent detection task relatively complex. Although the softmax function is used to predict the probability distribution over the categories, multi-class classification with this many categories makes it harder for the model to distinguish between intents, which leads to a relatively low Intent (Acc). Furthermore, Overall (Acc) counts only the utterances for which both the intent and the slots are correctly predicted, so it depends on the outcome of slot filling as well as on the accuracy of intent detection. Consequently, the overall results on this dataset are lower than those on the construction safety dataset.
In addition, this study focused only on single-intent scenarios. Future research will build on this foundation to improve the model’s ability to handle multi-intent scenarios. Specifically, a dedicated multi-intent processing module will be designed on the basis of CIMKG. This module will identify multiple intents within an utterance and process each intent independently, thereby reducing the task to a set of single-intent problems.

Author Contributions

Conceptualization, W.Z. and Y.G.; methodology, W.Z. and Y.G.; software, W.Z., Z.X., L.W., S.J., and X.Z.; validation, W.Z., Z.X., and X.Z.; formal analysis, W.Z., Y.G., L.W., and S.J.; resources, Y.G.; data curation, W.Z., Z.X., S.J., and G.Y.; writing—original draft preparation, W.Z. and G.Y.; writing—review and editing, Y.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Major Scientific & Technological Projects of Shandong Province under Grant 2021CXGC011204 and in part by the National Natural Science Foundation of Shandong Province under Grant ZR021MG054.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article and further inquiries can be directed to the corresponding author.

Acknowledgments

We thank those anonymous reviewers whose comments/suggestions helped improve and clarify this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wei, P.F.; Zeng, B.; Wang, M.H.; Zeng, A. Overview of joint modeling algorithms for spoken language understanding based on deep learning. J. Softw. 2022, 33, 4192–4216.
  2. Yang, F.; Rao, Y.; Ding, Y.; He, W.B.; Ding, Z.F. The research progress in task-oriented dialogue systems. Chin. Inf. J. 2021, 35, 1–20.
  3. Bi, R.; Wang, Y.; Zhou, X. Unknown intent detection for task-oriented dialogue based on reconstruction error. Comput. Eng. 2023, 49, 54–60.
  4. Bi, R.; Yang, F.Y.; Zhou, X.; Yang, Y.T.; Abibula, A. A joint identification method for intent and slot based on cloze test with small samples. Comput. Eng. 2024, 1–12.
  5. Zhou, C.; Wang, C.; Xia, Y.; Du, L. Joint recognition algorithm for intent and semantic slot in human-machine dialogue for industrial operation and maintenance. Res. Comput. Appl. 2024, 41, 3645–3650.
  6. Chen, Q.; Zhou, Z.; Wang, W. BERT for joint intent classification and slot filling. arXiv 2019, arXiv:1902.10909.
  7. Guo, X.C.; Hao, X.; Yao, X.C.; Li, L. Joint intent detection and slot filling of knowledge question answering for agricultural diseases and pests. Trans. Chin. Soc. Agric. Mach. 2023, 54, 205–215.
  8. Zhou, T.Y.; Fan, Y.Q.; Du, Y.J.; Li, X.Y. Joint model of intent detection and slot filling based on fine-grained information integration. Res. Comput. Appl. 2023, 40, 2669–2673.
  9. Goo, C.W.; Gao, G.; Hsu, Y.K.; Huo, C.L.; Chen, T.C.; Hsu, K.W.; Chen, Y.N. Slot-gated modeling for joint slot filling and intent prediction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, LA, USA, 1–6 June 2018; Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 753–757.
  10. Li, C.L.; Li, L.; Qi, J. A self-attentive model with gate mechanism for spoken language understanding. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 3824–3833.
  11. Qin, L.B.; Che, W.X.; Li, Y.M.; Wen, H.Y.; Liu, T. A stack-propagation framework with token-level intent detection for spoken language understanding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Association for Computational Linguistics: Hong Kong, China, 2019; pp. 2078–2087.
  12. Qin, L.B.; Liu, T.L.; Che, W.X.; Kang, B.B.; Zhao, S.D.; Liu, T. A co-interactive transformer for joint slot filling and intent detection. arXiv 2020, arXiv:2010.03880.
  13. Haffner, P.; Tur, G.; Wright, J.H. Optimizing SVMs for complex call classification. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), Hong Kong, China, 6–10 April 2003; pp. 632–635.
  14. Fu, B.; Liu, T. Implicit user consumption intent recognition in social media. Ruan Jian Xue Bao/J. Softw. 2016, 27, 2843–2854.
  15. Ravuri, S.; Stolcke, A. Recurrent neural network and LSTM models for lexical utterance classification. In Interspeech 2015; ISCA: Shanghai, China, 2015; pp. 6075–6079.
  16. Zhang, Z.C.; Zhang, Z.W.; Zhang, Z.M. User intent classification based on IndRNN-attention. J. Comput. Res. Dev. 2019, 56, 1517–1524.
  17. Zhou, J.Z.; Zhu, Z.K.; He, Z.Q.; Chen, W.L.; Zhang, Z.M. Hybrid neural network models for human-machine dialogue intention classification. Ruan Jian Xue Bao/J. Softw. 2019, 30, 3313–3325.
  18. Yao, K.S.; Zweig, G.; Hwang, M.Y.; Shi, Y.Y.; Yu, D. Recurrent neural networks for language understanding. In Interspeech 2013; ISCA: Lyon, France, 2013; pp. 2523–2527.
  19. Yao, K.; Peng, B.; Zweig, G.; Yu, D.; Li, X.; Gao, F. Recurrent conditional random field for language understanding. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 4077–4081.
  20. Mesnil, G.; He, X.; Deng, L.; Bengio, Y. Investigation of recurrent-neural-network architectures and learning methods for spoken language understanding. In Proc. Interspeech 2013; ISCA: Lyon, France, 2013; pp. 3771–3775.
  21. Yao, K.; Peng, B.; Zhang, Y.; Yu, D.; Zweig, G.; Shi, Y. Spoken language understanding using long short-term memory neural networks. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT), South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 189–194.
  22. Simonnet, E.; Camelin, N.; Deléglise, P.; Estève, Y. Exploring the use of attention-based recurrent neural networks for spoken language understanding. arXiv 2015, arXiv:1511.00569.
  23. Zhu, S.; Yu, K. Encoder-decoder with focus-mechanism for sequence labelling based spoken language understanding. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5675–5679.
  24. Yang, C.L.B. AISE: Attending to Intent and Slots Explicitly for better spoken language understanding. Eur. J. Med. Chem. 2020, 211, 112481–112490.
  25. Liu, B.; Lane, I. Attention-based recurrent neural network models for joint intent detection and slot filling. In Interspeech 2016; ISCA: San Francisco, CA, USA, 2016; pp. 685–689.
  26. Dao, M.H.; Truong, T.H.; Nguyen, D.Q. Intent detection and slot filling for Vietnamese. arXiv 2021, arXiv:2104.02021.
  27. Cao, P.; Yang, Z.; Li, X.; Li, Y. A Character-Word Information Interaction Framework for Natural Language Understanding in Chinese Medical Dialogue Domain. Appl. Sci. 2024, 14, 8926.
  28. Li, H.; Zhang, Z.; Yan, Y.; Yue, Y. Enhanced domain multi-modal entity recognition based on knowledge graph. Comput. Eng. 2024, 50, 31–39.
  29. Liu, W.; Zhou, P.; Zhao, Z.; Wang, Z.; Ju, Q.; Deng, H.; Wang, P. K-BERT: Enabling language representation with knowledge graph. arXiv 2019, arXiv:1909.07606.
  30. Yoon, K. Convolutional neural networks for sentence classification. arXiv 2014, arXiv:1408.5882.
  31. Hai, E.; Pei, N.; Zhong, C.; Meina, S. A novel bi-directional interrelated model for joint intent detection and slot filling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Florence, Italy, 2019; pp. 5467–5471.
  32. Chen, W.; Li, Z.; Fang, H.; Yao, Q.; Zhong, C.; Hao, J.; Zhang, Q.; Huang, X.; Peng, J.; Wei, Z. A benchmark for automatic medical consultation system: Frameworks, tasks, and datasets. Bioinformatics 2022, 39, btac817.
Figure 1. Example of intent detection and slot filling.
Figure 2. Overall framework of the CIMKG model.
Figure 3. Knowledge graph-based shared encoding process.
Figure 4. Sentence tree structure.
Figure 5. Text embedding representation.
Figure 6. Entity visible matrix.
Figure 7. Training loss curve.
Table 1. Model hyperparameters.

| Parameters | Numerical Value |
|---|---|
| batch size | 16 |
| K-BERT encoder hidden units | 128 |
| co-interactive module hidden units | 128 |
| co-interactive module number | 2 |
| attention_dropout | 0.2 |
| learning_rate | $5 \times 10^{-5}$ |
Table 2. Comparison of experimental results.

| Model | Medical: Slot ($F_1$) | Medical: Intent (Acc) | Medical: Overall (Acc) | Construction Safety: Slot ($F_1$) | Construction Safety: Intent (Acc) | Construction Safety: Overall (Acc) |
|---|---|---|---|---|---|---|
| Joint BERT | 97.82 | 72.82 | 64.56 | 92.61 | 99.30 | 85.61 |
| BERT-DCA-Net | 97.99 | 73.38 | 64.79 | 97.87 | 99.26 | 86.03 |
| CIMKG | 98.16 | 73.56 | 65.86 | 98.02 | 99.26 | 87.13 |
Table 3. Comparison of ablation experiment results.

| Model | Medical: Slot ($F_1$) | Medical: Intent (Acc) | Medical: Overall (Acc) | Construction Safety: Slot ($F_1$) | Construction Safety: Intent (Acc) | Construction Safety: Overall (Acc) |
|---|---|---|---|---|---|---|
| without intent attention layer | 98.12 | 73.33 | 65.39 | 97.77 | 99.26 | 86.40 |
| without slot attention layer | 98.15 | 73.24 | 65.53 | 97.63 | 99.26 | 85.66 |
| with intent-to-slot | 98.10 | 73.44 | 65.57 | 97.45 | 99.26 | 85.29 |
| with slot-to-intent | 97.98 | 73.33 | 64.91 | 97.58 | 99.26 | 85.65 |
| CIMKG | 98.16 | 73.56 | 65.86 | 98.02 | 99.26 | 87.13 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhang, W.; Gao, Y.; Xu, Z.; Wang, L.; Ji, S.; Zhang, X.; Yuan, G. Research on Co-Interactive Model Based on Knowledge Graph for Intent Detection and Slot Filling. Appl. Sci. 2025, 15, 547. https://doi.org/10.3390/app15020547
