1. Introduction
Spoken language understanding [1], which serves as the core task in task-oriented dialogue systems [2,3], aims to transform users’ natural language inputs into structured semantic representations [4]. The effectiveness of spoken language understanding directly affects the overall performance of a dialogue system [1]. Intent detection and slot filling, two fundamental subtasks in spoken language understanding, are crucial for comprehending and processing user requests [5]. Intent detection is generally regarded as a sentence-level text classification task that aims to understand the underlying intent of a user’s request and categorize the given utterance into predefined intent categories [6,7]. Slot filling can be viewed as a character-level sequence classification task that aims to extract specific semantic concepts from the utterance [6,7]. The accurate identification of intents and slots can enhance spoken language understanding, which in turn improves the overall performance of the dialogue system.
Intent detection and slot filling share a common foundation in language understanding and are highly dependent on each other. Given the utterance “What causes high HbA1c levels in patients?”, its intent category is “Request-Etiology”, and the slot is labelled “B-Check”. An example of intent detection and slot filling for this utterance is shown in Figure 1.
In this example, when the dialogue system recognizes the user’s intent category as “Request-Etiology”, the detected intent guides the system to focus on “HbA1c”, which is then classified into the named entity category “Medical_Examination”. Simultaneously, after the system recognizes “HbA1c” as a specific medical examination, it further detects the user’s intent to inquire about the specific reasons for any abnormalities in the examination results. This example demonstrates that intent detection can facilitate the determination of the slot type, and the information provided by the slot can, in turn, help the dialogue system more accurately detect the user’s intent.
Owing to the strong correlation between intent detection and slot filling, joint models based on multi-task learning frameworks have been widely proposed to tackle both tasks simultaneously [8]. However, certain existing joint models [9,10,11] exclusively employ intent information in a unidirectional fashion to optimize slot filling while overlooking the potential guiding influence that slot information could exert on intent detection [12]. In response to the shortcomings of existing joint models, Qin et al. [12] introduced a co-interactive transformer specifically designed for joint intent detection and slot filling. This transformer utilizes the bidirectional information flow between intent detection and slot filling to model the relationship between the two tasks.
However, in short utterances lacking domain-specific context, challenges emerge: ambiguous intent detection and difficulties in recognizing domain-specific entities, specific abbreviations, and easily confused entities. For example, in response to a user’s request such as “What causes high HbA1c levels in patients?”, a general-domain model may not understand the meaning of “HbA1c”, which can easily result in errors in entity recognition. In fact, “HbA1c” is a type of glycated hemoglobin in the medical field that reflects the average blood glucose level over the past 2 to 3 months. In the utterance “What preparations are needed before undergoing a BMP test?”, “BMP” refers to the basic metabolic panel, which belongs to the category of “Medical_Examination”. However, in the utterance “What is the role of BMP in bone repair?”, “BMP” refers to a bone morphogenetic protein, which belongs to the category of “Drug”. Clearly, the meaning of “BMP” differs in the two aforementioned utterances. Therefore, such terms are prone to cause errors in entity recognition without supplementary knowledge of the medical field, especially for domain-specific entities or abbreviations. When intent detection and slot filling are modelled jointly, incorrect entity recognition results can directly affect intent detection, subsequently leading to a decline in the overall performance of both tasks. Moreover, in specialized domains, current joint models for intent detection and slot filling [9,10,11] merely associate the relevant information between intents and slots. They have not proposed effective solutions to the issues that may arise in the joint process, such as domain-specific entity recognition, specific abbreviations, and easily confused entity recognition.
In order to effectively utilize domain-specific knowledge in the joint modelling of intent detection and slot filling, we propose a co-interactive model based on a knowledge graph (CIMKG) for intent detection and slot filling. To enhance the overall performance of intent detection and slot filling tasks in specific domains, the model constructs a representation of the relationship between these two tasks by integrating triples sourced from the knowledge graph into utterances while taking into account the interdependencies that exist between these two tasks.
3. CIMKG
The CIMKG consists of three main components: a knowledge graph-based shared encoder module; a co-interactive module that explicitly establishes a bidirectional connection between intent detection and slot filling; and two decoders specifically designed for the two tasks. The overall framework of the CIMKG is shown in Figure 2. In Figure 2, the red lowercase letters (a–f) denote the readable sequence, whereas the black numerals (0–9) represent the actual sequence.
This study employs K-BERT as the encoding layer of the CIMKG model. K-BERT can process a sentence tree enriched with triples imported from knowledge graphs, and it ensures the effectiveness of knowledge integration through the soft position and visible matrix techniques embedded within it. Subsequently, the co-interactive module from reference [12] is incorporated into CIMKG to enable the bidirectional flow of intent and slot information, which allows the intent information to guide slot filling while simultaneously enabling slot information to guide intent detection. The model was initialized with BERT pretrained word embeddings and optimized using the Adam algorithm. The model was evaluated on standard datasets, and the parameters or architecture design were adjusted to further optimize performance based on the evaluation results.
3.1. Knowledge Graph-Based Shared Encoder Module
The knowledge graph-based shared encoder module mainly consists of four key components: the knowledge layer, embedding layer, seeing layer, and mask transformer. First, the knowledge layer imports ontology labels related to the input utterances and transforms them into a knowledge-rich sentence tree. Second, this sentence tree is processed in parallel: it is converted into character-level embedding vectors at the embedding layer to capture deep semantic relationships, and transformed into a matrix representation at the seeing layer to control the visibility range of each token. Finally, the mask transformer integrates the embedding representations and visibility information, effectively capturing long-distance dependencies and complex patterns. The workflow of the shared encoder based on knowledge graphs is shown in Figure 3.
The primary function of this module is to integrate semantic information, such as entities and relationships, from the domain knowledge graph into the encoding process, enabling the model to grasp the meaning expressed by short texts more accurately when processing utterances in specific domains.
3.1.1. Knowledge Layer
In the knowledge layer, entities from the domain-specific knowledge graph are linked to entities in the input utterance to construct a sentence tree that is rich in semantic information. This process primarily consists of two steps: knowledge query (K-query) and knowledge injection (K-injection). For an input sentence $s = \{w_1, w_2, \dots, w_n\}$ and a given domain-specific knowledge graph $K$, the output sentence tree is represented as $t$ after undergoing knowledge query and knowledge injection. Here, $n$ denotes the number of tokens in the query, $e$ represents an entity, $o$ represents an ontology label, and $r$ represents the relationship between the intent and slot.
The knowledge query process involves retrieving knowledge triplets from the domain-specific knowledge graph $K$ that are related to all named entities in utterance $s$. This process can be formally described as shown in (1):
$$E = \mathrm{K\text{-}Query}(s, K) \tag{1}$$
where $E$ is a collection of corresponding triples related to the entities in utterance $s$.
In the knowledge-injection process, the acquired knowledge triplets $E$ are injected into the corresponding positions of the utterance $s$ to obtain a sentence tree $t$. This process can be described as shown in (2):
$$t = \mathrm{K\text{-}Inject}(s, E) \tag{2}$$
Using “What causes high HbA1c levels in patients?” as an example, the sentence tree structure formed after importing knowledge graph triplets is shown in Figure 4. Here, [CLS] is a classification token for intent detection, the red numbers represent the readable order, and the black numbers represent the actual order. The knowledge sentence after the injection of the knowledge graph is as follows.
What causes high HbA1c is glycated hemoglobin for glucose management levels in patients?
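The K-query and K-inject steps can be illustrated with a minimal Python sketch. The triple store, tokenization, and splice-in format below are simplifying assumptions made for illustration, not the paper’s actual implementation.

```python
# Minimal sketch of K-query / K-inject: look up triples for entities found
# in the utterance and splice each branch in directly after its entity.
# The knowledge graph contents and branch format are illustrative only.

KG = {
    "HbA1c": [("is", "glycated hemoglobin"), ("for", "glucose management")],
}

def k_query(tokens, kg):
    """Collect the triples attached to every token that names a known entity."""
    return {t: kg[t] for t in tokens if t in kg}

def k_inject(tokens, triples):
    """Insert each (relation, object) branch right after its entity token."""
    tree = []
    for tok in tokens:
        tree.append(tok)
        for rel, obj in triples.get(tok, ()):
            tree.extend([rel, obj])
    return tree

utterance = "What causes high HbA1c levels in patients ?".split()
tree = k_inject(utterance, k_query(utterance, KG))
print(" ".join(tree))
# -> What causes high HbA1c is glycated hemoglobin for glucose management levels in patients ?
```

This reproduces the flattened knowledge sentence above; a real sentence tree would additionally record soft positions so that the injected branch does not disturb the trunk’s order.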
3.1.2. Embedding Layer
The embedding layer converts the sentence tree into the embedding representation required by the mask transformer. Although the injection of ontology labels enriches the semantics of the sentence tree, it simultaneously alters the structure of the input utterance, rendering it unreadable. The positional embeddings in the original BERT primarily provide the model with the ability to recognize the positions of individual tokens within the input sequence, ensuring that the model can correctly understand the sequential information of the tokens in the utterance. However, these positional embeddings do not address the problem of sentence tree unreadability. To address the problem of sentence structure changes caused by the introduction of external knowledge, the K-BERT model employs a soft position and a visible matrix.
Soft position embedding works by assigning each token a dynamically adjusted position index, which is set according to the actual semantics and is reflected through the self-attention scores in the mask transformer to indicate the relative positional relationships between tokens, rather than their absolute physical positions. Considering the sentence tree in Figure 4 as an example, the vector representations obtained after the text embedding layer are shown in Figure 5. In Figure 5, the “+” sign denotes the summation of different types of embedding vectors, and A is used as an identifier to mark sentence segments.
3.1.3. Seeing Layer
Injecting the triplets from the knowledge graph into the original utterance may lead to distortions in the semantics of the utterance without proper constraints. The seeing layer processes the structured information within the sentence tree and generates a visible matrix, which is used to control the visibility scope of each token in the sentence tree. Taking the sentence tree in Figure 4 as an example, “glycated hemoglobin” is only used to explain “HbA1c” and is not related to “levels”. Therefore, “glycated hemoglobin” is visible to “HbA1c” but invisible to “levels”. The entity visible matrix corresponding to the sentence tree in Figure 4 is shown in Figure 6.
The entity visible matrix $M$ can be defined as shown in (3):
$$M_{ij} = \begin{cases} 0, & w_i \text{ is related to } w_j \\ -\infty, & w_i \text{ is unrelated to } w_j \end{cases} \tag{3}$$
where $M_{ij} = 0$ indicates that $w_i$ is related to $w_j$, and $M_{ij} = -\infty$ indicates that $w_i$ is unrelated to $w_j$. In this entity visible matrix, the element in row 6 and column 4 is 0, indicating that “glycated hemoglobin” is related to “HbA1c” in the sentence tree. The element in row 6 and column 9 is $-\infty$, indicating that “glycated hemoglobin” is not related to “levels” in the sentence tree.
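The seeing-layer rule can be sketched as follows: a knowledge branch is visible to its own entity and to itself, while trunk tokens see only the trunk. The token layout below (which positions are entity, branch, or trunk) is a toy assumption, not the actual Figure 6 indexing.

```python
import math

NEG_INF = -math.inf

def visible_matrix(n_tokens, branches):
    """Build an entity visible matrix M for a flattened sentence tree.

    branches maps an entity position to the positions of the knowledge
    tokens injected for it. M[i][j] = 0 means token i may attend to
    token j; -inf means token j is masked out for token i.
    """
    branch_positions = {p for ps in branches.values() for p in ps}
    M = [[NEG_INF] * n_tokens for _ in range(n_tokens)]
    for i in range(n_tokens):
        for j in range(n_tokens):
            if i not in branch_positions and j not in branch_positions:
                M[i][j] = 0  # trunk tokens are mutually visible
    for ent, ps in branches.items():
        group = [ent] + list(ps)
        for i in group:
            for j in group:
                M[i][j] = 0  # a branch sees its entity and itself
    return M

# Toy layout: tokens 0-4 form the trunk, token 1 is the entity,
# and tokens 5-6 are the knowledge branch injected for it.
M = visible_matrix(7, {1: [5, 6]})
```

With this layout, `M[5][1]` is 0 (the branch sees its entity) while `M[5][2]` is $-\infty$ (the branch is invisible to an unrelated trunk token), mirroring the “glycated hemoglobin”/“levels” example above.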
3.1.4. Mask Transformer Layer
The mask transformer is primarily responsible for encoding the embedding representations of the sentence tree together with the entity visible matrix $M$. In contrast, the transformer layer in the standard BERT model can only encode the embedding representations and cannot process the visible matrix. Therefore, by modifying the relevance calculation function within the self-attention mechanism of the original transformer layer, the visible range of each token can be restricted based on the entity visible matrix $M$, thereby preventing the semantic information from being altered by the injection of external knowledge. Specifically, the relevance scores are combined with the entity visible matrix $M$, and the final relevance degrees are calculated using the softmax function.
The original self-attention parameters are kept unchanged, as shown in (4):
$$Q^{i+1} = h^{i} W_{q}, \quad K^{i+1} = h^{i} W_{k}, \quad V^{i+1} = h^{i} W_{v} \tag{4}$$
where $W_{q}$, $W_{k}$, and $W_{v}$ are trainable parameters; $h^{i}$ represents the hidden layer output of the $i$-th layer; $Q^{i+1}$ denotes the query vector of the $(i+1)$-th layer; $K^{i+1}$ represents the key vector of the $(i+1)$-th layer; and $V^{i+1}$ is the value vector of the $(i+1)$-th layer.
Subsequently, the softmax function is modified by injecting the entity visible matrix $M$ as shown in (5):
$$S^{i+1} = \operatorname{softmax}\!\left(\frac{Q^{i+1} \left(K^{i+1}\right)^{\top} + M}{\sqrt{d_{k}}}\right) \tag{5}$$
where $M$ represents the entity visible matrix, $S^{i+1}$ denotes the relevance score of the $(i+1)$-th layer, and $d_{k}$ denotes a predefined scaling factor in the model. Finally, the hidden layer output for the next layer is updated as shown in (6):
$$h^{i+1} = S^{i+1} V^{i+1} \tag{6}$$
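The masked self-attention step can be sketched in plain Python. The matrices below are toy values and the learned projections are omitted; the point is only that adding $-\infty$ from the visible matrix before the softmax drives the corresponding attention weight to zero.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def masked_attention(Q, K, V, M, d_k):
    """S = softmax((Q K^T + M) / sqrt(d_k)); output = S V."""
    n = len(Q)
    out = []
    for i in range(n):
        scores = []
        for j in range(n):
            dot = sum(q * k for q, k in zip(Q[i], K[j]))
            scores.append((dot + M[i][j]) / math.sqrt(d_k))
        w = softmax(scores)
        out.append([sum(w[j] * V[j][c] for j in range(n))
                    for c in range(len(V[0]))])
    return out

# Toy 2-token example: token 0 is forbidden from seeing token 1.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
M = [[0.0, -math.inf], [0.0, 0.0]]
out = masked_attention(Q, K, V, M, 2)
```

Because `M[0][1]` is $-\infty$, token 0’s output is exactly `V[0]`: the masked position contributes nothing, which is how injected knowledge is prevented from leaking into unrelated trunk tokens.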
3.2. Co-Interactive Module
The co-interactive module aims to establish a bidirectional connection between intent detection and slot filling and consists of three key components: an intent and slot label attention layer, a co-interactive attention layer, and an extended feed-forward network layer [12].
First, the intent and slot label attention layers are used to obtain slot and intent representations. Second, the co-interactive attention layer replaces traditional self-attention, explicitly establishing a relationship between intent and slots. Finally, the extended feed-forward network layer implicitly combines intent and slot information. This process captures and integrates intent and slot information through both explicit and implicit methods, thereby establishing a bidirectional connection between intent and slot information.
3.2.1. Intent and Slot Label Attention Layer
The intent and slot label attention layer uses a label attention mechanism to compute the label attention for the hidden state of each token in the input sequence, thereby obtaining explicit representations of the intent and slots.
Specifically, unlike the hidden states $H$ of the input sequence used in [12], the hidden states $H$ obtained after the knowledge graph-based shared encoding layer not only include the token ids and seg-ids for each token in the input sequence but also include the visible mask information. The hidden states $H \in \mathbb{R}^{n \times d}$ (where $n$ represents the number of tokens in the input sequence and $d$ represents the hidden layer dimension) are used as queries, whereas the label embedding matrix $W^{l} \in \mathbb{R}^{n_{l} \times d}$ (where $l \in \{I, S\}$, $I$ represents intent, $S$ represents slot, and $n_{l}$ represents the number of labels for either intent or slot) serves as key and value. The hidden state $H$ is dotted with the label embedding matrix $W^{l}$, and the result is normalized using the softmax function to obtain the attention weight matrix $A^{l}$ as shown in (7):
$$A^{l} = \operatorname{softmax}\!\left(H \left(W^{l}\right)^{\top}\right) \tag{7}$$
By performing a linear combination of the attention weight matrix $A^{l}$ and the label embedding matrix $W^{l}$, the original hidden state $H$ is updated to obtain the final intent or slot representation $H^{l}$ as shown in (8):
$$H^{l} = A^{l} W^{l} + H \tag{8}$$
In particular, the label embedding matrix is represented by the parameters of the fully connected slot filling decoder layer and intent detection decoder layer. This means that the intent detection embedding matrix $W^{I}$ and the slot filling embedding matrix $W^{S}$ are actually the weight matrices of the decoder layers ($n_{I}$ and $n_{S}$ represent the number of intent and slot labels, respectively). Through the above calculation, the intent representation $H_{I}$ and slot representation $H_{S}$, which capture semantic information, are obtained.
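A minimal sketch of the label attention step follows. The toy values of the hidden states and label embeddings, and the residual form of the update, are assumptions chosen to match the description above (the hidden state is “updated” by the attention-weighted label embeddings).

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def label_attention(H, W):
    """Per token: A = softmax(h . W^T) over labels, then h + A W.

    H is an n x d list of hidden states; W is an l x d list of label
    embeddings. Returns the updated n x d representations.
    """
    out = []
    for h in H:
        scores = [sum(a * b for a, b in zip(h, w)) for w in W]
        A = softmax(scores)  # attention weight over the l labels
        mixed = [sum(A[k] * W[k][c] for k in range(len(W)))
                 for c in range(len(h))]
        out.append([x + m for x, m in zip(h, mixed)])
    return out

H = [[2.0, 0.0], [0.0, 2.0]]   # two tokens, d = 2 (toy values)
W = [[1.0, 0.0], [0.0, 1.0]]   # two labels' embeddings (toy values)
H_l = label_attention(H, W)
```

Each token is pulled toward the embedding of the label it most resembles, which is what makes the resulting intent/slot representations explicit rather than implicit.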
3.2.2. Co-Interactive Attention Layer
The co-interactive attention layer explicitly establishes a bidirectional relationship between the intent detection and slot filling tasks, enabling them to interact with each other so that the slot representation is complemented by intent information and the intent representation is enhanced by slot information.
Specifically, the intent representation $H_{I}$ and the slot representation $H_{S}$, obtained from the intent and slot label attention layers, are mapped through different linear transformations to form query ($Q$), key ($K$), and value ($V$) matrices. Using $Q_{I}$ as the query, $K_{S}$ as the key, and $V_{S}$ as the value, the dot product of the query vector $Q_{I}$ and the transpose $K_{S}^{\top}$ of the slot key vector $K_{S}$ is computed, scaled, and then passed through the softmax function. The result is multiplied by the slot value vector $V_{S}$ to obtain the intent attention vector $C_{I}$ containing slot information as shown in (9):
$$C_{I} = \operatorname{softmax}\!\left(\frac{Q_{I} K_{S}^{\top}}{\sqrt{d}}\right) V_{S} \tag{9}$$
The updated intent representation $H_{I}'$, which incorporates slot information, is obtained by adding $C_{I}$ to the original intent representation $H_{I}$ and then applying layer normalization as shown in (10):
$$H_{I}' = \operatorname{LN}\left(H_{I} + C_{I}\right) \tag{10}$$
where LN denotes the layer normalization function.
By aligning the intent with its corresponding slot information, a representation of the intent that contains the corresponding slot information is obtained. The degree of association between the intent and slot is measured using the attention weights, and the intent representation is adjusted to obtain the intent representation most relevant to the slot information.
Similarly, using $Q_{S}$ as the query, $K_{I}$ as the key, and $V_{I}$ as the value, the slot representation $H_{S}'$ containing intent information can be obtained.
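The cross-attention plus residual layer normalization described above can be sketched as follows. Identity Q/K/V projections and toy values stand in for the trained linear transformations, so this is a structural sketch, not the trained module.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def layer_norm(v, eps=1e-6):
    """Normalize a vector to zero mean and unit variance."""
    mu = sum(v) / len(v)
    var = sum((x - mu) ** 2 for x in v) / len(v)
    return [(x - mu) / math.sqrt(var + eps) for x in v]

def co_attend(H_q, H_kv, d):
    """C = softmax(H_q H_kv^T / sqrt(d)) H_kv, then LN(H_q + C).

    H_q attends over H_kv, so each query-side representation is
    enriched with information from the other task's representations.
    """
    out = []
    for q in H_q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in H_kv]
        w = softmax(scores)
        c = [sum(w[j] * H_kv[j][t] for j in range(len(H_kv)))
             for t in range(d)]
        out.append(layer_norm([a + b for a, b in zip(q, c)]))
    return out

H_I = [[1.0, 0.0], [0.0, 1.0]]   # intent representations (toy values)
H_S = [[0.0, 1.0], [1.0, 0.0]]   # slot representations (toy values)
H_I_updated = co_attend(H_I, H_S, 2)   # intent enriched with slot info
```

Running `co_attend(H_S, H_I, 2)` gives the symmetric slot-to-intent direction, which together with the call above forms the bidirectional information flow.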
3.2.3. Extended Feed-Forward Network Layer
The feed-forward network layer is extended to implicitly fuse the intent representation $H_{I}'$ with the slot representation $H_{S}'$ to form a new representation $H^{IS}$ that contains both intent and slot information as shown in (11):
$$H^{IS} = H_{I}' \oplus H_{S}' \tag{11}$$
where $H^{IS} \in \mathbb{R}^{n \times 2d}$ and ⊕ represents concatenation.
To better understand the meaning expressed by each token, the features of adjacent tokens are combined to form a new feature vector $h_{t}^{f}$ [6] containing contextual information, which is represented as shown in (12):
$$h_{t}^{f} = h_{t-1}^{IS} \oplus h_{t}^{IS} \oplus h_{t+1}^{IS} \tag{12}$$
Finally, the FFN layer fuses the intent information with the slot information to obtain the results shown in (13)–(15), where $H_{I}''$ and $H_{S}''$ are the intent and slot representations, respectively, which are obtained through the fusion of information at the FFN layer and subsequent layer normalization.
3.3. Decoder for Slot Filling and Intent Detection
To ensure sufficient information interaction between intent detection and slot filling, reference [12] employed a multi-layer stacked co-interactive attention network. After $L$ layers of stacking, the updated intent representation $H_{I}^{L}$ and slot representation $H_{S}^{L}$ are obtained as shown in (16) and (17):
The max-pooling operation [30] is applied to $H_{I}^{L}$ to obtain the representation $c$ of the entire sentence, which is passed as an input to a fully connected layer. The probability distribution of intent detection $y^{I}$ is computed using the softmax function, and the index corresponding to the maximum value in the distribution is selected as the intent label $o^{I}$ as shown in (18) and (19):
$$y^{I} = \operatorname{softmax}\!\left(W^{I} c + b^{I}\right) \tag{18}$$
$$o^{I} = \arg\max\left(y^{I}\right) \tag{19}$$
where $W^{I}$ and $b^{I}$ represent model training parameters.
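The intent decoder, max-pooling over token states followed by a softmax-classified fully connected layer, can be sketched as below; the weights are toy stand-ins for the trained parameters.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def intent_decode(H, W, b):
    """Max-pool token states into a sentence vector c, then softmax(W c + b).

    Returns the probability distribution over intents and the argmax index.
    """
    d = len(H[0])
    c = [max(h[j] for h in H) for j in range(d)]   # element-wise max-pooling
    logits = [sum(w_i * c_i for w_i, c_i in zip(row, c)) + b_k
              for row, b_k in zip(W, b)]
    y = softmax(logits)
    return y, max(range(len(y)), key=y.__getitem__)

H = [[0.2, 1.0], [0.9, 0.1]]   # two token states, d = 2 (toy values)
W = [[1.0, 0.0], [0.0, 1.0]]   # two intent classes (toy weights)
y, intent = intent_decode(H, W, [0.0, 0.0])
```

Here the pooled sentence vector is `[0.9, 1.0]`, so the second intent class receives the higher probability.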
A linear transformation is applied to $H_{S}^{L}$, and a standard CRF layer [31] is used to model the dependencies between labels as shown in (20) and (21):
$$O^{S} = W^{S} H_{S}^{L} + b^{S} \tag{20}$$
$$s(\hat{y}) = \sum_{i=1}^{n} \left( O_{i,\hat{y}_{i}}^{S} + t_{\hat{y}_{i},\hat{y}_{i+1}} \right) \tag{21}$$
where $t_{\hat{y}_{i},\hat{y}_{i+1}}$ represents the transition score from $\hat{y}_{i}$ to $\hat{y}_{i+1}$, and $\hat{y}$ represents the predicted label sequence.
4. Experimental Results and Analysis
4.1. Experimental Data
This paper conducts experimental analysis using two datasets from the medical field and the construction safety field.
The IMCS21 dataset [32] is a benchmark dataset designed for automatic medical consultation systems, consisting of a total of 4116 physician–patient dialogue case samples. It includes 2472 dialogues in the training set, 833 dialogues in the validation set, and 811 dialogues in the test set. The IMCS21 dataset covers 10 pediatric diseases and encompasses five types of named entities and 16 dialogue intents. The complexity of the IMCS21 dataset lies primarily in its normalization of entities such as symptoms to address synonyms or different expressions. More than 1900 common symptoms have been standardized into 444 canonical names, facilitating the model’s ability to understand similar expressions and reducing ambiguities. To ensure data quality, samples with incomplete information or too few dialogue turns were excluded from the original data in this paper. MedicalKG is a knowledge graph in the medical domain constructed by Liu et al. [29]. It contains four types of named entities, namely symptoms, diseases, parts, and treatments, and comprises 13,864 triples.
The dataset used in this paper for the field of construction safety is self-constructed. The construction safety accident knowledge graph comprises a total of 1156 entities, 2200 relations, and 45,772 knowledge triples. This knowledge graph contains six categories of entities and five types of relationships. Based on the professional knowledge stored in the construction safety knowledge graph and the practical application scenarios provided by practitioners, a total of 2648 utterances related to construction safety were constructed under the guidance of experts. This utterance dataset includes five categories of intents.
4.2. Model Training
The experiment adopts the Adam optimizer to tune the parameters of the CIMKG model. The hyperparameter settings are shown in Table 1. In Table 1, “batch size” refers to the number of samples processed in a single training iteration. “K-BERT encoder hidden units” denotes the number of hidden units in the knowledge graph-based shared encoder. “co-interactive module hidden units” indicates the number of hidden units in the co-interactive module. “co-interactive module number” represents the number of co-interactive modules. “attention_dropout” is the dropout rate for the attention layer parameters, and “learning rate” is the learning rate of the model.
The CIMKG model was trained using the hyperparameters listed in Table 1. The changes in the loss during the training process are shown in Figure 7.
The “Loss vs. Step” figure on the left shows the loss values of the model at each training iteration. The loss during the initial stage of training is high; however, as training progresses, the loss decreases rapidly and stabilizes at a certain threshold, and the curves eventually flatten out, indicating that the model training tends towards stability. The “Average Loss per Epoch” figure on the right presents the average loss values for each training epoch. As illustrated in the figure, the average loss declines sharply within the first few epochs and subsequently slows to a gradual, smooth decreasing trend. The stable low loss at the end of training suggests that the model effectively fitted the training data.
4.3. Experimental Result Analysis
To evaluate the effectiveness of the CIMKG model quantitatively, the F1 score was used to assess the slot filling performance of the model. Accuracy was used to evaluate the intent detection capability of the model. Furthermore, the overall accuracy was adopted to measure sentence-level semantic frame parsing, which represents the percentage of utterances in which both the intent labels and slot labels are correctly identified out of the total number of samples.
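The overall accuracy metric described above reduces to an exact-match count over utterances. The labels in this sketch are illustrative; it only demonstrates the counting rule, not the paper’s evaluation code.

```python
def overall_accuracy(intent_gold, intent_pred, slot_gold, slot_pred):
    """Fraction of utterances whose intent AND full slot sequence are correct."""
    hits = sum(
        1
        for ig, ip, sg, sp in zip(intent_gold, intent_pred, slot_gold, slot_pred)
        if ig == ip and sg == sp
    )
    return hits / len(intent_gold)

# Two toy utterances: the second has the right intent but a wrong slot tag,
# so it does not count towards overall accuracy.
intent_gold = ["Request-Etiology", "Request-Drug"]
intent_pred = ["Request-Etiology", "Request-Drug"]
slot_gold = [["B-Check", "O"], ["B-Drug", "O"]]
slot_pred = [["B-Check", "O"], ["O", "O"]]
acc = overall_accuracy(intent_gold, intent_pred, slot_gold, slot_pred)  # 0.5
```

This is why overall accuracy is always bounded above by both intent accuracy and sentence-level slot correctness, a point the paper returns to in its conclusions.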
Additionally, to fully demonstrate the effectiveness of the CIMKG model, Joint BERT and BERT-DCA-Net [12] were introduced as baseline models to compare their performance with that of the CIMKG model in domain-specific intent detection and slot filling. The experimental results are listed in Table 2.
As shown in Table 2, the proposed CIMKG model outperformed the baseline models in terms of F1 score, accuracy, and overall accuracy. Specifically, not only do the individual metrics of F1 score and accuracy outperform those of the baseline models, but the overall accuracy is also higher. This indicates that the CIMKG model is capable of achieving a deeper understanding of semantic information in domain-specific utterances, thereby further enhancing the overall performance of the joint task of intent detection and slot filling.
Through a comparison between the Joint BERT and CIMKG models, it can be observed that the overall performance of the CIMKG model in processing short utterances within specialized domains surpasses that of Joint BERT. There are primarily two reasons for this superiority. Firstly, the CIMKG model enhances utterance representation by incorporating domain-specific knowledge into the utterance through the integration of a knowledge graph. According to the Slot (F1) results in Table 2, the experimental outcomes demonstrate that the CIMKG model outperforms Joint BERT in both the medical and construction safety domains. This indicates that the injection of knowledge graph triples into utterances can effectively address the challenges of identifying specialized or easily confused terminology within specialized domains. Secondly, the CIMKG model adopts a collaborative interactive approach for simultaneous bidirectional communication and guidance between intent detection and slot filling, which is superior to the method employed by Joint BERT, which relies solely on parameter sharing to implicitly learn the correlation between the two tasks. While slots are being identified, the results of slot detection also serve as guiding information for the intent recognition process; similarly, the results of intent detection influence slot filling. Similar conclusions can be drawn from the comparison of experimental results between Joint BERT and BERT-DCA-Net in Table 2. Although both BERT-DCA-Net and CIMKG adopt a collaborative interactive mode for joint intent detection and slot filling, the incorporation of a knowledge graph enables the CIMKG model to better enhance the overall performance of intent detection and slot filling in specialized domains. Therefore, the CIMKG model outperforms BERT-DCA-Net in terms of Slot (F1), Intent (Accuracy), and Overall (Accuracy).
To further validate the performance of the proposed CIMKG, we conducted ablation experiments, and the results are shown in Table 3.
By removing the intent attention layer from the CIMKG, we obtained a new model named “CIMKG without intent attention layer”, in which the original hidden states are utilized to replace the hidden states that would have been processed by the intent attention layer. Similarly, another model, referred to as “CIMKG without slot attention layer”, was obtained by solely removing the slot attention layer. As shown in Table 3, compared with the CIMKG, the performance of slot filling and intent detection in these two models decreased, indicating that explicit intent and slot representations are crucial for the co-interactive layer between the two tasks. Subsequently, to verify the role of bidirectional information flow, only one direction of the information flow was retained. The models with only one direction of information flow were named “CIMKG with intent-to-slot” and “CIMKG with slot-to-intent”. As shown in Table 3, the CIMKG performed better with a bidirectional information flow configuration. The ablation experimental results indicate that modelling the interaction between intent detection and slot filling can enhance the performance of both tasks, further demonstrating the strong correlation between intent detection and slot filling.
5. Conclusions and Future Work
This study proposes the CIMKG, which jointly models intent detection and slot filling based on a knowledge graph. The CIMKG addresses the underperformance of general pretrained language models in intent detection and slot filling within specific domains. Specifically, it not only employs a knowledge graph-based encoder module to tackle the knowledge noise that may arise from the introduction of domain-specific expertise but also utilizes a co-interactive module to enable a bidirectional flow of intent and slot information. This fully exploits the interrelationship between intents and slots, thereby enhancing the overall performance of intent detection and slot filling in specific domains. Experiments were conducted using knowledge graphs in both the medical and construction safety domains and related user question–answer data. The experimental results demonstrate that CIMKG effectively addresses the problems of entity recognition and intent detection errors caused by insufficient prior knowledge in both the medical and construction safety domains. Furthermore, CIMKG significantly improves intent detection and slot filling performance. However, the IMCS21 dataset used in this study encompasses 16 categories of intents, which renders the intent detection task relatively complex. Despite the adoption of the softmax function to predict the probability distribution across the various categories, multi-category classification itself exacerbates the difficulty of distinguishing between different intents, consequently leading to a relatively low Intent (Acc). Furthermore, Overall (Acc) reflects the scenario in which both the intents and slots are correctly predicted. Hence, it not only relies on the accuracy of intent detection but is also influenced by the outcome of slot filling. Consequently, the overall experimental results may not be as high as anticipated.
In addition, it should be noted that this study focused only on single-intent scenarios. Future research will build upon this foundation to investigate and improve the model’s ability to handle multi-intent scenarios. Specifically, a dedicated multi-intent processing module will be designed on the basis of the CIMKG. This module will be capable of identifying multiple intents within an utterance and processing each intent independently, thereby effectively transforming them into single-intent issues.