Essentially, the classification layer for the two tasks involves two separate classifiers, and each classifier outputs the probabilities of its labels. For the baseline [3], whose classification layer is based on JointBERT [24], the intent classifier is composed sequentially of a max-pooling layer, a dropout layer with a dropout rate of $p$, a linear layer with $d_f$ dimensions, the GELU activation function, another dropout layer with the same dropout rate, and a final linear layer with $N_I$ dimensions (the number of intent labels). The slot classifier differs in that the max-pooling layer is not used and the final linear layer has $N_S$ dimensions (the number of slot labels) instead. In the zero-shot case, the feed-forward dropout rate $p$ was set to 0.25, and the number of linear dimensions $d_f$ was set to 8192.
The intent label is unique for each sentence, so the output shape of the intent classifier should be $(B, 1, N_I)$, where $B$ is the batch size. In consideration of this output shape, adding a max-pooling layer to the intent classifier reduces the sequence dimension of the input features to 1. On the other hand, the slot-filling task computes the slot probability of each word, with an output shape of $(B, L, N_S)$, where $L$ is the sequence length, which is why the pooling reduction is not needed in the slot classifier.
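To make this layer sequence concrete, the following PyTorch sketch shows one way such a pair of classifier heads could be wired up; the class name, the default sizes, and the `hidden_states` argument are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class IntentSlotHeads(nn.Module):
    """Illustrative intent/slot classification heads in the JointBERT style.
    All names and default sizes are assumptions for this sketch."""
    def __init__(self, hidden_size=768, d_f=8192, n_intents=60, n_slots=110, p=0.25):
        super().__init__()
        # Intent head: max-pool over the sequence, then dropout -> linear -> GELU -> dropout -> linear.
        self.intent_head = nn.Sequential(
            nn.Dropout(p), nn.Linear(hidden_size, d_f), nn.GELU(),
            nn.Dropout(p), nn.Linear(d_f, n_intents),
        )
        # Slot head: same stack, applied per token, without the pooling step.
        self.slot_head = nn.Sequential(
            nn.Dropout(p), nn.Linear(hidden_size, d_f), nn.GELU(),
            nn.Dropout(p), nn.Linear(d_f, n_slots),
        )

    def forward(self, hidden_states):                              # (B, L, H) encoder output
        pooled = hidden_states.max(dim=1, keepdim=True).values    # (B, 1, H)
        intent_logits = self.intent_head(pooled)                  # (B, 1, N_I)
        slot_logits = self.slot_head(hidden_states)               # (B, L, N_S)
        return intent_logits, slot_logits
```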
3.2.1. Hierarchical Classifiers
Given an intent label, the probabilities of some slot labels in a sentence would be fairly high, while others might be very low. For instance, if the utterance intent is “transport_taxi”, slot labels such as “transport_agency”, “currency_name”, “date”, “place_name”, and “transport_descriptor” are more likely to be filled by words in the sentence, whereas filling “player_setting”, “music_album”, “music_descriptor”, and “audiobook_author” is almost impossible.
Generally, detecting the intent of an utterance is much easier than filling all the slots correctly. This situation gave us the idea that intent probabilities could be used to compute the likelihood of slot labels. Based on this idea, we present the classification layer with a hierarchical architecture in this section.
As Figure 1 and Figure 2 show, the hidden features from the encoder, with the shape $(B, L, H)$, are fed to the intent classifier. We denote them by $h = (h_1, h_2, \ldots, h_L)$, in which $h_i \in \mathbb{R}^{H}$. Max pooling reduces the shape of the hidden features to $(B, 1, H)$ by retaining the maximum in each feature dimension. Let the output of max pooling be $h^{\max}$; thus, we have each element $h^{\max}_j$ in $h^{\max}$:

$$h^{\max}_j = \max_{1 \le i \le L} h_{i,j}, \quad j = 1, \ldots, H \qquad (1)$$
By feeding the max-pooling output to the other layers, the classifier can compute the intent probabilities by applying further dropout and linear layers. The output $h_I$ of the final linear layer is provided by the following equation:

$$h_I = x W_I^{\top} + b_I \qquad (2)$$

The final linear layer of the intent classifier condenses the features to $N_I$ heads, and the dimensions of the dropout layer output correspond to $d_f$. Therefore, $x$ represents the output of the last dropout layer, $W_I \in \mathbb{R}^{N_I \times d_f}$ denotes the weights of the linear layer, and $b_I \in \mathbb{R}^{N_I}$ denotes the biases. Here, we have the output of the intent classifier with a shape of $(B, 1, N_I)$.
In the next step, $h_I$ will be used to update the hidden features before applying slot classification. There are more than two ways to update the hidden features, but we explored two in this paper:
- (1)
The first is concatenation, as shown in Figure 1. The intent logits are concatenated with the output of the encoder; this is the simplest method of updating;
- (2)
The second is attention-based computation, which updates the initial slot logits by computing intent and slot attention probabilities; this is another effective and smooth way of updating the hidden features. The architecture is shown in Figure 2.
Hier-concatenation: We have the output of the encoder layer, the hidden features $h$, and the output of the intent classifier, the intent logits $h_I$, but the shapes of the two matrices are not yet matched for concatenation. Let $h = (h_1, h_2, \ldots, h_L)$ denote the hidden features, where $h_i \in \mathbb{R}^{H}$ represents a feature vector for each word.
Following the basic idea behind the relationship between intent and slots, the intent logits affect the slot-filling task of every word. Next, we expand $h_I$ to a sequence of expanded features $h_I^{\mathrm{exp}}$ with a length of $L$:

$$h_I^{\mathrm{exp}} = (h_{I,1}^{\mathrm{exp}}, h_{I,2}^{\mathrm{exp}}, \ldots, h_{I,L}^{\mathrm{exp}}) \qquad (3)$$

where $h_{I,i}^{\mathrm{exp}} = h_I$, for $i = 1, 2, \ldots, L$.
Now, $h$ and $h_I^{\mathrm{exp}}$ can be concatenated as a new sequence of hidden features, leading to Formula (4), which will serve as the input of the slot classifier:

$$h' = (h_1 \oplus h_I,\ h_2 \oplus h_I,\ \ldots,\ h_L \oplus h_I) \qquad (4)$$

where $h'_i = h_i \oplus h_I \in \mathbb{R}^{H + N_I}$ and $\oplus$ denotes vector concatenation.
The output of the first linear layer in the slot classifier is as follows:

$$z_i = x'_i W_S^{\top} + b_S \qquad (5)$$

Before the linear layer is applied, a dropout layer randomly sets some values of $h'_i$ to 0 with probability $p$ and scales the others by a factor of $\frac{1}{1-p}$, yielding $x'_i$. The shape of the dropout output is the same as that of $h'$, and we use $d_f$ to represent the output dimension of the linear layer; thus, $W_S \in \mathbb{R}^{d_f \times (H + N_I)}$ and $b_S \in \mathbb{R}^{d_f}$. So, the output of the first linear layer of the slot classifier has the shape $(B, L, d_f)$.
We set the head of the final layer to $N_S$; thus, the slot classifier finally outputs a $(B, L, N_S)$ matrix for the prediction and loss computation.
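A minimal PyTorch sketch of the hier-concatenation path is given below, assuming a `slot_head` stack like the one described above; all tensor names and the example sizes are illustrative.

```python
import torch
import torch.nn as nn

def hier_concat_slot_logits(hidden, intent_logits, slot_head):
    """Hier-concatenation sketch: broadcast the intent logits along the sequence
    and concatenate them with the encoder features before the slot classifier.
    Tensor names and the `slot_head` module are assumptions."""
    B, L, H = hidden.shape                          # (B, L, H) encoder output
    expanded = intent_logits.expand(B, L, -1)       # Eq. (3): (B, 1, N_I) -> (B, L, N_I)
    fused = torch.cat([hidden, expanded], dim=-1)   # Eq. (4): (B, L, H + N_I)
    return slot_head(fused)                         # (B, L, N_S) slot logits

# Example usage with illustrative sizes:
slot_head = nn.Sequential(
    nn.Dropout(0.25), nn.Linear(768 + 60, 8192), nn.GELU(),
    nn.Dropout(0.25), nn.Linear(8192, 110),
)
hidden = torch.randn(2, 16, 768)
intent_logits = torch.randn(2, 1, 60)
slot_logits = hier_concat_slot_logits(hidden, intent_logits, slot_head)  # (2, 16, 110)
```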
Hier-attention-based computation: Updating the slot logits by computing the attention-based scores requires more operations than pure concatenation. We established a sequential network to compute the likelihood of the slot labels given the intent logits. The sequence of the network is as follows (a brief code sketch is given after the list):
A dropout layer: dropout probability = $p$;
A linear layer: input dimensions $N_I$ and output dimensions $d_f$;
An activation function: GELU;
A dropout layer: dropout probability = $p$;
A linear layer: input dimensions $d_f$ and output dimensions $N_S$;
A layer normalization procedure (with a small eps constant).
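A possible PyTorch sketch of this transition stack follows; the factory function name, the default sizes, and the `eps` value are assumptions.

```python
import torch.nn as nn

# Sketch of the intent-to-slot transition network g (names, sizes, and eps are assumptions).
def make_transition_net(n_in, n_out, d_f=8192, p=0.25, eps=1e-12):
    return nn.Sequential(
        nn.Dropout(p),
        nn.Linear(n_in, d_f),
        nn.GELU(),
        nn.Dropout(p),
        nn.Linear(d_f, n_out),
        nn.LayerNorm(n_out, eps=eps),
    )

g = make_transition_net(n_in=60, n_out=110)   # N_I -> N_S, illustrative sizes
```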
Let the function $g(\cdot)$ denote the computations in the network above, to which the broadcasted intent logits $h_I^{\mathrm{exp}}$ (as in Equation (3)) are fed; thus, we have

$$T_{I \to S} = g(h_I^{\mathrm{exp}}) \qquad (6)$$

Here, $h_I^{\mathrm{exp}}$ has the same shape as in the concatenation. Since the output dimension of the last linear layer is $N_S$, we have $T_{I \to S} \in \mathbb{R}^{B \times L \times N_S}$, which is the same size as the initial slot logits $h_S$. In the next step, we compute the dot product of $h_S$ and $T_{I \to S}$, scale it by a factor of $\frac{1}{\sqrt{N_S}}$, and then apply the softmax function to obtain the attention probabilities $A$:

$$A = \mathrm{softmax}\!\left(\frac{h_S\, T_{I \to S}^{\top}}{\sqrt{N_S}}\right) \qquad (7)$$
There are two popular ways of using attention probabilities: additive attention [33] and dot-product (multiplicative) attention. We used dot-product attention in this case because it is faster and more space-efficient in practice [31]. Thus, we obtained the contextual likelihood of the slot labels $C_S$:

$$C_S = A \, T_{I \to S} \qquad (8)$$
Furthermore, we wished to update the slot logits smoothly by applying a smoothing activation, denoted $\sigma(\cdot)$, to $C_S$. Finally, the new slot logits $h_S'$ are provided by the following equation:

$$h_S' = h_S + \sigma(C_S) \qquad (9)$$
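Putting Equations (6) to (9) together, a sketch of the attention-based update could look as follows, with `g` being the transition network sketched above; the residual activation `act` is an assumption, since the exact smoothing function is written generically as $\sigma(\cdot)$ above.

```python
import math
import torch

def hier_attention_update(slot_logits, intent_logits, g, act=torch.tanh):
    """Attention-based hierarchical update (sketch). `g` is the transition
    network above; the residual activation `act` is an assumption."""
    B, L, n_slots = slot_logits.shape
    expanded = intent_logits.expand(B, L, -1)                              # (B, L, N_I), Eq. (3)
    trans = g(expanded)                                                    # (B, L, N_S), Eq. (6)
    scores = slot_logits @ trans.transpose(1, 2) / math.sqrt(n_slots)      # (B, L, L)
    attn = torch.softmax(scores, dim=-1)                                   # Eq. (7)
    context = attn @ trans                                                 # (B, L, N_S), Eq. (8)
    return slot_logits + act(context)                                      # Eq. (9), residual update
```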
3.2.2. Bidirectional Classifiers
Not only do the intent labels affect slot filling, but the slot results could also correct a wrong intent prediction for the utterance if the model has predicted all the slots correctly. A bidirectional classification layer uses both the intent and slot logits to update the hidden features.
From the intent logits to the slot logits, the way of updating the hidden features is the same as that used in the hierarchical classifiers. In the other direction, we let the slot logits update the intent features after the max-pooling layer; we can use the bi-concatenation (Figure 3) and bi-attention-based computation (Figure 4) structures to achieve this goal, but the algorithm needs to be slightly modified.
Bi-concatenation: The second dimension of the slot probabilities $h_S$ is $L$, while that of the intent features is 1, which prevents direct concatenation to construct new intent features. We introduce two ways of extracting the information in $h_S$ before concatenating it with the pooled intent features $h^{\max}$:
- (1)
Apply a max-pooling layer to the slot probabilities $h_S$;
- (2)
Apply an LSTM layer, inputting $h_S$ as a sequence of word slot probability vectors with $L$ time steps; the output of the final LSTM time step will summarize all the slot information.
Max pooling: The shape of the slot probabilities $h_S$ is $(B, L, N_S)$, and the shape of the intent features $h^{\max}$ is $(B, 1, H)$. The max-pooling layer reduces the matrix $h_S$ to a vector $s^{\max}$ with a shape of $(B, 1, N_S)$. Usually, when the slot results are utilized to update the intent features, all the values in $h_S$ are concatenated to $h^{\max}$ by reshaping $h_S$ into a vector with a shape of $(B, 1, L \times N_S)$ [30]. However, the sentence lengths $L$ differ in practice; thus, the uncertain $L$ leads to an indeterminate shape of the reshaped $h_S$ and of the new intent feature. This uncertainty makes it difficult to initialize the network with fixed dimensions when training or evaluating the data.
One solution to the uncertainty problem is to pad each sentence with a sequence of zeros so that all the sentences have the same length $L$, where $L$ is the maximum sequence length. With the help of the attention mask, the model can ignore the irrelevant positions in the prediction phase. However, this solution requires a large amount of GPU memory and a long computation time.
Instead, we used a max-pooling layer to compress the sequence of slot logits into a vector with a shape of $(B, 1, N_S)$. We present $h_S$ in terms of its column vectors $c_k$, for $k = 1, \ldots, N_S$, so the new vector $s^{\max}$, with elements $s^{\max}_k$ for $k = 1, \ldots, N_S$, is given as follows:

$$s^{\max}_k = \max_{1 \le i \le L} c_{k,i} \qquad (10)$$
In the next step, $s^{\max}$ is concatenated with $h^{\max}$; we obtain the new intent features $h^{\max} \oplus s^{\max}$, which will be used to update the intent logits.
The mechanism of max pooling retains the maximum value of each slot label regardless of its position and frequency. If a sentence contains a slot label, that slot's logit will be higher than the logits of the labels the sentence does not contain, and higher logits have a stronger effect when updating the intent logits.
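A short sketch of this pooling-and-concatenation step is given below; the tensor names are illustrative.

```python
import torch

def bi_concat_intent_features(intent_feat, slot_logits):
    """Bi-concatenation sketch: summarize the slot logits by max pooling over
    the sequence (Eq. (10)) and append them to the pooled intent features.
    Tensor names are assumptions."""
    # intent_feat: (B, 1, H) pooled intent features; slot_logits: (B, L, N_S)
    slot_summary = slot_logits.max(dim=1, keepdim=True).values   # (B, 1, N_S)
    return torch.cat([intent_feat, slot_summary], dim=-1)        # (B, 1, H + N_S)
```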
LSTM: The LSTM network is a variant of the RNN in which the output of each time step is fed into the next time step. Through its input, forget, cell, and output gates, the LSTM network can retain long-term memories from the beginning of the sequence. Thus, the final time-step output of the LSTM can be regarded as a summary of the whole utterance, yielding a condensed representation of the slot label probabilities.
Let the $\mathrm{LSTM}(\cdot)$ function consist of all the gate computations; the result of the $\mathrm{LSTM}(\cdot)$ function is the output of an LSTM cell. When we input $h_S$ row by row in sequence, we obtain

$$(o_i, c_i) = \mathrm{LSTM}(h_{S,i}, o_{i-1}, c_{i-1}), \quad i = 1, \ldots, L \qquad (11)$$

The final output of the LSTM network, $o_L$ in this case, is yielded by the following equation:

$$(o_L, c_L) = \mathrm{LSTM}(h_{S,L}, o_{L-1}, c_{L-1}) \qquad (12)$$
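A minimal sketch of this LSTM summarization follows, assuming the LSTM hidden size equals the number of slot labels and ignoring padding (in practice, the output at the last non-padded time step would be taken); all names and sizes are illustrative.

```python
import torch
import torch.nn as nn

# Sketch: summarize the slot logits with an LSTM and keep the final time step.
# The hidden size (set to N_S here) and variable names are assumptions.
lstm = nn.LSTM(input_size=110, hidden_size=110, batch_first=True)

def lstm_slot_summary(slot_logits):
    # slot_logits: (B, L, N_S); outputs: (B, L, N_S)
    outputs, (h_n, c_n) = lstm(slot_logits)
    return outputs[:, -1:, :]          # (B, 1, N_S), output of the final time step
```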
At this point, we need to compute the new intent logits. Another neural network, similar to the intent classifier, is applied for this purpose. Its layers are stacked in sequence as follows (a brief code sketch follows the list):
A dropout layer: dropout probability = $p$;
A linear layer: input dimensions $H + N_S$; output dimensions $d_f$;
An activation function: GELU;
A dropout layer: dropout probability = $p$;
A linear layer: input dimensions $d_f$; output dimensions $N_I$.
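A possible sketch of this intent-update stack, with illustrative sizes ($H = 768$, $N_S = 110$, $N_I = 60$, $d_f = 8192$, $p = 0.25$); it would be applied to the concatenated features from the previous step.

```python
import torch.nn as nn

# Sketch of the intent-update network used after bi-concatenation (sizes are illustrative).
intent_update = nn.Sequential(
    nn.Dropout(0.25),
    nn.Linear(768 + 110, 8192),
    nn.GELU(),
    nn.Dropout(0.25),
    nn.Linear(8192, 60),
)
# Example: new_intent_logits = intent_update(bi_concat_intent_features(intent_feat, slot_logits))
```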
Bi-attention-based computation: As Figure 4 shows, the model first computes the initial intent logits $h_I$ and slot logits $h_S$ using the classifiers. Secondly, $h_I$ is used to compute the contextual slot probabilities, as in the hier-attention architecture, and to update $h_S$ to obtain the new slot logits $h_S'$.
We need to reduce the matrix $h_S$ to a vector $s^{\max}$ (as in Equation (10)) with a shape of $(B, 1, N_S)$ for the slot-to-intent attention-based computation. Here, we let the maximum logit value denote the likelihood of the slot label being filled somewhere in the sentence, regardless of the position and number of occurrences of the slot label, just as in the max-pooling step of bi-concatenation.
Then, $s^{\max}$ is input into the transition function $g'(\cdot)$ to compute the transition probabilities for slots transitioning into intents:

$$T_{S \to I} = g'(s^{\max}) \qquad (13)$$
The dimensions of the linear layers are adjusted to the data size: we set the input dimensions of the first linear layer to $N_S$, and the output dimensions of the final linear layer to $N_I$.
Finally, by summing the initial intent logits $h_I$ and the contextual likelihood of the slot-to-intent transition labels $C_I$, we obtain the new intent logits $h_I'$:

$$h_I' = h_I + C_I \qquad (14)$$

where $C_I$ is computed from $h_I$ and $T_{S \to I}$ in the same way as the contextual likelihood in Equations (7) and (8).
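A minimal sketch of the slot-to-intent direction is given below, assuming the pooled slot logits are passed through a transition network `g_prime` and the result is added to the initial intent logits; the intermediate attention weighting analogous to Equations (7) and (8) is omitted here for brevity, and all names are illustrative.

```python
import torch

def bi_attention_intent_update(intent_logits, slot_logits, g_prime):
    """Sketch of the slot-to-intent direction: max-pool the slot logits (Eq. (10)),
    map them into the intent label space with the transition network `g_prime`
    (Eq. (13)), and add the result to the initial intent logits (Eq. (14)).
    `g_prime` and the direct residual sum are assumptions of this sketch."""
    slot_summary = slot_logits.max(dim=1, keepdim=True).values   # (B, 1, N_S)
    contextual = g_prime(slot_summary)                            # (B, 1, N_I)
    return intent_logits + contextual                             # (B, 1, N_I), new intent logits
```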