1. Introduction
Recently, pre-trained language models (LMs), such as BERT [1] and RoBERTa [2], have achieved dominant performance on almost all natural language processing (NLP) tasks, simply by fine-tuning these LMs with an extra task-specific head on task-specific training data for the downstream tasks. Despite the effectiveness and simplicity of fine-tuning LMs, there is still a wide gap between the objective functions of the pre-training and fine-tuning phases. A common conclusion in the literature [3,4] is that this mismatch results in the under-utilization of these powerful LMs.
Prompt-based approaches [5,6,7,8,9] have been proposed to address this problem. Unlike traditional supervised learning, prompt-based methods directly exploit the rich distributed knowledge stored in the LM parameters by reformulating a downstream task's objective in the same form as the pre-training objective, directly modelling the probability of words without using any task-specific layers [3]. As shown in Figure 1, sentiment classification, for example, aims to identify the sentiment of a given input sentence. In traditional LM fine-tuning, we take the softmax over the representation of a special token, such as [CLS], and use the true label y in the loss function to further train the LM; the class with the highest probability is then taken as the predicted sentiment. In typical prompt tuning, we instead add a template containing a [MASK] special token to the original input sequence X, feed the new sequence into the LM, and let the LM predict a target token from the vocabulary at the [MASK] position, which indicates the sentiment of the original input. Recent efforts show that prompt-based methods have achieved promising results in many sentence-level NLP tasks, such as natural language inference [10], sentence classification [5], and factual probing [11]. Despite this success in sentence-level classification tasks, prompt-based methods work poorly in token-level classification tasks, such as named entity recognition (NER) and part-of-speech (POS) tagging.
As a fundamental task, NER is irreplaceable in many downstream NLP tasks, such as event recognition and entity linking. NER aims to assign the named entities mentioned in a sentence to pre-defined categories, such as location, person, and organization. Previous approaches have often required an extra label-specific output dense layer that is randomly initialized, which makes it difficult for the model to converge to an optimal point. Liu et al. [12] adopted prompt tuning for NER without introducing any parameters beyond those of the pre-trained model. They enumerated all possible entity spans and filled them into templates, which means that inferring a single sentence requires feeding it into the model many times. Despite its effectiveness, this enumeration procedure is time-consuming and impractical.
Ma et al. [13] proposed a template-free prompt-tuning model for few-shot NER. They eliminated the use of templates and let the model predict class-related pivot words, derived from unlabelled data, at each entity position, while still predicting the original words at non-entity positions. In this way, inferring a sentence only requires feeding it into the model once. Their model achieved large gains in few-shot settings but performed only on par with ordinary fine-tuning in rich-resource settings.
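To illustrate the template-free idea, the sketch below constructs the training target for one sentence: entity positions are replaced by class-related pivot words while non-entity positions keep the original word. The pivot words shown are hypothetical placeholders, not the words actually derived by Ma et al. [13].

```python
# Illustrative sketch of template-free prompt-tuning targets (one forward pass per sentence).
# The pivot words below are hypothetical placeholders, not those derived in [13].
pivot_words = {"PER": "John", "LOC": "London"}   # class -> pivot word (assumed)

words = ["Obama", "visited", "Paris", "yesterday"]
tags  = ["PER",   "O",       "LOC",   "O"]       # IO-style labels

# Target sequence: pivot word at entity positions, original word elsewhere,
# so the MLM head can be trained to "re-predict" the sequence in one pass.
targets = [pivot_words[t] if t != "O" else w for w, t in zip(words, tags)]
print(targets)   # ['John', 'visited', 'London', 'yesterday']
```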
In this study, we propose a simple yet effective variation on prompt tuning for NER. In the BIO scheme, the tags B and I denote that the current word is at the beginning or inside of an entity, respectively, and O denotes that the current word is not part of an entity. In the IO scheme, the beginning of an entity is also tagged with I. Label words could, for example, be derived under the IO scheme, similar to Ma et al. [13]; however, this makes it difficult for the model to separate several consecutive homogeneous entities, and the correlations between tags are neglected. Furthermore, the beginning and the interior of an entity often convey different semantic information. For instance, the word City in the LOC entity New York City is more likely to be predicted as I-LOC rather than B-LOC in the BIO scheme, whereas in the IO scheme the implicit semantic gaps between the three words are neglected and all three words in New York City are treated equivalently. We derive the top-K tag-wise label words in the BIO scheme according to the frequency of occurrence and the corresponding normalized frequency. Then we let the pre-trained model predict the label words at each position and feed the generated logits (non-normalized probabilities) to a CRF layer to capture the correlations between the tags. We do not introduce any extra parameters beyond those of the pre-trained model to obtain the logits of all tags at each position.
Our contributions are as follows: (i) We found that the changes to the features after passing through the MLM head are limited, which makes it possible to improve the effectiveness on the NER task without introducing additional parameters. (ii) We propose a simple yet effective variation on prompt tuning for NER. (iii) We do not introduce any extra parameters beyond those of the pre-trained model to obtain the logits of all tags at each position. (iv) Experiments show that our proposed method outperforms state-of-the-art models on three popular datasets.
4. Modelling VPN
We let the LM predict several label words in the vocabulary and obtain the overall tag-related logits. These label words are relevant to tags rather than classes. In this way, we can also model the logits of positions labelled O and use the BIO scheme rather than the IO scheme, which allows a CRF layer to boost the model's performance.
In this work, we consider an NER task as a sequence-to-sequence task.
Figure 3 shows the overall architecture of our proposed model. Given an input sequence $X = (x_1, x_2, \dots, x_T)$ and the corresponding label sequence $Y = (y_1, y_2, \dots, y_T)$, we embed each word using a pre-trained LM to obtain an embedded sequence $H = (h_1, h_2, \dots, h_T)$:
$$H = \mathrm{LM}(X),$$
where $h_t \in \mathbb{R}^{d}$ is the last-layer hidden state of word $x_t$, and $T$ and $d$ denote the sequence length and the hidden dimension of the transformer model, respectively.
In order to take full advantage of the pre-trained model, along with BERT's pre-training stage, we calculate the word-prediction logits using the masked language model (MLM) head as follows:
$$P = \mathrm{MLMHead}(H), \qquad P \in \mathbb{R}^{T \times |V|},$$
where $|V|$ represents the cardinal number of the vocabulary.
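The two steps above can be realized with an off-the-shelf masked LM and no new parameters: the last-layer hidden states give $H$, and the existing MLM head maps them to vocabulary logits. The sketch below assumes a BERT checkpoint from HuggingFace Transformers and is a simplified illustration.

```python
# Sketch: obtain last-layer hidden states H (T x d) and word-prediction logits (T x |V|)
# by reusing only the pre-trained MLM head (no extra parameters). Checkpoint assumed.
import torch
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")

inputs = tokenizer("New York City is big .", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

H = outputs.hidden_states[-1]   # (1, T, d): last-layer hidden states
P = outputs.logits              # (1, T, |V|): logits from the MLM head
print(H.shape, P.shape)
```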
For each word $x_t$, we obtain the logit of each tag through mean pooling over the logits of the top-$K$ representative label words of that tag, that is,
$$z_{t,c} = \frac{1}{K} \sum_{k=1}^{K} P_{t,\,v_{c,k}},$$
where $v_{c,k}$ denotes the $k$-th label word of tag $c$, and $z_t \in \mathbb{R}^{|C|}$ with $|C|$ the number of tags.
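This mean pooling step can be implemented by gathering the logits of each tag's top-$K$ label words and averaging them. In the sketch below, the `word_logits` tensor and the label-word id mapping are assumed to come from the previous steps and the label-word selection.

```python
# Sketch: tag logits z (T x |C|) by mean pooling the MLM logits of each tag's
# top-K label words. `word_logits` (T x |V|) and `label_word_ids` are assumed
# to come from the previous steps / label-word selection.
import torch

def tag_logits(word_logits: torch.Tensor, label_word_ids: dict) -> torch.Tensor:
    # label_word_ids: tag -> list of K vocabulary ids of its label words
    columns = []
    for tag, ids in label_word_ids.items():
        columns.append(word_logits[:, ids].mean(dim=-1))   # (T,)
    return torch.stack(columns, dim=-1)                     # (T, |C|)

word_logits = torch.randn(6, 30522)                         # toy (T x |V|) logits
label_word_ids = {"B-LOC": [101, 205], "I-LOC": [330, 412], "O": [999, 1234]}
z = tag_logits(word_logits, label_word_ids)
print(z.shape)                                              # torch.Size([6, 3])
```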
Then, we feed the tag logits $z_t$ to a conditional random field (CRF) [26] layer. Implementation-wise, the CRF computes an energy for a candidate output $Y$ given a context $X$ (i.e., the input sequence), followed by a softmax operator to obtain the conditional likelihood, i.e.,
$$p(Y \mid X) = \frac{\exp\left(\sum_{t=1}^{T} z_{t,y_t} + \sum_{t=2}^{T} A_{y_{t-1},y_t}\right)}{\sum_{Y' \in \mathcal{Y}(X)} \exp\left(\sum_{t=1}^{T} z_{t,y'_t} + \sum_{t=2}^{T} A_{y'_{t-1},y'_t}\right)}.$$
Here, $\mathcal{Y}(X)$ is the set of all possible tag sequences, and the transition matrix $A$ characterizes the smoothness of the label sequence (the probability of switching between consecutive labels).
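For completeness, the sketch below computes this CRF conditional log-likelihood for a single sequence with the standard forward algorithm; the transition matrix $A$ is the only trainable parameter beyond the LM in this step. It is a simplified, single-sequence version with no batching or masking, not the exact implementation used in this paper.

```python
# Sketch: CRF conditional log-likelihood for one sequence via the forward algorithm.
# `z` are the tag logits (T x C) from the previous step; A is the (C x C)
# transition matrix. Simplified: single sequence, no batching or masking.
import torch

def crf_log_likelihood(z: torch.Tensor, tags: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    T, C = z.shape
    # Score (energy) of the gold tag sequence: emission + transition terms.
    gold = z[0, tags[0]]
    for t in range(1, T):
        gold = gold + z[t, tags[t]] + A[tags[t - 1], tags[t]]
    # Log partition function over all tag sequences (forward algorithm).
    alpha = z[0]                                                          # (C,)
    for t in range(1, T):
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A + z[t].unsqueeze(0), dim=0)
    log_Z = torch.logsumexp(alpha, dim=0)
    return gold - log_Z                                                   # log p(Y | X)

z = torch.randn(5, 3)                        # toy emissions for T=5 tokens, C=3 tags
A = torch.zeros(3, 3, requires_grad=True)    # learnable transition matrix
tags = torch.tensor([0, 1, 1, 2, 2])
loss = -crf_log_likelihood(z, tags, A)       # negative log-likelihood for training
print(loss.item())
```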