1. Introduction
Patents have become the main carrier of scientific and technological progress and innovation [1]. From patent literature, technicians can not only understand the development status of new products and technologies but also obtain inspiration and base new products and technologies on it [2]. However, the explosive growth of patent literature hinders patent analysts from finding valuable patents quickly and accurately, forcing them to consider more technical keywords and classification codes in order to locate suitable patents. Thus, keyword extraction methods help patent analysts find more state-of-the-art patents and technologies.
Keyword extraction is an important task in patent mining: it extracts keywords from a patent and classifies patents according to certain rules. Traditional approaches classify patents into similar technical areas using classification codes such as IPC and CPC [3]. However, classification codes struggle to summarize patent information in detail because of the large number of applications and the potential complexity of an invention. In addition, patent offices tend to assign classification codes according to the technical field and application field, which ignores more valuable information such as creative thinking and techniques. Therefore, various natural language processing methods have been applied in the patent field for better automatic patent classification. These patent keyword extraction methods can be roughly divided into unsupervised and supervised keyword extraction methods.
Unsupervised keyword extraction methods mainly rely on statistical analysis, such as N-grams, TF, IDF, and word frequency, or on graph models to measure the importance of words. Florescu combined the TF and IDF metrics, computing the ratio of candidate word frequency to inverse document frequency to obtain candidate word weights [4], and then sorted the candidate words to determine keywords. Hassoud et al. [5] calculated an average value based on the positions of candidate keywords to obtain candidate keyword weights. Unlike statistical analysis, Mihalcea [6] proposed using keyword co-occurrence relationships within fixed windows to establish connections between nodes and using PageRank to update node weights, thereby extracting keywords from the candidate words. In addition, various studies have introduced additional information to update the weights of graph models, such as pointwise mutual information [7], word meaning information [8], and location and topic information [9], in order to identify keywords more effectively. However, existing unsupervised keyword extraction methods not only ignore low-frequency words and highly related words [10] but also have difficulty extracting the main technical words [11].
Therefore, supervised keyword extraction methods have been introduced, which use machine learning algorithms or deep learning models to transform keyword extraction into encoding or binary classification tasks. Yang et al. [12] used a neural network model to extract features from candidate keywords and then used a label classification layer based on the Softmax function to determine candidate keywords. Duari et al. [13] constructed a naive Bayesian model in advance and combined features such as node strength, position rank, and clustering coefficient to classify candidate words and obtain keywords. Wei et al. [14] extracted candidate keywords using long short-term memory (LSTM) neural networks and logistic regression models and set recombination filtering rules to improve the model's recognition of low-frequency and long-tail keywords. The KEA method proposed by Frank [15] uses a naive Bayesian model to classify candidate words based on their TF-IDF values and positional information; Wang et al. [16] used support vector machines to filter keywords based on word frequency and positional information. Haddoud et al. [17] used logistic regression to extract keywords based on word length and frequency. Meng et al. [18] proposed CopyRNN, based on the Seq2Seq model, which uses a recurrent neural network to compress the semantic information of a given text into dense vectors and then decodes these vectors into keywords that exist in the target vocabulary. Zhang et al. [19] divided text words into positive and negative categories and trained an LSTM model to classify the words in a text and obtain keywords, directly modeling keyword extraction as a binary classification task on words.
In the process of patent examination, a patent examiner comes to understand a patent by comprehensively considering the technical field, technical problem, technical solution, and technical effect in its description, and then extracts sufficient technical keywords from these four parts. However, existing keyword extraction methods only extract keywords from abstracts or titles, ignoring the expression of the whole patent, and they also have difficulty handling long texts such as patent descriptions. Therefore, this paper proposes a patent keyword extraction method based on corpus classification, termed "PKECC", which simulates the behavior of a patent examiner. The main contributions are listed below.
A corpus classification model based on multi-level feature fusion is introduced to divide the sentences of the description. Since a patent description is a long text, the model adopts multi-level Bert encoding layers and multi-level self-attention layers over words, sentences, and paragraphs to divide the sentences and thereby simplify the subsequent keyword extraction.
A keyword extraction model based on the fusion of BiLSTM and CRF is proposed to extract keywords from the divided sentences. Although BiLSTM alone can extract a number of keywords, its extraction performance is limited. Thus, the proposed model uses BiLSTM to obtain more comprehensive semantic features and a CRF layer for better keyword prediction.
The proposed PKECC method is compared with five traditional or state-of-the-art models on three types of patent datasets. The results verify that PKECC achieves better accuracy, F1 score, and recall on patent datasets with a small number of categories.
This paper is organized as follows. Section 2 introduces work related to the Bert model, the hierarchical attention mechanism, and the BiLSTM model. Section 3 gives the details of the proposed keyword extraction method. Section 4 presents the experimental results on related patent datasets. Finally, concluding remarks are given in Section 5.
2. Related Work
2.1. Bert
Because a single, fixed encoding method cannot fully capture the semantic features of a text, word embedding approaches, represented by Bert (bidirectional encoder representations from transformers) and Word2Vec, have gradually gained ground. The Bert model was launched by Google in 2018 as a pre-trained language representation model based on the Transformer architecture, and it significantly advanced the development of language understanding [20]. The Bert model uses a 12-layer Transformer architecture trained on large general corpora, utilizing the context from all encoding layers to learn deep bi-directional representations.
Compared with Word2Vec, which produces static word vectors, the Bert model produces dynamic word vectors: after receiving the input text, it converts each word into three types of vectors, namely word vectors, sentence vectors, and position vectors. The word vector is obtained from each word in the text by the Bert model; sentence vectors represent the global semantic information of the text and are integrated with the semantic information of the characters; position vectors represent the differences in semantic information carried by words appearing at different positions in the text.
Transformer is a deep learning architecture proposed by Vaswani et al. in 2017. It relies on a multi-head attention mechanism to capture contextual information in input sequences, making it highly suitable for processing sequential data such as text. Notably, it contains no recurrent units, so it requires less training time than earlier recurrent architectures such as LSTM. The multi-head attention in the Transformer architecture includes encoder self-attention in the encoder, decoder self-attention in the decoder, and encoder-decoder attention. The attention computation is performed in parallel multiple times, each repetition being referred to as an attention head, and the outputs of all heads are then merged to form the final attention result. Transformer can thus capture the multiple relationships and subtle differences between words, which is crucial for modeling long-distance dependencies and understanding the context in which vocabulary appears.
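For reference, the scaled dot-product attention and the multi-head combination at the core of the Transformer, following Vaswani et al., can be written as:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V $$

$$ \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}), \qquad \mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O} $$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimensionality of the key vectors.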
The training of Bert is divided into a pre-training stage and a fine-tuning stage [21]. During the pre-training stage, bi-directional pre-training is performed on a large quantity of unlabeled text data. The fine-tuning stage initializes all parameters with the pre-trained model and then trains them on labeled data for a specific task. Different downstream tasks can thus train different models, allowing Bert to be applied flexibly to different tasks without training from scratch.
Although Bert is considered one of the strongest models among current NLP algorithms, it still has drawbacks, for example, the mismatch between pre-training and fine-tuning and the large number of training steps, which require a significant amount of computation. Various improved versions of Bert have therefore been proposed, including improved training methods, optimized model structures, and miniaturized models.
In addition, traditional Bert models perform poorly on long texts, and patent descriptions are exactly such texts. Therefore, extracting keywords from long texts is a focus of this paper. The proposed PKECC adopts a multi-level Bert model that divides the description into word-level, sentence-level, and paragraph-level Bert models to simplify the overall Bert encoding.
2.2. Hierarchical Attention Mechanism
In the field of long text classification, Yang et al. proposed a hierarchical attention mechanism [22]. When constructing the classifier, this mechanism is mainly divided into a word-level attention layer and a sentence-level attention layer and uses a bi-directional recurrent neural network together with a traditional attention mechanism for information extraction.

The hierarchical attention mechanism divides a long text into multiple sentences in advance. The word-level attention layer takes as input the word embedding matrix formed from the sentences, uses a bi-directional recurrent neural network to learn the contextual information of the words in each sentence, and then identifies keywords through the attention mechanism. The sentence-level attention layer combines the word vectors obtained from the word-level layer to form sentence vectors, likewise uses a bi-directional recurrent neural network to learn the relationships between sentences, and combines attention to determine key sentences. Finally, the text category is obtained through the softmax function.
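As a concrete reference, the word-level attention in the hierarchical attention network of Yang et al. [22] scores each hidden state $h_{it}$ against a learned word-level context vector $u_w$ and forms the sentence vector $s_i$ as a weighted sum:

$$ u_{it} = \tanh(W_w h_{it} + b_w), \qquad \alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t}\exp(u_{it}^{\top} u_w)}, \qquad s_i = \sum_{t}\alpha_{it} h_{it} $$

The sentence-level layer applies the same form to the sentence vectors to obtain the document representation.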
Although the hierarchical attention mechanism can effectively reduce model complexity, word vectors of different lengths and the relationships among them are difficult to identify with a traditional attention mechanism. In contrast, the self-attention mechanism captures the interrelationships among word vectors better. Therefore, this paper incorporates self-attention into the hierarchical attention mechanism to achieve better classification performance.
2.3. BiLSTM
LSTM (long short-term memory) is a form of RNN (recurrent neural network). LSTM was proposed by Hochreiter and Schmidhuber in 1997 to overcome the vanishing gradient problem in RNN training and to capture long-term dependencies in sequence data. The traditional LSTM unit includes a memory cell, an input gate, a forget gate, and an output gate. LSTM can capture the relationships within the input text and alleviate gradient explosion and gradient vanishing on long texts. Whereas logistic regression classifies mainly according to maximum-likelihood probability and performs well on classification problems, LSTM, which originates from RNN, is better suited to processing sequential data. However, in sentence modeling LSTM still lacks the ability to encode information from back to front.
BiLSTM (bi-directional long short-term memory) is a recurrent neural network architecture that extends the traditional LSTM by processing input sequences in both the forward and backward directions, allowing the network to capture the past and future dependencies that may exist at each time step. The architecture consists of two LSTM layers: a forward LSTM layer and a backward LSTM layer. The forward layer processes the input sequence from the first time step to the last; at each step the input gate, forget gate, and output gate interact to capture the relevant information, and the hidden state represents the information accumulated from the input sequence up to the current time step. The backward layer processes the input sequence in reverse order, from the last time step toward the first, and therefore captures the opposite dependencies; its hidden state represents the information from the input sequence up to the current time step, but in the opposite order. By combining the hidden states of the forward and backward layers at each time step, the model obtains a comprehensive understanding of the input sequence. This bi-directional information flow is valuable in tasks where bi-directional context is crucial, such as speech recognition, machine translation, and sentiment analysis.
Although BiLSTM has advantages in capturing bi-directional dependencies, it predicts each label independently by choosing the label with the highest probability and therefore ignores dependencies within the label sequence. Thus, the proposed PKECC combines BiLSTM with a CRF model to exploit sequence annotation in the keyword extraction task, pairing the strong semantic vector representation ability of the BiLSTM architecture with the advantage of CRF in learning transition probabilities between labels, and achieves good results.
3. Methods
This section describes the proposed keyword extraction method based on corpus classification. It first introduces the data collection and data preprocessing performed before keyword extraction, then describes the corpus classification model based on a multi-level attention mechanism, next introduces the keyword extraction method based on the fusion of BiLSTM and CRF, and finally presents the overall steps of the proposed algorithm.
3.1. Data Collection
To verify the performance of the proposed model, the patent datasets include some open datasets and self-annotated patent datasets. The self-annotated datasets are downloaded from Patsnap and other commercial patent databases. The patent datasets include the text, description, sentence types, and annotated keywords. The sentence types are divided into technical field, technical problem, technical solution, and technical effect.
3.2. Data Preprocessing
The downloaded patent datasets contain many duplicate patent texts and inconsistent text formats. Moreover, patent text includes many function words, pronouns, verbs, and nouns without substantive meaning. Therefore, this paper uses the HIT stop word list to construct a patent stop dictionary, which consists of publicly available, frequently used dictionaries as well as descriptive words specific to patents. In addition, patent-related stop words are added based on the patent datasets, such as "including", "disclosed", "applicable to", "present invention", and "combined". Meanwhile, the Jieba word segmentation tool is applied to the Chinese patents for fine-grained segmentation and predicate phrase annotation, and the results serve as inputs to the LSTM layer.
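A minimal preprocessing sketch along these lines is shown below; it assumes a plain-text stopword file (one word per line) built from the HIT list plus the patent-specific terms, and the file name and example sentence are illustrative rather than taken from the paper.

```python
import jieba.posseg as pseg

# Load the combined stopword dictionary (HIT stopwords plus patent-specific
# terms such as "包括", "公开", "本发明"); the path and file format are assumptions.
with open("patent_stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def preprocess(text: str):
    """Segment a Chinese patent sentence with Jieba and drop stopwords.

    Returns (tokens, pos_tags), which can be fed to the downstream
    encoding / LSTM layers.
    """
    tokens, pos_tags = [], []
    for word, flag in pseg.cut(text):          # fine-grained segmentation with POS tags
        word = word.strip()
        if not word or word in stopwords:      # remove stopwords and empty tokens
            continue
        tokens.append(word)
        pos_tags.append(flag)
    return tokens, pos_tags

# Example usage on a fragment of a patent description.
tokens, tags = preprocess("本发明公开了一种基于深度学习的专利关键词抽取方法。")
print(tokens)
print(tags)
```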
3.3. Corpus Classification Model Based on Multi-Level Feature Fusion
The traditional patent keyword extraction method mainly uses short text content such as abstracts and titles for keyword extraction and classification. Unlike traditional keyword extraction methods, the proposed algorithm first divides the corpus of the patent specification into four aspects, namely technical field, technical problem, technical solution, and technical effect, in preparation for the subsequent keyword extraction. Therefore, PKECC adopts a hierarchical attention mechanism to process patent classification, as shown in Figure 1.

As shown in Figure 1, the proposed mechanism consists of five components: a multi-level Bert encoding layer, a word-level self-attention layer, a sentence-level self-attention layer, a paragraph-level self-attention layer, and a classification layer. First, the multi-level Bert encoding layer divides the input text into paragraphs, sentences, and words and encodes them with the Bert model. Then, the word-level, sentence-level, and paragraph-level self-attention layers encode the words, sentences, and paragraphs, respectively; each applies a forward and backward GRU for feature extraction and a self-attention mechanism for better classification performance. Finally, the classification layer uses the softmax function to classify each sentence into technical field, technical problem, technical solution, or technical effect.
3.3.1. Multi-Level Bert Encoding Layer
Patent specifications contain a large quantity of long sentences and textual information, and traditional attention layers cannot effectively classify long texts. Therefore, the multi-level Bert encoding layer divides the entire text into K paragraphs, each paragraph consisting of L sentences and each sentence consisting of T words, where $w_{mij}$ denotes the j-th word of the i-th sentence in the m-th paragraph. The layer is accordingly divided into a word-encoding layer, a sentence-encoding layer, and a paragraph-encoding layer. This method divides the document into three hierarchical levels, namely words, sentences, and paragraphs, and applies the Bert model to each level in turn, so that the entire document can be divided into multiple parts to improve the accuracy of corpus classification.
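A sketch of this paragraph/sentence/word division and Bert encoding is given below, using the Hugging Face transformers library with the bert-base-chinese checkpoint; the checkpoint choice, splitting heuristics, and mean pooling are assumptions for illustration, not the paper's exact configuration.

```python
import re
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def split_document(text: str):
    """Divide a description into K paragraphs, each a list of sentences."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    return [re.split(r"(?<=[。！？；])", p) for p in paragraphs]

@torch.no_grad()
def encode_sentence(sentence: str) -> torch.Tensor:
    """Word-level Bert encoding: one contextual vector per token."""
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=128)
    outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)      # (T, 768) word vectors

def encode_paragraphs(paragraphs):
    """Sentence vectors are taken here as the mean of their word vectors;
    this is a simplification of the multi-level self-attention layers
    described in the next subsection."""
    encoded = []
    for sentences in paragraphs:
        sent_vecs = [encode_sentence(s).mean(dim=0)
                     for s in sentences if s.strip()]
        if sent_vecs:
            encoded.append(torch.stack(sent_vecs))   # (L, 768) per paragraph
    return encoded
```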
3.3.2. Multi-Level Self-Attention Layer
In order to extract effective features from the multi-level Bert encoding layer, the multi-level attention fusion mechanism applies a self-attention mechanism to the word-encoding layer, sentence-encoding layer, and paragraph-encoding layer, respectively. A BiGRU model then performs forward and backward learning and concatenation, effectively separating the technical field, technical problem, technical solution, and technical effect in the document. To alleviate the vanishing gradient problem of RNNs, the GRU simplifies the LSTM structure by merging its three gating units into two gating units: the update gate and the reset gate. The update gate $z_t$ controls how much of the state information at time $t-1$ enters the state at time $t$, and the reset gate $r_t$ controls how much of the previous hidden state is used when computing the candidate state. The two gates are computed as shown in Equations (1) and (2):

$$ r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \tag{1} $$

$$ z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \tag{2} $$

where $W_{xr}$ and $W_{xz}$ are the weight matrices between the input $x_t$ and the reset and update gates, respectively; $W_{hr}$ and $W_{hz}$ are the weight matrices between $h_{t-1}$ and the reset and update gates, respectively; $b_r$ and $b_z$ are the corresponding bias terms; and $\sigma$ is the sigmoid function, which maps the result into [0, 1]. The value of the update gate reflects how much state information is carried over from the previous moment, while the reset gate reflects how much of the previous state is written into the candidate state. Furthermore, the candidate state $\tilde{h}_t$ and the hidden state $h_t$ of the current node are obtained by Equations (3) and (4):

$$ \tilde{h}_t = \tanh\!\left(w_h x_t + u_h (r_t \circ h_{t-1}) + b_h\right) \tag{3} $$

$$ h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \tag{4} $$

where $\circ$ denotes the Hadamard product, $w_h$ and $u_h$ are weight matrices, and $b_h$ is the corresponding bias; the activation function tanh scales the result into [−1, 1].
BiGRU processes the vectorized semantic features of the text in both chronological and reverse order on the basis of GRU and concatenates the two GRU outputs for each word into the final output. At the current time $t$, the forward and backward hidden-layer states are computed as in Equations (5) and (6):

$$ \overrightarrow{h}_t = \mathrm{GRU}\!\left(x_t, \overrightarrow{h}_{t-1}\right) \tag{5} $$

$$ \overleftarrow{h}_t = \mathrm{GRU}\!\left(x_t, \overleftarrow{h}_{t+1}\right) \tag{6} $$

Finally, the concatenation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ at time $t$ yields the final hidden-layer state.
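A minimal PyTorch sketch of one level of this BiGRU plus self-attention encoder is given below; the hidden sizes, the single-head attention, and the mean pooling are illustrative assumptions standing in for the paper's exact attention-weighted aggregation.

```python
import torch
import torch.nn as nn

class BiGRUSelfAttention(nn.Module):
    """One level of the multi-level self-attention layer: a BiGRU followed
    by self-attention over its hidden states, pooled into a single vector."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                          num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) word / sentence / paragraph vectors
        h, _ = self.bigru(x)                      # (batch, seq_len, 2*hidden_dim)
        attended, _ = self.attn(h, h, h)          # self-attention over hidden states
        return attended.mean(dim=1)               # pooled vector for the next level

# Stacking three such levels and a softmax classifier mirrors the
# word -> sentence -> paragraph hierarchy described above.
word_level = BiGRUSelfAttention(input_dim=768)       # consumes Bert word vectors
sentence_level = BiGRUSelfAttention(input_dim=512)   # consumes pooled word-level outputs
paragraph_level = BiGRUSelfAttention(input_dim=512)  # consumes pooled sentence-level outputs
classifier = nn.Linear(512, 4)   # technical field / problem / solution / effect
```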
3.4. A Keyword Extraction Method Based on BiLSTM and CRF Fusion
In order to improve the accuracy of extracting keywords from sentences, recurrent neural networks, represented by LSTM, have been continuously applied to keyword extraction by exploiting the temporal relationships among the words within a sentence. The keyword extraction method based on the fusion of BiLSTM and CRF mainly uses BiLSTM to extract features from the four categories of sentences, namely technical field, technical problem, technical solution, and technical effect. Then, a CRF (conditional random field) is used to model the intrinsic relationships between labels and select suitable keywords.
As shown in Figure 2, the method is roughly divided into four layers: an embedding layer, a BiLSTM layer, a CRF layer, and a decoding layer. The embedding layer directly uses the sentence vectors obtained from the multi-level Bert encoding layer as the model input. The BiLSTM layer performs forward and backward LSTM learning on the encoded word vectors to obtain label vectors at each position, thereby identifying the dependencies between words; at the same time, keyword labels are used for supervised training of the BiLSTM so that it learns the rules of keyword recognition. The CRF layer scores the possible label sequences based on the label vectors and identifies the path with the highest probability. Finally, the decoding layer decodes this path to obtain the keywords for each category. The CRF predicts the keyword labels in the sequence based on transition and emission probabilities, thereby capturing the dependencies between labels and improving the accuracy of keyword recognition.
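A hedged sketch of such a BiLSTM-CRF tagger in PyTorch is shown below, using the pytorch-crf package for the CRF layer; the BIO tag set, the dimensions, and the class interface are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """BiLSTM produces per-token emission scores; the CRF models label transitions."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256,
                 num_tags: int = 3):           # e.g. BIO tags: B-KEY, I-KEY, O
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(x)                  # (batch, seq_len, 2*hidden_dim)
        return self.emission(h)                # (batch, seq_len, num_tags)

    def loss(self, x, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(x), tags, mask=mask, reduction="mean")

    def decode(self, x, mask):
        # Viterbi decoding: the highest-probability tag path for each sentence.
        return self.crf.decode(self._emissions(x), mask=mask)
```

Contiguous B-KEY/I-KEY spans in the decoded path are then mapped back to surface words to yield the keywords for each of the four sentence categories.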
3.5. Framework of the Proposed Method
The whole procedure is divided into four phases. First, data collection gathers the patent datasets and annotates the related keywords, while data preprocessing standardizes the patent text and removes stop words. Second, the corpus classification model based on multi-level feature fusion divides the content of the patent specification into words, sentences, and paragraphs for multi-level classification and uses Bert for encoding. Third, the model uses the self-attention mechanism and the BiGRU network to learn features at the three levels, dividing each sentence into technical field, technical problem, technical solution, or technical effect. Finally, the keyword extraction method based on the fusion of BiLSTM and CRF extracts the corresponding keywords from the four types of statements and outputs the classified keywords.
5. Conclusions
Most existing keyword extraction approaches extract keywords only from short texts such as abstracts or titles and ignore the expression of the whole patent. Therefore, this paper simulates the way human patent examiners read a patent and divides the description of a patent document into four aspects: technical field, technical problem, technical solution, and technical effect. Before keyword extraction, the proposed corpus classification model based on multi-level feature fusion adopts multi-level Bert encoding layers and multi-level self-attention layers over words, sentences, and paragraphs to divide the sentences into the above aspects. Based on this division, a BiLSTM-CRF algorithm built on the Bert corpus classification is proposed to extract keywords from the four aspects. The experimental results validate that the proposed mechanism improves the accuracy of keyword extraction and performs better when the number of categories is low. In addition, PKECC simplifies the processing of long texts and extracts more related keywords, with a recall of 81.75%, an accuracy of 84%, and an F1 score of 84%.
Although PKECC achieves good performance in patent keyword extraction, some problems remain to be solved. Firstly, the generalization ability of the proposed model should be improved. The performance of PKECC relies on a large number of labeled training samples, but the limited quantity and the imbalance of professionally annotated datasets tend to limit keyword extraction; future keyword extraction approaches should therefore expand the training sets and address dataset imbalance to ensure better generalization. Secondly, stronger keyword extraction methods should be introduced. The proposed PKECC performs worse when the number of categories is high, yet patent keyword extraction datasets tend to have many categories, so it is necessary to incorporate stronger keyword extraction methods for multi-class datasets. Thirdly, large-scale patent annotation datasets should be considered. Training deep learning models requires a large number of annotated patent datasets, and manual annotation standards differ significantly between annotators. However, existing studies tend to lack standardized annotated patent datasets, which makes fair comparison difficult. Therefore, standardized annotated patent datasets are a promising direction for development.