1. Introduction
Patents have become the main carrier of scientific and technological progress and innovation [1]. From patent literature, technicians can not only understand the development status of new products and technologies but also obtain inspiration and base new products and technologies on it [2]. However, the explosive growth of patent literature hinders patent analysts from finding valuable patents quickly and accurately, forcing them to consider more technical keywords and classification codes in order to locate suitable patents. Thus, keyword extraction methods help patent analysts find more state-of-the-art patents and technologies.
Keyword extraction is an important task in patent mining: it extracts keywords from a patent and classifies patents according to certain rules. Traditional approaches classify patents into similar technical areas using classification codes such as IPC and CPC [3]. However, classification codes struggle to summarize patent information in detail because of the large number of applications and the potential complexity of an invention. In addition, patent offices tend to assign classification codes according to the technical field and application field, which ignores more valuable information such as creative thinking and techniques. Therefore, various natural language processing methods have been applied in the patent field for better automatic patent classification. These patent keyword extraction methods can be roughly divided into unsupervised and supervised keyword extraction methods.
Unsupervised keyword extraction methods mainly rely on statistical analysis, such as N-grams, TF, IDF, and word frequency, or on graph models to measure the importance of words. Florescu combined the TF and IDF metrics, computing the ratio of candidate word frequency to inverse document frequency to obtain candidate word weights [4], and then sorted the candidate words to determine keywords. Hassoud et al. [5] calculated an average value based on the positions of candidate keywords to obtain candidate keyword weights. Unlike statistical analysis, Mihalcea [6] proposed using keyword co-occurrence relationships within fixed windows to establish connections between nodes and using PageRank to update node weights, thereby extracting keywords from the candidate words. In addition, various studies have introduced additional information to update the weights of graph models, such as pointwise mutual information [7], word meaning information [8], and location and topic information [9], in order to identify keywords more effectively. However, existing unsupervised keyword extraction methods not only ignore low-frequency words and highly related words [10] but also have difficulty extracting the main technical words [11].
Therefore, supervised keyword extraction methods have been introduced, which use machine learning algorithms or deep learning models to transform keyword extraction into encoding or binary classification tasks. Yang et al. [12] used a neural network model to extract features from candidate keywords and then used a label classification layer based on the Softmax function to determine candidate keywords. Duari et al. [13] constructed a naive Bayesian model in advance and combined features such as node strength, position rank, and clustering coefficient to classify candidate words and obtain keywords. Wei et al. [14] extracted candidate keywords using long short-term memory (LSTM) neural networks and logistic regression models and set recombination filtering rules to improve the model's recognition of low-frequency and long-tail keywords. The KEA method proposed by Frank [15] uses a naive Bayesian model to classify candidate words based on their TF-IDF values and positional information; Wang et al. [16] used support vector machines to filter keywords based on word frequency and positional information. Haddoud et al. [17] used logistic regression to extract keywords based on word length and frequency. Meng et al. [18] proposed CopyRNN, based on the Seq2Seq model, which uses a recurrent neural network to compress the semantic information of a given text into dense vectors and then decodes these vectors into keywords that exist in the target vocabulary. Zhang et al. [19] divided text words into positive and negative categories and trained an LSTM model to classify the words in a text and obtain keywords, directly modeling keyword extraction as a binary classification task on words.
In the process of patent examination, a patent examiner comes to understand a patent by comprehensively considering the technical field, technical problem, technical solution, and technical effect in its description, and then extracts sufficient technical keywords from these four parts. However, existing keyword extraction methods only extract keywords from abstracts or titles, ignoring the expression of the whole patent, and they also have difficulty handling long texts such as patent descriptions. Therefore, this paper proposes a patent keyword extraction method based on corpus classification, termed "PKECC", which simulates the behavior of a patent examiner. The main contributions are listed below.
A corpus classification model based on multi-level feature fusion is introduced to divide the sentences of the description. Since a patent description is a long text, the model adopts multi-level Bert encoding layers and multi-level self-attention layers over words, sentences, and paragraphs to divide the sentences and thereby simplify the subsequent keyword extraction.
A keyword extraction model based on the fusion of BiLSTM and CRF is proposed to extract keywords from the divided sentences. Although BiLSTM alone can extract a number of keywords, its extraction performance is limited. Thus, the proposed model uses BiLSTM to obtain more comprehensive semantic features and a CRF layer for better keyword prediction.
The proposed PKECC method is compared with five traditional or state-of-the-art models on three types of patent datasets. The results verify that PKECC achieves better accuracy, F1 score, and recall on patent datasets with a small number of categories.
This paper is organized as follows. Section 2 introduces work related to the Bert model, the hierarchical attention mechanism, and the BiLSTM model. Section 3 gives the details of the proposed keyword extraction method. Section 4 presents the experimental results on related patent datasets. Finally, concluding remarks are given in Section 5.
2. Related Work
2.1. Bert
Because a single, fixed encoding method cannot fully capture the semantic features of a text, word embedding approaches, represented by Bert (bidirectional encoder representations from transformers) and Word2Vec, have gradually gained ground. The Bert model was launched by Google in 2018 as a pre-trained language representation model based on the Transformer architecture, and it significantly advanced the development of language understanding [20]. The Bert model uses a 12-layer Transformer architecture trained on large general corpora, utilizing the context from all encoding layers to learn deep bi-directional representations.
Compared with Word2Vec, which produces static word vectors, the Bert model produces dynamic word vectors: after receiving the input text, it converts each word into three types of vectors, namely word vectors, sentence vectors, and position vectors. The word vector is obtained from each word in the text by the Bert model; sentence vectors represent the global semantic information of the text and are integrated with the semantic information of the characters; position vectors represent the differences in semantic information carried by words appearing at different positions in the text.
Transformer is a deep learning architecture proposed by Vaswani et al. in 2017. It relies on a multi-head attention mechanism to capture contextual information in input sequences, making it highly suitable for processing sequential data such as text. Notably, it contains no recurrent units, so it requires less training time than earlier recurrent architectures such as LSTM. The multi-head attention in the Transformer architecture includes encoder self-attention in the encoder, decoder self-attention in the decoder, and encoder-decoder attention. The attention computation is performed in parallel multiple times, each repetition being referred to as an attention head, and the outputs of all heads are then merged to form the final attention result. Transformer can thus capture the multiple relationships and subtle differences between words, which is crucial for modeling long-distance dependencies and understanding the context in which vocabulary appears.
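For reference, the scaled dot-product attention and the multi-head combination at the core of the Transformer, following Vaswani et al., can be written as:

$$ \mathrm{Attention}(Q,K,V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V $$

$$ \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}), \qquad \mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\ldots,\mathrm{head}_h)\,W^{O} $$

where $Q$, $K$, and $V$ are the query, key, and value matrices and $d_k$ is the dimensionality of the key vectors.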
The training of Bert is divided into a pre-training stage and a fine-tuning stage [21]. During the pre-training stage, bi-directional pre-training is performed on a large quantity of unlabeled text data. The fine-tuning stage initializes all parameters with the pre-trained model and then trains them on labeled data for a specific task. Different downstream tasks can thus train different models, allowing Bert to be applied flexibly to different tasks without training from scratch.
Although Bert is considered one of the strongest models among current NLP algorithms, it still has drawbacks, for example, the mismatch between pre-training and fine-tuning and the large number of training steps, which require a significant amount of computation. Various improved versions of Bert have therefore been proposed, including improved training methods, optimized model structures, and miniaturized models.
In addition, traditional Bert models perform poorly on long texts, and patent descriptions are exactly such texts. Therefore, extracting keywords from long texts is a focus of this paper. The proposed PKECC adopts a multi-level Bert model that divides the description into word-level, sentence-level, and paragraph-level Bert models to simplify the overall Bert encoding.
2.2. Hierarchical Attention Mechanism
In the field of long text classification, Yang et al. proposed a hierarchical attention mechanism [22]. When constructing the classifier, this mechanism is mainly divided into a word-level attention layer and a sentence-level attention layer and uses a bi-directional recurrent neural network together with a traditional attention mechanism for information extraction.

The hierarchical attention mechanism divides a long text into multiple sentences in advance. The word-level attention layer takes as input the word embedding matrix formed from the sentences, uses a bi-directional recurrent neural network to learn the contextual information of the words in each sentence, and then identifies keywords through the attention mechanism. The sentence-level attention layer combines the word vectors obtained from the word-level layer to form sentence vectors, likewise uses a bi-directional recurrent neural network to learn the relationships between sentences, and combines attention to determine key sentences. Finally, the text category is obtained through the softmax function.
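As a concrete reference, the word-level attention in the hierarchical attention network of Yang et al. [22] scores each hidden state $h_{it}$ against a learned word-level context vector $u_w$ and forms the sentence vector $s_i$ as a weighted sum:

$$ u_{it} = \tanh(W_w h_{it} + b_w), \qquad \alpha_{it} = \frac{\exp(u_{it}^{\top} u_w)}{\sum_{t}\exp(u_{it}^{\top} u_w)}, \qquad s_i = \sum_{t}\alpha_{it} h_{it} $$

The sentence-level layer applies the same form to the sentence vectors to obtain the document representation.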
Although the hierarchical attention mechanism can effectively reduce model complexity, word vectors of different lengths and the relationships among them are difficult to identify with a traditional attention mechanism. In contrast, the self-attention mechanism captures the interrelationships among word vectors better. Therefore, this paper incorporates self-attention into the hierarchical attention mechanism to achieve better classification performance.
2.3. BiLSTM
LSTM (long short-term memory) is a form of RNN (recurrent neural network). LSTM was proposed by Hochreiter and Schmidhuber in 1997 to overcome the vanishing gradient problem in RNN training and to capture long-term dependencies in sequence data. The traditional LSTM unit includes a memory cell, an input gate, a forget gate, and an output gate. LSTM can capture the relationships within the input text and alleviate gradient explosion and gradient vanishing on long texts. Whereas logistic regression classifies mainly according to maximum-likelihood probability and performs well on classification problems, LSTM, which originates from RNN, is better suited to processing sequential data. However, in sentence modeling LSTM still lacks the ability to encode information from back to front.
BiLSTM (bi-directional long short-term memory) is a recurrent neural network architecture that extends the traditional LSTM by processing input sequences in both the forward and backward directions, allowing the network to capture the past and future dependencies that may exist at each time step. The architecture consists of two LSTM layers: a forward LSTM layer and a backward LSTM layer. The forward layer processes the input sequence from the first time step to the last; at each step the input gate, forget gate, and output gate interact to capture the relevant information, and the hidden state represents the information accumulated from the input sequence up to the current time step. The backward layer processes the input sequence in reverse order, from the last time step toward the first, and therefore captures the opposite dependencies; its hidden state represents the information from the input sequence up to the current time step, but in the opposite order. By combining the hidden states of the forward and backward layers at each time step, the model obtains a comprehensive understanding of the input sequence. This bi-directional information flow is valuable in tasks where bi-directional context is crucial, such as speech recognition, machine translation, and sentiment analysis.
Although BiLSTM has advantages in capturing bi-directional dependencies, it predicts each label independently by choosing the label with the highest probability and therefore ignores dependencies within the label sequence. Thus, the proposed PKECC combines BiLSTM with a CRF model to exploit sequence annotation in the keyword extraction task, pairing the strong semantic vector representation ability of the BiLSTM architecture with the advantage of CRF in learning transition probabilities between labels, and achieves good results.
3. Methods
This section describes the proposed keyword extraction method based on corpus classification. It first introduces the data collection and data preprocessing performed before keyword extraction, then describes the corpus classification model based on a multi-level attention mechanism, next introduces the keyword extraction method based on the fusion of BiLSTM and CRF, and finally presents the overall steps of the proposed algorithm.
3.1. Data Collection
To verify the performance of the proposed model, the patent datasets include some open datasets and self-annotated patent datasets. The self-annotated datasets are downloaded from Patsnap and other commercial patent databases. The patent datasets include the text, description, sentence types, and annotated keywords. The sentence types are divided into technical field, technical problem, technical solution, and technical effect.
3.2. Data Preprocessing
The downloaded patent datasets contain many duplicate patent texts and inconsistent text formats. Moreover, patent text includes many function words, pronouns, verbs, and nouns without substantive meaning. Therefore, this paper uses the HIT stop word list to construct a patent stop dictionary, which consists of publicly available, frequently used dictionaries as well as descriptive words specific to patents. In addition, patent-related stop words are added based on the patent datasets, such as "including", "disclosed", "applicable to", "present invention", and "combined". Meanwhile, the Jieba word segmentation tool is applied to the Chinese patents for fine-grained segmentation and predicate phrase annotation, and the results serve as inputs to the LSTM layer.
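A minimal preprocessing sketch along these lines is shown below; it assumes a plain-text stopword file (one word per line) built from the HIT list plus the patent-specific terms, and the file name and example sentence are illustrative rather than taken from the paper.

```python
import jieba.posseg as pseg

# Load the combined stopword dictionary (HIT stopwords plus patent-specific
# terms such as "包括", "公开", "本发明"); the path and file format are assumptions.
with open("patent_stopwords.txt", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f if line.strip()}

def preprocess(text: str):
    """Segment a Chinese patent sentence with Jieba and drop stopwords.

    Returns (tokens, pos_tags), which can be fed to the downstream
    encoding / LSTM layers.
    """
    tokens, pos_tags = [], []
    for word, flag in pseg.cut(text):          # fine-grained segmentation with POS tags
        word = word.strip()
        if not word or word in stopwords:      # remove stopwords and empty tokens
            continue
        tokens.append(word)
        pos_tags.append(flag)
    return tokens, pos_tags

# Example usage on a fragment of a patent description.
tokens, tags = preprocess("本发明公开了一种基于深度学习的专利关键词抽取方法。")
print(tokens)
print(tags)
```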
3.3. Corpus Classification Model Based on Multi-Level Feature Fusion
The traditional patent keyword extraction method mainly uses short text content such as abstracts and titles for keyword extraction and classification. Unlike traditional keyword extraction methods, the proposed algorithm first divides the corpus of the patent specification into four aspects, namely technical field, technical problem, technical solution, and technical effect, in preparation for the subsequent keyword extraction. Therefore, PKECC adopts a hierarchical attention mechanism to process patent classification, as shown in Figure 1.

As shown in Figure 1, the proposed mechanism consists of five components: a multi-level Bert encoding layer, a word-level self-attention layer, a sentence-level self-attention layer, a paragraph-level self-attention layer, and a classification layer. First, the multi-level Bert encoding layer divides the input text into paragraphs, sentences, and words and encodes them with the Bert model. Then, the word-level, sentence-level, and paragraph-level self-attention layers encode the words, sentences, and paragraphs, respectively; each applies a forward and backward GRU for feature extraction and a self-attention mechanism for better classification performance. Finally, the classification layer uses the softmax function to classify each sentence into technical field, technical problem, technical solution, or technical effect.
3.3.1. Multi-Level Bert Encoding Layer
Patent specifications contain a large quantity of long sentences and textual information, and traditional attention layers cannot effectively classify long texts. Therefore, the multi-level Bert encoding layer divides the entire text into K paragraphs, each paragraph consisting of L sentences and each sentence consisting of T words, where $w_{mij}$ denotes the j-th word of the i-th sentence in the m-th paragraph. The layer is accordingly divided into a word-encoding layer, a sentence-encoding layer, and a paragraph-encoding layer. This method divides the document into three hierarchical levels, namely words, sentences, and paragraphs, and applies the Bert model to each level in turn, so that the entire document can be divided into multiple parts to improve the accuracy of corpus classification.
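A sketch of this paragraph/sentence/word division and Bert encoding is given below, using the Hugging Face transformers library with the bert-base-chinese checkpoint; the checkpoint choice, splitting heuristics, and mean pooling are assumptions for illustration, not the paper's exact configuration.

```python
import re
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def split_document(text: str):
    """Divide a description into K paragraphs, each a list of sentences."""
    paragraphs = [p for p in text.split("\n") if p.strip()]
    return [re.split(r"(?<=[。！？；])", p) for p in paragraphs]

@torch.no_grad()
def encode_sentence(sentence: str) -> torch.Tensor:
    """Word-level Bert encoding: one contextual vector per token."""
    inputs = tokenizer(sentence, return_tensors="pt",
                       truncation=True, max_length=128)
    outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)      # (T, 768) word vectors

def encode_paragraphs(paragraphs):
    """Sentence vectors are taken here as the mean of their word vectors;
    this is a simplification of the multi-level self-attention layers
    described in the next subsection."""
    encoded = []
    for sentences in paragraphs:
        sent_vecs = [encode_sentence(s).mean(dim=0)
                     for s in sentences if s.strip()]
        if sent_vecs:
            encoded.append(torch.stack(sent_vecs))   # (L, 768) per paragraph
    return encoded
```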
3.3.2. Multi-Level Self-Attention Layer
In order to extract effective features from the multi-level Bert encoding layer, the multi-level attention fusion mechanism applies a self-attention mechanism to the word-encoding layer, sentence-encoding layer, and paragraph-encoding layer, respectively. A BiGRU model then performs forward and backward learning and concatenation, effectively separating the technical field, technical problem, technical solution, and technical effect in the document. To alleviate the vanishing gradient problem of RNNs, the GRU simplifies the LSTM structure by merging its three gating units into two gating units: the update gate and the reset gate. The update gate $z_t$ controls how much of the state information at time $t-1$ enters the state at time $t$, and the reset gate $r_t$ controls how much of the previous hidden state is used when computing the candidate state. The two gates are computed as shown in Equations (1) and (2):

$$ r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r) \tag{1} $$

$$ z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z) \tag{2} $$

where $W_{xr}$ and $W_{xz}$ are the weight matrices between the input $x_t$ and the reset and update gates, respectively; $W_{hr}$ and $W_{hz}$ are the weight matrices between $h_{t-1}$ and the reset and update gates, respectively; $b_r$ and $b_z$ are the corresponding bias terms; and $\sigma$ is the sigmoid function, which maps the result into [0, 1]. The value of the update gate reflects how much state information is carried over from the previous moment, while the reset gate reflects how much of the previous state is written into the candidate state. Furthermore, the candidate state $\tilde{h}_t$ and the hidden state $h_t$ of the current node are obtained by Equations (3) and (4):

$$ \tilde{h}_t = \tanh\!\left(w_h x_t + u_h (r_t \circ h_{t-1}) + b_h\right) \tag{3} $$

$$ h_t = (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \tag{4} $$

where $\circ$ denotes the Hadamard product, $w_h$ and $u_h$ are weight matrices, and $b_h$ is the corresponding bias; the activation function tanh scales the result into [−1, 1].
BiGRU processes the vectorized semantic features of the text in both chronological and reverse order on the basis of GRU and concatenates the two GRU outputs for each word into the final output. At the current time $t$, the forward and backward hidden-layer states are computed as in Equations (5) and (6):

$$ \overrightarrow{h}_t = \mathrm{GRU}\!\left(x_t, \overrightarrow{h}_{t-1}\right) \tag{5} $$

$$ \overleftarrow{h}_t = \mathrm{GRU}\!\left(x_t, \overleftarrow{h}_{t+1}\right) \tag{6} $$

Finally, the concatenation $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ at time $t$ yields the final hidden-layer state.
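A minimal PyTorch sketch of one level of this BiGRU plus self-attention encoder is given below; the hidden sizes, the single-head attention, and the mean pooling are illustrative assumptions standing in for the paper's exact attention-weighted aggregation.

```python
import torch
import torch.nn as nn

class BiGRUSelfAttention(nn.Module):
    """One level of the multi-level self-attention layer: a BiGRU followed
    by self-attention over its hidden states, pooled into a single vector."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.bigru = nn.GRU(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                          num_heads=1, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_dim) word / sentence / paragraph vectors
        h, _ = self.bigru(x)                      # (batch, seq_len, 2*hidden_dim)
        attended, _ = self.attn(h, h, h)          # self-attention over hidden states
        return attended.mean(dim=1)               # pooled vector for the next level

# Stacking three such levels and a softmax classifier mirrors the
# word -> sentence -> paragraph hierarchy described above.
word_level = BiGRUSelfAttention(input_dim=768)       # consumes Bert word vectors
sentence_level = BiGRUSelfAttention(input_dim=512)   # consumes pooled word-level outputs
paragraph_level = BiGRUSelfAttention(input_dim=512)  # consumes pooled sentence-level outputs
classifier = nn.Linear(512, 4)   # technical field / problem / solution / effect
```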
3.4. A Keyword Extraction Method Based on BiLSTM and CRF Fusion
In order to improve the accuracy of extracting keywords from sentences, recurrent neural networks, represented by LSTM, have been continuously applied to keyword extraction by exploiting the temporal relationships among the words within a sentence. The keyword extraction method based on the fusion of BiLSTM and CRF mainly uses BiLSTM to extract features from the four categories of sentences, namely technical field, technical problem, technical solution, and technical effect. Then, a CRF (conditional random field) is used to model the intrinsic relationships between labels and select suitable keywords.
As shown in Figure 2, the method is roughly divided into four layers: an embedding layer, a BiLSTM layer, a CRF layer, and a decoding layer. The embedding layer directly uses the sentence vectors obtained from the multi-level Bert encoding layer as the model input. The BiLSTM layer performs forward and backward LSTM learning on the encoded word vectors to obtain label vectors at each position, thereby identifying the dependencies between words; at the same time, keyword labels are used for supervised training of the BiLSTM so that it learns the rules of keyword recognition. The CRF layer scores the possible label sequences based on the label vectors and identifies the path with the highest probability. Finally, the decoding layer decodes this path to obtain the keywords for each category. The CRF predicts the keyword labels in the sequence based on transition and emission probabilities, thereby capturing the dependencies between labels and improving the accuracy of keyword recognition.
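A hedged sketch of such a BiLSTM-CRF tagger in PyTorch is shown below, using the pytorch-crf package for the CRF layer; the BIO tag set, the dimensions, and the class interface are assumptions for illustration rather than the paper's exact settings.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pip install pytorch-crf

class BiLSTMCRF(nn.Module):
    """BiLSTM produces per-token emission scores; the CRF models label transitions."""

    def __init__(self, input_dim: int = 768, hidden_dim: int = 256,
                 num_tags: int = 3):           # e.g. BIO tags: B-KEY, I-KEY, O
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.emission = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.bilstm(x)                  # (batch, seq_len, 2*hidden_dim)
        return self.emission(h)                # (batch, seq_len, num_tags)

    def loss(self, x, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(x), tags, mask=mask, reduction="mean")

    def decode(self, x, mask):
        # Viterbi decoding: the highest-probability tag path for each sentence.
        return self.crf.decode(self._emissions(x), mask=mask)
```

Contiguous B-KEY/I-KEY spans in the decoded path are then mapped back to surface words to yield the keywords for each of the four sentence categories.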
3.5. Framework of the Proposed Method
The whole procedure is divided into four phases. First, data collection gathers the patent datasets and annotates the related keywords, while data preprocessing standardizes the patent text and removes stop words. Second, the corpus classification model based on multi-level feature fusion divides the content of the patent specification into words, sentences, and paragraphs for multi-level classification and uses Bert for encoding. Third, the model uses the self-attention mechanism and the BiGRU network to learn features at the three levels, dividing each sentence into technical field, technical problem, technical solution, or technical effect. Finally, the keyword extraction method based on the fusion of BiLSTM and CRF extracts the corresponding keywords from the four types of statements and outputs the classified keywords.
5. Conclusions
Most existing keyword extraction approaches extract keywords only from short texts such as abstracts or titles and ignore the expression of the whole patent. Therefore, this paper simulates the way human patent examiners read a patent and divides the description of a patent document into four aspects: technical field, technical problem, technical solution, and technical effect. Before keyword extraction, the proposed corpus classification model based on multi-level feature fusion adopts multi-level Bert encoding layers and multi-level self-attention layers over words, sentences, and paragraphs to divide the sentences into the above aspects. Based on this division, a BiLSTM-CRF algorithm built on the Bert corpus classification is proposed to extract keywords from the four aspects. The experimental results validate that the proposed mechanism improves the accuracy of keyword extraction and performs better when the number of categories is low. In addition, PKECC simplifies the processing of long texts and extracts more related keywords, with a recall of 81.75%, an accuracy of 84%, and an F1 score of 84%.
Although PKECC achieves good performance in patent keyword extraction, some problems remain to be solved. Firstly, the generalization ability of the proposed model should be improved. The performance of PKECC relies on a large number of labeled training samples, but the limited quantity and the imbalance of professionally annotated datasets tend to limit keyword extraction; future keyword extraction approaches should therefore expand the training sets and address dataset imbalance to ensure better generalization. Secondly, stronger keyword extraction methods should be introduced. The proposed PKECC performs worse when the number of categories is high, yet patent keyword extraction datasets tend to have many categories, so it is necessary to incorporate stronger keyword extraction methods for multi-class datasets. Thirdly, large-scale patent annotation datasets should be considered. Training deep learning models requires a large number of annotated patent datasets, and manual annotation standards differ significantly between annotators. However, existing studies tend to lack standardized annotated patent datasets, which makes fair comparison difficult. Therefore, standardized annotated patent datasets are a promising direction for development.