3.2. Reliable Multi-View Deep Patent Classification
Given a set of multi-view patent inputs $\{\{x_i^v\}_{v=1}^m\}_{i=1}^n$, where $n$ is the number of samples and $m$ is the number of patent views, RMDPC first utilizes a backbone network parameterized by $\theta$ to extract features from each view of the input, yielding the patent feature embeddings $\{\mathbf{h}_i^v\}_{v=1}^m$. Note that the backbone can be any of various neural networks without a softmax layer. In this paper, given its strength in processing text, we use a BERT neural network as the backbone to extract features from the multi-view patent information. The patent feature embeddings are then fed into the evidential head.
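For concreteness, the sketch below illustrates this feature-extraction step with the Hugging Face transformers library; the checkpoint name and the use of the [CLS] token as the view embedding are illustrative assumptions, not choices fixed by the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative feature extraction for one patent view. The checkpoint name
# and the use of the [CLS] token as the view embedding are assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

text = "A method for manufacturing a semiconductor device ..."  # e.g., an abstract view
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = bert(**inputs)
h = outputs.last_hidden_state[:, 0]  # (1, 768) [CLS] embedding of this view
```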
In contrast to existing deep learning-based models, which typically place a softmax layer on top of deep neural networks (DNNs) for classification and thus tend to produce over-confident outputs even for false predictions [38], the evidential head introduces the evidence framework of the Dempster–Shafer theory (DST) and subjective logic (SL) [36] to overcome the limitations of softmax-based DNNs. It provides a simple and efficient way to jointly formulate multi-class classification and uncertainty modeling through a minor change: the softmax layer is replaced with a non-negative activation layer (e.g., a ReLU layer), whose output is termed evidence [14].
Formally, for the $K$-classification task, the feature embedding $\mathbf{h}_i^v$ from the backbone is transformed into evidence $\mathbf{e}_i^v$ by the evidential head for each patent view in terms of the following equation:

$$\mathbf{e}_i^v = g^v(\mathbf{h}_i^v), \qquad (1)$$

where $\mathbf{h}_i^v = f^v(x_i^v; \theta^v)$ and $g^v(\cdot)$ is the $v$-th evidential function (e.g., ReLU) that keeps the evidence $\mathbf{e}_i^v$ non-negative. In particular, we assume that the class probability $\mathbf{p}_i^v$ follows a prior Dirichlet distribution $D(\mathbf{p}_i^v \mid \boldsymbol{\alpha}_i^v)$, which is parameterized by $\boldsymbol{\alpha}_i^v = \mathbf{e}_i^v + \mathbf{1}$ and given by

$$D(\mathbf{p}_i^v \mid \boldsymbol{\alpha}_i^v) = \begin{cases} \dfrac{1}{B(\boldsymbol{\alpha}_i^v)} \prod_{k=1}^{K} (p_{ik}^v)^{\alpha_{ik}^v - 1}, & \mathbf{p}_i^v \in \mathcal{S}_K, \\ 0, & \text{otherwise}, \end{cases} \qquad (2)$$

where $\mathcal{S}_K = \{\mathbf{p} \mid \sum_{k=1}^{K} p_k = 1,\ 0 \le p_k \le 1\}$ is the $K$-dimensional unit simplex and $B(\boldsymbol{\alpha}_i^v)$ is the $K$-dimensional multinomial beta function.
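As an illustration of Equation (1), the following PyTorch sketch implements one evidential head; the module name, the hidden size of 768 (BERT-base), and the choice of ReLU as the evidential function $g^v(\cdot)$ are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class EvidentialHead(nn.Module):
    """Sketch of one evidential head (Equation (1)): a linear layer followed
    by ReLU as g^v(.), so the produced evidence is non-negative."""

    def __init__(self, hidden_dim: int = 768, num_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(h))  # evidence e_i^v >= 0, shape (batch, K)

head = EvidentialHead(hidden_dim=768, num_classes=10)
h = torch.randn(4, 768)   # stand-in for BERT [CLS] embeddings of one view
evidence = head(h)
alpha = evidence + 1.0    # Dirichlet parameters alpha_i^v = e_i^v + 1
```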
Based on the DST and SL theories, the belief mass $b_{ik}^v$ and the uncertainty $u_i^v$ are linked to the learned evidence by the equality

$$b_{ik}^v = \frac{e_{ik}^v}{S_i^v}, \qquad u_i^v = \frac{K}{S_i^v}, \qquad (3)$$

where $S_i^v = \sum_{k=1}^{K} \alpha_{ik}^v$ is the Dirichlet strength. At inference, the predicted probability of the $k$-th class is calculated as $\hat{p}_{ik}^v = \alpha_{ik}^v / S_i^v$. We then use the predictive uncertainty $u_i^v$ to represent the vacuity of evidence for each patent view. The evidential head uses the Dirichlet distribution parameterized over the evidence to represent the density of such probability assignments and treats the predictions of the learner as a distribution over the possible softmax outputs; this models the second-order probabilities that indicate the uncertainty of the neural network's results, enabling the model to "know the unknown".
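These subjective-logic quantities reduce to a few tensor operations; a minimal sketch follows (the function name is ours).

```python
import torch

def subjective_opinion(evidence: torch.Tensor):
    """Per-view subjective-logic quantities from evidence of shape (batch, K):
    belief b_k = e_k / S, expected probability p_k = alpha_k / S, and
    vacuity-based uncertainty u = K / S, with S the Dirichlet strength."""
    K = evidence.shape[-1]
    alpha = evidence + 1.0
    S = alpha.sum(dim=-1, keepdim=True)  # Dirichlet strength S = sum_k alpha_k
    belief = evidence / S
    probs = alpha / S
    uncertainty = K / S                  # close to 1 when evidence is scarce
    return belief, probs, uncertainty
```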
The evidence from a single patent view is collected by each evidential head. We now turn to patent classification with multiple patent views. Given a set of $m$ evidence vectors $\{\mathbf{e}_i^v\}_{v=1}^m$ for the $i$-th sample, we devise a simple and efficient aggregation strategy for multi-view deep patent classification based on evidence theory, as given in Definition 1:
Definition 1 (Evidence-theory-based aggregation strategy for multi-view deep patent classification). The aggregation strategy for multi-view deep patent classification with evidence theory simply consists of evidence parameter addition. Given the $i$-th sample with $m$ views for the $K$-classification task, we obtain a set of evidence vectors $\{\mathbf{e}_i^v\}_{v=1}^m$ collected from the $m$ evidential heads. For $k = 1, \dots, K$, we have $e_{ik} = \sum_{v=1}^{m} e_{ik}^v$, which represents the process of multi-view aggregation, and $S_i = \sum_{k=1}^{K} (e_{ik} + 1)$, which is the aggregated Dirichlet strength.
Following Definition 1, we combine the evidence from the multiple patent views into the aggregated representation $\boldsymbol{\alpha}_i$, where $\alpha_{ik} = e_{ik} + 1$. Then, the final probability of each patent class is produced as $\hat{p}_{ik} = \alpha_{ik} / S_i$, and the aggregated predictive uncertainty is $u_i = K / S_i$. After obtaining the aggregated representation $\boldsymbol{\alpha}_i$, we discuss how to train our reliable multi-view deep patent classification model.
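Definition 1 amounts to an element-wise sum of the per-view evidence; a minimal sketch (names are ours):

```python
from typing import List
import torch

def aggregate_views(evidences: List[torch.Tensor]):
    """Definition 1: element-wise sum of per-view evidence, followed by the
    aggregated Dirichlet parameters, class probabilities, and uncertainty."""
    e = torch.stack(evidences, dim=0).sum(dim=0)  # e_ik = sum_v e_ik^v
    alpha = e + 1.0                               # alpha_ik = e_ik + 1
    S = alpha.sum(dim=-1, keepdim=True)           # aggregated Dirichlet strength
    probs = alpha / S                             # hat{p}_ik = alpha_ik / S_i
    u = alpha.shape[-1] / S                       # u_i = K / S_i
    return alpha, probs, u
```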
For the classification task, the loss function aims to minimize the generalization risk $R(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(f(x), y)]$, where $f(x)$ indicates the predictive class probability, $y$ represents the ground truth, and $\ell(\cdot, \cdot)$ refers to a certain loss function, e.g., the mean squared error or the cross-entropy loss. However, the generalization risk is hard to compute because the data distribution $\mathcal{D}$ is unknown. The most common approach is to approximate the generalization risk by minimizing the empirical risk on the labeled data, i.e., $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$. In this work, we adopt this approach with a cross-entropy loss function.
For conventional neural networks, the cross-entropy loss is usually employed as

$$\mathcal{L}_{ce} = -\sum_{j=1}^{K} y_{ij} \log p_{ij}, \qquad (4)$$

where $y_{ij}$ is the true label and $p_{ij}$ is the predicted probability of the $i$-th sample for class $j$. Within our model, given the evidence $\{\mathbf{e}_i^v\}_{v=1}^m$ of the $i$-th sample with $m$ views obtained from the evidential heads, we obtain the overall evidence $\mathbf{e}_i$ after the fusion of the evidence from the multiple patent data views. Then, we obtain the parameters $\boldsymbol{\alpha}_i$ (i.e., $\alpha_{ij} = e_{ij} + 1$) and the corresponding class probabilities $\mathbf{p}_i$ (i.e., $p_{ij} = \alpha_{ij} / S_i$) of the Dirichlet distribution $D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i)$. Considering $D(\mathbf{p}_i \mid \boldsymbol{\alpha}_i)$ as a prior on the likelihood, we modify Equation (4) into the following form:

$$\mathcal{L}_{ace}(\boldsymbol{\alpha}_i) = \int \Big[ -\sum_{j=1}^{K} y_{ij} \log p_{ij} \Big] \frac{1}{B(\boldsymbol{\alpha}_i)} \prod_{j=1}^{K} p_{ij}^{\alpha_{ij} - 1}\, d\mathbf{p}_i = \sum_{j=1}^{K} y_{ij} \big( \psi(S_i) - \psi(\alpha_{ij}) \big), \qquad (5)$$

where $\psi(\cdot)$ is the digamma function and $B(\cdot)$ is the $K$-dimensional multinomial beta function. The modified cross-entropy loss $\mathcal{L}_{ace}$ aims to encourage the evidence of the ground-truth category to reach a large value; however, it cannot guarantee that less evidence will be generated for the incorrect categories.
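Since the right-hand side of Equation (5) is a closed form in the digamma function, the adapted cross-entropy can be implemented directly; a sketch assuming one-hot labels:

```python
import torch

def ace_loss(alpha: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
    """Adapted cross-entropy of Equation (5), averaged over the batch:
    sum_j y_ij * (digamma(S_i) - digamma(alpha_ij))."""
    S = alpha.sum(dim=-1, keepdim=True)
    per_sample = (y_onehot * (torch.digamma(S) - torch.digamma(alpha))).sum(dim=-1)
    return per_sample.mean()
```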
To address this limitation, we introduce a Kullback–Leibler (KL) divergence term into our loss function that shrinks the evidence of the incorrect categories to zero; it regularizes the predictive distribution by penalizing those divergences from the "I do not know" state that do not contribute to data fitting. The KL divergence term is as follows:

$$\mathcal{L}_{KL}(\boldsymbol{\alpha}_i) = KL\big[ D(\mathbf{p}_i \mid \tilde{\boldsymbol{\alpha}}_i) \,\big\|\, D(\mathbf{p}_i \mid \mathbf{1}) \big], \qquad (6)$$

where $\tilde{\boldsymbol{\alpha}}_i = \mathbf{y}_i + (1 - \mathbf{y}_i) \odot \boldsymbol{\alpha}_i$ denotes the adjusted Dirichlet parameters, which remove the evidence of the ground-truth class so that it is not penalized, and the divergence is calculated as

$$KL\big[ D(\mathbf{p}_i \mid \tilde{\boldsymbol{\alpha}}_i) \,\big\|\, D(\mathbf{p}_i \mid \mathbf{1}) \big] = \log \left( \frac{\Gamma\big( \sum_{k=1}^{K} \tilde{\alpha}_{ik} \big)}{\Gamma(K) \prod_{k=1}^{K} \Gamma(\tilde{\alpha}_{ik})} \right) + \sum_{k=1}^{K} (\tilde{\alpha}_{ik} - 1) \Big[ \psi(\tilde{\alpha}_{ik}) - \psi\Big( \sum_{j=1}^{K} \tilde{\alpha}_{ij} \Big) \Big], \qquad (7)$$

where $\Gamma(\cdot)$ is the gamma function, $\psi(\cdot)$ is the digamma function, and $D(\mathbf{p}_i \mid \mathbf{1})$ denotes the uniform Dirichlet distribution.
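Equation (7) involves only the log-gamma and digamma functions, so the KL regularizer can be sketched as follows (again assuming one-hot labels):

```python
import torch

def kl_regularizer(alpha: torch.Tensor, y_onehot: torch.Tensor) -> torch.Tensor:
    """KL[D(p | alpha_tilde) || D(p | 1)] of Equations (6)-(7), where
    alpha_tilde = y + (1 - y) * alpha removes the ground-truth evidence."""
    alpha_t = y_onehot + (1.0 - y_onehot) * alpha
    S_t = alpha_t.sum(dim=-1, keepdim=True)
    K = alpha_t.shape[-1]
    log_num = torch.lgamma(S_t.squeeze(-1))  # log Gamma(sum_k alpha_tilde_k)
    log_den = torch.lgamma(torch.tensor(float(K))) + torch.lgamma(alpha_t).sum(dim=-1)
    dg_term = ((alpha_t - 1.0) * (torch.digamma(alpha_t) - torch.digamma(S_t))).sum(dim=-1)
    return (log_num - log_den + dg_term).mean()
```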
By synthesizing the cross-entropy loss $\mathcal{L}_{ace}$ and the KL divergence regularization term $\mathcal{L}_{KL}$, the loss for the $i$-th sample is

$$\mathcal{L}(\boldsymbol{\alpha}_i) = \mathcal{L}_{ace}(\boldsymbol{\alpha}_i) + \lambda_t \mathcal{L}_{KL}(\boldsymbol{\alpha}_i), \qquad (8)$$

and the overall optimization problem of our RMDPC model is formulated as

$$\min_{\theta} \sum_{i=1}^{n} \mathcal{L}(\boldsymbol{\alpha}_i), \qquad (9)$$

where $\lambda_t \in [0, 1]$ is the balance factor at epoch $t$. In general, we treat $\lambda_t$ as a warm-up function that gradually increases during training; this prevents the model from paying too much attention to the KL divergence term $\mathcal{L}_{KL}$ in the initial stage of training and avoids premature convergence to the uniform distribution for misclassified samples, which may still be correctly classified in later epochs.
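A sketch of the combined objective of Equations (8) and (9), reusing the helpers above; the warm-up schedule $\min(1, t/T)$ is a common choice in evidential deep learning and is our assumption here, since the text only requires $\lambda_t$ to increase gradually.

```python
def rmdpc_loss(alpha, y_onehot, epoch: int, anneal_epochs: int = 10):
    """Objective of Equations (8) and (9) for one batch: adapted cross-entropy
    plus the annealed KL regularizer. min(1, t/T) is an assumed schedule."""
    lambda_t = min(1.0, epoch / anneal_epochs)  # balance factor, grows to 1
    return ace_loss(alpha, y_onehot) + lambda_t * kl_regularizer(alpha, y_onehot)
```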
Finally, we present the overall framework of the RMDPC method in Algorithm 1 to illustrate its optimization process.
Algorithm 1: Reliable Multi-view Deep Patent Classification (RMDPC)

1: /* Training */
2: Input: multi-view patent dataset $\{\{x_i^v\}_{v=1}^m, y_i\}_{i=1}^n$.
3: Initialize: the parameters $\theta$ of the neural network.
4: while not converged do
5:   for $v = 1$ to $m$ do
6:     $\mathbf{h}_i^v \leftarrow$ feature embedding from the BERT neural network;
7:     $\mathbf{e}_i^v \leftarrow$ evidence from the evidential head according to Equation (1);
8:   end for
9:   $\mathbf{e}_i \leftarrow$ aggregation in terms of Definition 1;
10:  $\boldsymbol{\alpha}_i \leftarrow \mathbf{e}_i + \mathbf{1}$;
11:  Obtain the overall loss $\mathcal{L}$ with Equation (9);
12:  Update the neural networks by gradient descent according to Equation (9);
13: end while
14: Output: the parameters $\theta$ of the neural networks.
15: /* Test */
16: Obtain the patent class probability and the corresponding uncertainty degree.
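At test time, the pieces sketched above compose into a single forward pass; the snippet below (reusing EvidentialHead and aggregate_views from the earlier sketches, with random tensors standing in for BERT embeddings) shows how the class probability and uncertainty in line 16 could be obtained.

```python
import torch

# Test-time sketch: three views, reusing EvidentialHead and aggregate_views
# from the sketches above; random tensors stand in for BERT embeddings.
m, batch, K = 3, 4, 10
heads = [EvidentialHead(hidden_dim=768, num_classes=K) for _ in range(m)]
view_feats = [torch.randn(batch, 768) for _ in range(m)]

evidences = [heads[v](view_feats[v]) for v in range(m)]  # Equation (1), per view
alpha, probs, u = aggregate_views(evidences)             # Definition 1
pred = probs.argmax(dim=-1)                              # predicted patent class
# u near 1 flags vacuous evidence: the model "knows the unknown" for that patent.
```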