In this Section, we first describe the details and features of the datasets used in the experimental assessment. We then illustrate the tools required to implement the proposed methods and the hardware used in the experiments. Next, an overview of the multilabel measures is provided, distinguishing between flat and hierarchical measures. Finally, the obtained results are shown and discussed to investigate how the proposed HDNN improves hierarchical classification when applied to the PubMed large-scale indexing XMTC problem.
4.1. Dataset
The experimental assessment has been performed on a subset of PubMed abstracts. We exploited the big data architecture described in [35] to download and extract from PubMed a collection of abstracts, including the corresponding titles and their MeSH labels. Each document is tagged with a variable number of MeSH labels, drawn from a vocabulary that currently counts more than 30,000 different headings. We also processed the labels of each document before applying HLSE (see
Section 3.2) to complete the hierarchical label set of each document. It is important to underline that the same MeSH can be located in more than one branch of the tree, making the number of tree nodes higher than the total number of MeSHs, as shown in
Table 1, obtaining 57,859 nodes after the use of the HLSE on the dataset. The automatic classification of a text collection with the above-described features is a hierarchical multiclass multilabel problem, as defined in
Section 1. The dataset has been divided into a training set and a test set, respectively, randomly selecting
and
of the documents.
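The hierarchical label-set completion step mentioned above can be sketched as follows. This is a minimal illustration assuming MeSH tree numbers in their standard dotted form, with `complete_label_set` as a hypothetical helper name (the actual HLSE implementation is the one described in Section 3.2):

```python
def complete_label_set(tree_numbers):
    """Add every ancestor of each MeSH tree number to the label set,
    so that, e.g., C14.280.434 also yields C14.280 and C14."""
    completed = set()
    for tn in tree_numbers:
        parts = tn.split(".")
        # Every dotted prefix of a tree number is an ancestor node.
        for k in range(1, len(parts) + 1):
            completed.add(".".join(parts[:k]))
    return completed
```

Because ancestors shared by several labels are added only once, the completed set grows sublinearly with the number of original labels of a document.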
To simplify this complex problem, we decided to consider a maximum tree depth equal to five, substituting all the MeSHs that belong to a deeper level of the original hierarchy with the corresponding ancestor of the fifth level. In this way, we obtained 23,255 different tree nodes, distributed as shown in
Figure 4, where it is possible to observe that a very high number of nodes have few children and very few nodes have more than 90 children. We evaluated the average number of children per node to be 5.18, with a standard deviation of 7.30.
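The depth-five truncation can be illustrated with a short sketch; we assume MeSH tree numbers in dotted form, where the number of dot-separated components equals the depth of the node (the function name is ours, not taken from the paper's code):

```python
def truncate_to_depth(tree_number, max_depth=5):
    """Replace a MeSH tree number deeper than max_depth with its
    ancestor at max_depth (one dotted component per hierarchy level)."""
    parts = tree_number.split(".")
    return ".".join(parts[:max_depth])
```

Applying this mapping to every label and deduplicating the results yields the reduced node set.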
Table 1 summarizes the details of the datasets, showing the number of nodes, the average number of labels per document, and the average number of documents per label, comparing the dataset truncated to a tree depth of five with the original dataset containing all levels, with the HLSE applied in both cases.
The automatic classification of PubMed articles is included in the tasks of BioASQ (
http://bioasq.org/ (accessed on 5 December 2023)), a research challenge on biomedical semantic indexing and question answering. A discussion of the experimental results can provide further details for improving the BioASQ results [
13] and solving large XMTC problems.
4.3. Tools and Hardware
Focusing on the tools used for the implementation and execution of the experimental assessment, the preprocessing phase (tokenization and sentence splitting) was performed using Apache Spark and Stanford CoreNLP wrappers for Spark (
https://github.com/databricks/spark-corenlp (accessed on 5 December 2023)). The obtained NLP-preprocessed text is the input of the word embedding layers, which convert the words into a vectorial representation. For this purpose, we used the Gensim Python library [
36] version 4.2 in the case of static word embeddings, while we adopted the Hugging Face Python library (
https://huggingface.co/ (accessed on 5 December 2023)) to obtain dynamic embeddings from the BioBERT model, providing a whole sentence as input.
The proposed HLSE and HDNN algorithms were also implemented in Python. In detail, we exploited Keras (
https://keras.io (accessed on 5 December 2023)) version 2.11, with a TensorFlow 2 backend, for the definition, training, and testing of the HDNN. Moreover, the implementation requires the pyTree library, which provides a list-derived tree structure for Python, used to represent the label hierarchy. Finally, the NumPy, Pandas, and Pickle libraries are required to run the code. We underline that the code is publicly available on the SoBigData research infrastructure (
https://data.d4science.net/RgXa (accessed on 5 December 2023)).
All the experiments were performed on a deep learning cluster based on the IBM Power9 architecture, where each node is equipped with 2 Power9 CPUs at 3.7 GHz, 512 GB of RAM, and 4 Nvidia Tesla V100 GPUs with 16 GB of dedicated VRAM each. The operating system is Red Hat Enterprise Linux Server release 7.6. The current implementation of the HDNN runs on a single node and a single GPU (a parallel implementation is not yet available).
4.4. Evaluation Metrics
Different evaluation measures for multilabel text classification have been proposed in the literature. These measures can be grouped into two classes: flat and hierarchical [37]. The flat measures evaluate the result obtained for each label in terms of Precision $P_i$, Recall $R_i$, and F1 score $F1_i$ for each class $i$:

$$P_i = \frac{TP_i}{TP_i + FP_i}, \qquad R_i = \frac{TP_i}{TP_i + FN_i}, \qquad F1_i = \frac{2 \cdot P_i \cdot R_i}{P_i + R_i},$$

where $TP_i$, $FP_i$, and $FN_i$ are, respectively, the true positives, false positives, and false negatives of the class $i$. These values are then averaged, obtaining the measures for the text classification task. The microaverage, which gives equal weight to each per-document classification result [38], is obtained by summing each member of the fractions in the previous equations over all labels, obtaining the micro-Precision $MiP$, micro-Recall $MiR$, and micro-F1 $MiF$, defined as

$$MiP = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} (TP_i + FP_i)}, \qquad MiR = \frac{\sum_{i=1}^{L} TP_i}{\sum_{i=1}^{L} (TP_i + FN_i)}, \qquad MiF = \frac{2 \cdot MiP \cdot MiR}{MiP + MiR},$$

with $L$ equal to the total number of classes.
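A minimal sketch of the micro-averaged measures described above, assuming per-class counts of true positives, false positives, and false negatives are already available (an illustrative implementation, not the paper's code):

```python
def micro_prf(tp, fp, fn):
    """Micro-averaged Precision, Recall, and F1.
    tp, fp, fn: per-class true positives, false positives, false negatives."""
    TP, FP, FN = sum(tp), sum(fp), sum(fn)
    mip = TP / (TP + FP)              # micro-Precision
    mir = TP / (TP + FN)              # micro-Recall
    mif = 2 * mip * mir / (mip + mir) # micro-F1
    return mip, mir, mif
```

Summing the counts before dividing is what gives each individual decision, rather than each class, equal weight.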
To complete the flat-measure-based evaluation, the Accuracy $A_i$ must also be considered, which is calculated as

$$A_i = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i},$$

where $TN_i$ are the true negatives for the class $i$.
Other flat measures are the example-based metrics, which evaluate bipartitions computed from the average differences between the true and the predicted label sets over all examples of the evaluation dataset. Given a dataset with $M$ multilabel examples, let $Y_i$ be the set of true labels and $Z_i$ the set of predicted labels for the $i$th example, for $i = 1, \dots, M$. The example-based Precision, Recall, and F1 score, respectively indicated as $EBP$, $EBR$, and $EBF$, are defined as

$$EBP = \frac{1}{M} \sum_{i=1}^{M} \frac{|Y_i \cap Z_i|}{|Z_i|}, \qquad EBR = \frac{1}{M} \sum_{i=1}^{M} \frac{|Y_i \cap Z_i|}{|Y_i|}, \qquad EBF = \frac{2 \cdot EBP \cdot EBR}{EBP + EBR}.$$

The $EBP$ metric estimates how many of the predicted labels for the $i$th example are correct, while the $EBR$ estimates how many of the true labels are correctly retrieved; $EBF$ combines both measures for a global evaluation.
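The example-based measures can be sketched as follows, taking the true and predicted label sets per example as Python sets (an illustrative implementation, not the paper's code):

```python
def example_based_prf(true_sets, pred_sets):
    """Example-based Precision, Recall, and F1 averaged over M examples.
    true_sets[i] and pred_sets[i] are the label sets of the i-th example."""
    M = len(true_sets)
    ebp = sum(len(Y & Z) / len(Z) for Y, Z in zip(true_sets, pred_sets)) / M
    ebr = sum(len(Y & Z) / len(Y) for Y, Z in zip(true_sets, pred_sets)) / M
    ebf = 2 * ebp * ebr / (ebp + ebr)  # harmonic mean of EBP and EBR
    return ebp, ebr, ebf
```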
Furthermore, when the classification problem has a hierarchically organized label set, it is also possible to adopt hierarchical measures that consider the label hierarchy in the classification results, taking into account the full path of the concepts in the class hierarchy and measuring whether parts of the path have been correctly classified. The definition of the hierarchical measures is based on the augmented sets of the true and predicted classes, respectively, $Y_{aug}$ and $Z_{aug}$, which are equal to

$$Y_{aug} = Y \cup An(y_1) \cup \dots \cup An(y_{|Y|}), \qquad Z_{aug} = Z \cup An(z_1) \cup \dots \cup An(z_{|Z|}),$$

where $An(y_i)$ and $An(z_j)$ are the sets of ancestors of the true and predicted classes. Starting from them, it is possible to define the hierarchical Precision, Recall, and F1 score (called $HiP$, $HiR$, and $HiF$). These measures consider the classification result of a single document as a subtree path in the hierarchy:

$$HiP = \frac{|Y_{aug} \cap Z_{aug}|}{|Z_{aug}|}, \qquad HiR = \frac{|Y_{aug} \cap Z_{aug}|}{|Y_{aug}|}, \qquad HiF = \frac{2 \cdot HiP \cdot HiR}{HiP + HiR}.$$

The sets $Y_{aug}$ and $Z_{aug}$ identify two subtrees, representing, respectively, the path of the true classifications and the path of the predicted classifications. The intersection of these two subtree paths measures their similarity. Therefore, $HiP$ is the ratio between the subtree intersection and the predicted classification paths, and $HiR$ is the ratio between the intersection and the true classification paths.
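The ancestor-augmented hierarchical measures can be sketched as follows, assuming the hierarchy is given as a mapping from each label to the set of its ancestors (all names are illustrative):

```python
def hier_prf(true_labels, pred_labels, ancestors):
    """Hierarchical Precision, Recall, and F1: each label set is augmented
    with all the ancestors of its labels before comparing the two sets."""
    def augment(labels):
        aug = set(labels)
        for y in labels:
            aug |= ancestors.get(y, set())
        return aug
    y_aug, z_aug = augment(true_labels), augment(pred_labels)
    inter = len(y_aug & z_aug)
    hip = inter / len(z_aug)  # hierarchical Precision
    hir = inter / len(y_aug)  # hierarchical Recall
    hif = 2 * hip * hir / (hip + hir) if hip + hir else 0.0
    return hip, hir, hif
```

Note that a prediction that is an ancestor of a true label receives partial credit, since the two augmented sets share the common ancestors.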
Adding all the ancestors to the sets, as in the previous case, overestimates the error if the hierarchy tree has nodes with many ancestors. To account for this behavior, the LCA-based measures [13] were defined. These measures are based on the Lowest Common Ancestor (LCA) concept from graph theory [39]. The $LCA$ of two nodes, $u$ and $v$, of a tree $T$ is the node of $T$ furthest from the root that is an ancestor of both $u$ and $v$. To obtain the LCA-based measures, the LCA with respect to the set of predicted classes $Z$ for each class $y$ in the set of true classes $Y$, called $LCA(y, Z)$, and the LCA with respect to the set of true classes $Y$ for each class $z$ in the set of predicted classes $Z$, named $LCA(z, Y)$, must be calculated as

$$LCA(y, Z) = \operatorname*{arg\,min}_{m \,\in\, \{LCA(y, z)\, :\, z \in Z\}} dist(y, m), \qquad LCA(z, Y) = \operatorname*{arg\,min}_{m \,\in\, \{LCA(z, y)\, :\, y \in Y\}} dist(z, m),$$

where $dist(u, v)$ is the distance between the nodes $u$ and $v$ in the tree. Now, it is possible to define two graphs $G_t$ and $G_p$, respectively formed by the shortest paths from each $y \in Y$ to $LCA(y, Z)$ and the shortest paths from each $z \in Z$ to $LCA(z, Y)$. Then, the set $Y_{aug}^{L}$, formed by all the nodes in the graph $G_t$, and the set $Z_{aug}^{L}$, which contains all the nodes in the graph $G_p$, must be constructed. Finally, the LCA Precision, Recall, and F1 score (LCA-P, LCA-R, and LCA-F) are defined as

$$\text{LCA-P} = \frac{|Y_{aug}^{L} \cap Z_{aug}^{L}|}{|Z_{aug}^{L}|}, \qquad \text{LCA-R} = \frac{|Y_{aug}^{L} \cap Z_{aug}^{L}|}{|Y_{aug}^{L}|}, \qquad \text{LCA-F} = \frac{2 \cdot \text{LCA-P} \cdot \text{LCA-R}}{\text{LCA-P} + \text{LCA-R}}.$$
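A simplified sketch of the LCA-based measures for a single-parent tree (the full definition in [13] also covers labels with multiple parents; here `parent` maps each node to its parent, and all names are illustrative):

```python
def path_to_root(node, parent):
    """Nodes from `node` up to the root, following parent pointers."""
    chain = [node]
    while node in parent:
        node = parent[node]
        chain.append(node)
    return chain

def lca(u, v, parent):
    """Lowest common ancestor: the deepest node ancestral to both u and v."""
    on_v_path = set(path_to_root(v, parent))
    return next(n for n in path_to_root(u, parent) if n in on_v_path)

def lca_path(y, others, parent):
    """Nodes on the path from y up to the LCA of y with the set `others`
    that lies closest to y (fewest edges from y)."""
    up = path_to_root(y, parent)
    best = min(up.index(lca(y, z, parent)) for z in others)
    return set(up[:best + 1])

def lca_prf(true_set, pred_set, parent):
    """LCA-based Precision, Recall, and F1 over the augmented node sets."""
    y_aug = set().union(*(lca_path(y, pred_set, parent) for y in true_set))
    z_aug = set().union(*(lca_path(z, true_set, parent) for z in pred_set))
    inter = len(y_aug & z_aug)
    lp, lr = inter / len(z_aug), inter / len(y_aug)
    lf = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, lf
```

Compared with the plain hierarchical measures, only the path up to the nearest relevant LCA is added, so deep nodes with long ancestor chains no longer dominate the augmented sets.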
In our experiments, whose main purpose is to assess the contribution of the proposed HDNN to the incorporation of label hierarchy knowledge directly into the neural network, the information provided by the hierarchical measures is particularly important.
4.5. Results and Discussion
In order to investigate the capability of the HDNN to incorporate hierarchical label knowledge into the neural network, we compared it with a classic CNN architecture, such as the one described in [40], where the same CNN layers of the HDNN are implemented but organized in a flat topology. In detail, considering that in our experimental assessment we simplified the problem by truncating the MeSH tree at the fifth level, as described in Section 4.1, each internal node corresponding to an element of the hierarchy is formed by a cascade of two DNN layers (a CNN layer and a dense layer), in addition to the two preprocessing CNN layers in the Initial Feature Processing block (see Figure 1). In this case, it is possible to estimate 17 layers for the deepest tree branch and 4 layers for the flat CNN architecture.
Table 4 and
Table 5, respectively, show the flat and hierarchical measures obtained with a simple CNN architecture and with the proposed HDNN approach, using static word embeddings (WE) or dynamic word embeddings (BioBERT). As expected, it is possible to note a performance improvement in the case of the HDNN, in terms of both flat and hierarchical measures. Moreover, the use of the dynamic BioBERT-based embedding in the first layer of the neural network provides a further improvement compared with the classic static word embedding approach. In more detail, the MiF and EBF scores obtained by the HDNN are improved by 24.7% and 30.8%, respectively, in the case of static WE, and by 23.7% and 35.8% using BioBERT embeddings. Moreover, the increment in the HiF score is equal to 19.8% for HDNN WE and 35.9% for HDNN BioBERT. The highest impact of the HDNN is obtained in terms of the LCA-based hierarchical measures: the LCA-F improves by 58.6% when static WEs are used and by 65.2% when exploiting the BioBERT model to encode the input text.
In summary, the HDNN improves overall performance in terms of all the considered measures (with the slight exception of HiP for HDNN WE, probably caused by the overestimation of the error produced by this measure in the case of nodes with many ancestors). More importantly, the results show a significant increment in the case of the LCA-based hierarchical measures. This score considers the correct classification of the label hierarchy, giving a higher weight to the assignment of a label that lies in the same hierarchy path. This behavior demonstrates that the HDNN is able to assign more correct labels than a flat CNN architecture, and that in many cases other labels, although not originally included in the manual annotation, could be reasonable assignments. The following examples clarify this result.
Referring to
Figure 3, if a document related to cardiac edema is tagged with the label
Heart Failure (the higher-level label in the hierarchy path), a flat measure will consider this result completely wrong, while the hierarchical measures will suggest that the result, although not perfect, is a good one. For our purposes, this information is very important because an improvement in the hierarchical measures shows that the neural network has improved the correct classification of the labels along the hierarchy path, demonstrating that some information about the hierarchy has been embedded within the HDNN.
In more detail, consider as an example a paper from the dataset: the article Myocardial Edema: A Translational View has been manually tagged with several MeSHs by NLM human experts (for simplicity, we focus on a single example). Among them, there is the label Cell Enlargement [G07.345.249.410.500]. The flat DNN architecture was not able to correctly predict it. On the other hand, the HDNN BioBERT, although it also did not find this label, identified a set of ancestors of this MeSH, namely Physiological Phenomena [G07], Growth and Development [G07.345], Growth [G07.345.249], and Cell Growth Processes [G07.345.249.410], which is the parent node of the correct label. Analyzing the content of the paper, none of these labels would be considered wrong by a human annotator, as they are related to the topics of the scientific article. More importantly, the Cell Growth Processes [G07.345.249.410] MeSH is semantically very similar to its child Cell Enlargement [G07.345.249.410.500]. In summary, the HDNN approach suggested a set of possible correct candidate labels and, moreover, these labels lie in the same subtree as the correct MeSH, providing, as expected, an increment in the hierarchical scores.
The obtained results demonstrate that the HDNN can support the integration of innate information and prior knowledge into a DNN and can be used to improve the performance in complicated tasks like XMTC.
On the other hand, a limitation of the proposed approach is its high computational time. The training time required in our experiments is approximately three days for each model, while the prediction on the test set and the calculation of the evaluation measures need approximately 15 min. The flat CNN model requires approximately 1 h to be trained on the same dataset and a few minutes for the prediction task. In more detail, we measured the training times of both the flat CNN and the HDNN approaches, observing in the case of the HDNN about 15,000 s for each epoch formed by 100,000 document samples, whereas the CNN training time for the same epoch is about 100 s, using the hardware described in
Section 4.3. The HDNN training time is very high, especially compared with the flat CNN architecture, although both network topologies are characterized by a comparable number of parameters: 40,079,550 in the case of the CNN and 45,194,480 in the case of the HDNN. The very high training time of the HDNN is caused by the complexity of the graph topology, where a large number of interconnections among all the nodes are involved in reproducing the label tree structure. Moreover, each internal node of the HDNN has to wait for the outputs of its ancestor nodes before executing the forward and backward passes during the stochastic gradient descent training of the model.
For a deeper analysis of this behavior, we compared the training times of a CNN and an HDNN while increasing the depth of the labels in the training set. We observed comparable training times for the CNN and the HDNN when the labels belong only to the first layer of the hierarchy. On the other hand, the training time of the HDNN increases exponentially with each deeper level of the hierarchy included in the training set, while the CNN does not show substantial increases in training time.