*4.3. Evaluation*

For evaluation, five documents from a pool of existing publications on model tests were used for an initial performance test of the pipeline:


The documents differ in length and format, for instance in the number of columns. Furthermore, Doc#5 is defective, since the PDF contains invisible text overlaps. The three modules were evaluated separately according to their respective evaluation aims. Since semantic annotation is not a common task within the domain of tribology, standard test documents, which are widely accepted for performance measurement of NLP tasks, do not exist. Therefore, the test documents were manually annotated for the specific purpose of evaluating the pipeline introduced within this contribution.

The document extraction module was analyzed with respect to its quality in extracting and separating text and other elements, such as figures and tables, from the PDF documents. For this purpose, the output of the extraction module was compared against the ground truth (GT) for text, figures and tables, which shows whether the system works as intended: the smaller the deviations from the GT, the more reliable the PDF extraction. Table 1 assesses the reliability of text extraction against the criteria of whether chapters, paragraphs, sentences, words and chars are correctly detected and separated. The deviations from the GT are relatively small for Doc#1 (chapter −12.5%; paragraph −8.6%, sentence −7.3%, word −7.3%, char −9.8%), Doc#2 (chapter 37.5%; paragraph −26.7%, sentence −3.0%, word −10.0%, char −18.2%), Doc#3 (chapter 7.1%; paragraph 11.5%, sentence −2.4%, word −1.3%, char −5.4%) and Doc#4 (chapter 27.3%; paragraph −22.2%, sentence 23.6%, word −5.0%, char −10.9%), while they are substantially higher for Doc#5 (chapter 240%; paragraph 400%, sentence 149%, word 132%, char 118%). These high deviations can be attributed to the defective PDF, which contains embedded textual and other elements that overlap the intended content of the document.

The results of the figure extraction analysis are shown in Table 2. Almost all figures within the test set were correctly detected; only one figure was partly incorrectly extracted in Doc#2, and two figure areas were incorrectly recognized in Doc#4. Notably, all figures were correctly extracted from the defective PDF Doc#5, although 14 additional figures were identified, which is due to the overlaid elements within the PDF.

The extraction of tables also appears reliable, since the majority of tables are correctly recognized (see Table 3). An exception occurs in Doc#1, which can be attributed to a table being rotated within the publication. Overall, this shows that the first module depends on the quality and regularity of the input files. Since the module provides a manual check, small deviations from the expected output can easily be fixed via the GUI.
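To make the reported numbers concrete, the following minimal sketch shows how the signed percentage deviation from the GT in Table 1 can be computed; the example counts are illustrative assumptions, not values taken from the test documents.

```python
def deviation(extracted: int, gt: int) -> float:
    """Signed percentage deviation of an extracted count from the GT count.

    Negative values indicate that fewer elements were extracted than the
    ground truth contains; positive values indicate spurious extra elements.
    """
    return (extracted - gt) / gt * 100.0

# Illustrative counts only: 7 extracted chapters against a GT of 8 would
# yield the -12.5% chapter deviation reported for Doc#1.
print(f"{deviation(7, 8):.1f}%")  # -> -12.5%
```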

**Table 1.** Quality of text extraction regarding the extracted chapters, paragraphs, sentences, words and chars, and whether an abstract was detected (true/false). The GT is given in brackets.


**Table 2.** Quality of figure extraction regarding detected figures, incorrectly detected figure areas and additional extractions. The GT is given in brackets.


**Table 3.** Quality of table extraction regarding detected tables, additional extractions and the correct number of cells. The GT is given in brackets.


The document annotation module was evaluated with respect to its NER and knowledge object generation capabilities. Three language models (BERT, SciBERT and SpanBERT) were trained with the hyperparameters shown in Table 4, which resulted from a previously conducted parameter study. In this setup, an RNN (recurrent neural network) architecture with one layer and a hidden size of 128 was used. Dropout [72,73] is a method to reduce overfitting by randomly deactivating a number of neurons in the neural network. The learning rate defines the step size of the optimization and thus controls how quickly the model learns the given problem. The batch size specifies the number of simultaneously evaluated examples. Since the language models used have already been pre-trained on large-scale general language data (cf. Section 2.4), the training consists only of fine-tuning, which is computationally less expensive. Training each model took about 20 to 30 min on an NVIDIA RTX 2070 with 8 GB RAM.
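As a rough illustration of this setup, the following PyTorch sketch combines a pre-trained encoder with a one-layer RNN head of hidden size 128, dropout, and a token classification layer. The placement of the RNN as a head on top of the encoder, the GRU cell, the label count and the hyperparameter values are assumptions for illustration; the actual values are those in Table 4.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class NerTagger(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int,
                 rnn_hidden: int = 128, dropout: float = 0.1):
        super().__init__()
        # Pre-trained language model (e.g., a BERT variant); only fine-tuned.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # One-layer RNN head with hidden size 128, as stated above.
        self.rnn = nn.GRU(self.encoder.config.hidden_size, rnn_hidden,
                          num_layers=1, batch_first=True)
        # Dropout randomly deactivates neurons to reduce overfitting.
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(rnn_hidden, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        rnn_out, _ = self.rnn(hidden)
        return self.classifier(self.dropout(rnn_out))  # per-token logits

model = NerTagger("bert-base-cased", num_labels=16)  # label count is a placeholder
# The learning rate sets the optimization step size; the batch size (the number
# of simultaneously evaluated examples) is configured in the DataLoader.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```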


**Table 4.** Hyperparameters for training language models.

Micro and macro F1 scores were calculated to select the best of the three models for the recognition task. For this purpose, the five documents were manually annotated according to the tribological annotation model categories. For every category, precision, recall and F1 score were calculated three times for each of the trained language models with regard to the manual annotations (see Table 5). The test set contained 986 annotated sentences for the tribological annotation model categories already introduced in Figure 10.
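For reference, the following sketch shows how per-category precision, recall and F1 are derived from true-positive, false-positive and false-negative counts, and how they aggregate into micro and macro F1. The category names and counts are illustrative assumptions, not the evaluation code used in this work.

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from raw counts, guarding against division by zero."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def micro_macro_f1(counts: dict) -> tuple:
    """counts maps each annotation category to its (tp, fp, fn) triple."""
    # Macro F1: unweighted mean of the per-category F1 scores.
    macro = sum(prf(*c)[2] for c in counts.values()) / len(counts)
    # Micro F1: F1 over the counts pooled across all categories.
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    return prf(tp, fp, fn)[2], macro

# Illustrative categories and counts only.
print(micro_macro_f1({"Material": (40, 5, 10), "Method": (30, 10, 5)}))
```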


**Table 5.** Precision (P), Recall (R) and F1 score for each tribological annotation model category.



The resulting F1 scores are summarized in Table 6 for BERT, SciBERT and SpanBERT, each calculated in triplicate. As mentioned before, SpanBERT achieved the best scores within the second run, which may be because the annotated entities of the tribological categories are often spans of words rather than single tokens (e.g., "Scanning electron microscopy").

**Table 6.** Evaluation and selection of the NER model. F1 scores for BERT, SciBERT and SpanBERT.


The annotations generated through NER were further aggregated into knowledge objects within the document analysis module. The resulting numbers of aggregations are shown in Table 7. Annotations are considered incorrectly aggregated if at least two annotations are assigned to the same knowledge object although they do not belong together (false positive). Conversely, if at least two annotations that belong to one knowledge object are not aggregated, they count as a false negative. The false-negative criterion captures the completeness of the knowledge object generation, while the counts of correctly and incorrectly aggregated annotations provide insight into its precision. A precision of 89.5% is reached for the test pool, while the recall is about 84.4%. This can be considered sufficient quality for knowledge object generation.
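One possible formalization of these criteria is a pairwise comparison of the generated grouping against the GT grouping, sketched below. The data structures, helper name and example assignments are assumptions for illustration only.

```python
from itertools import combinations

def pairwise_counts(predicted: dict, truth: dict) -> tuple:
    """predicted/truth map each annotation id to its knowledge-object id.

    A pair of annotations is a true positive if both groupings place it in
    the same knowledge object, a false positive if only the system does,
    and a false negative if only the ground truth does.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_true = truth[a] == truth[b]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    return tp, fp, fn

# Illustrative assignments only.
tp, fp, fn = pairwise_counts({"a1": 1, "a2": 1, "a3": 2, "a4": 2},
                             {"a1": 1, "a2": 1, "a3": 1, "a4": 2})
precision = tp / (tp + fp)  # cf. the 89.5% reported for the test pool
recall = tp / (tp + fn)     # cf. the 84.4% reported for the test pool
```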

**Table 7.** Evaluation of knowledge object generation containing the number of annotations and of all knowledge objects, as well as correctly aggregated (true positive), incorrectly aggregated (false positive) and not aggregated (false negative) knowledge objects.


Finally, the document analysis module was evaluated with respect to its quality in answering the questions from the templates. The assessment criteria were grouped into the quality of question answering itself and whether the decision maker prefers the correct answer. The final results over all questions are shown in Table 8. A question counts toward the GT if at least one answer to it can be given from the text. The criteria for question answering itself distinguish whether the expected answer is found in the text and/or whether at least one additional answer was found independently of the expected answer. The need for the decision maker is evident from the fact that additional answers besides the expected one were found for all documents. The criteria for the decision maker accordingly distinguish whether the correct answer was preferred by the decision maker, an incorrect answer was preferred, or no answer was found or preferred. When the text contains at least one correct answer (GT), question answering itself found the correct answer with a probability of 60.4%, while the decision maker found the correct answer with a probability of 62.3%. It should be noted that the quality of the answers is highly influenced by the question templates, i.e., by which questions are asked of the publication to obtain a desired answer.
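The two conditional rates above can be read as counts over per-question records, as in the following sketch; the record fields and example values are illustrative assumptions.

```python
def conditional_rates(records: list) -> tuple:
    """records: one dict per question with keys 'gt' (text contains an
    answer), 'qa_correct' (QA found the expected answer) and 'dm' (which
    answer the decision maker preferred)."""
    with_gt = [r for r in records if r["gt"]]  # restrict to questions with GT
    qa_rate = sum(r["qa_correct"] for r in with_gt) / len(with_gt)
    dm_rate = sum(r["dm"] == "correct" for r in with_gt) / len(with_gt)
    return qa_rate, dm_rate  # cf. the 60.4% and 62.3% over the test pool

# Illustrative records only.
print(conditional_rates([
    {"gt": True, "qa_correct": True, "dm": "correct"},
    {"gt": True, "qa_correct": False, "dm": "none"},
    {"gt": False, "qa_correct": False, "dm": "none"},
]))
```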

**Table 8.** Evaluation results of all answers to question templates regarding the input parameters (e.g., kinematical parameters), structural information of the tribological system (e.g., geometry) and output parameters (e.g., friction and wear).

