## *3.2. Document Annotation*

The document annotation module (Figure 9) performs the actual annotation process. The module expects the JSON files from the document extraction module as input and can annotate both plain text and table data. The annotation module uses the Flair library [66], and the embeddings of the NER model are trained on the tribological annotation categories displayed in Figure 10. In a parameter study, embeddings from BERT (Base), SciBERT and SpanBERT were compared. The SpanBERT embeddings were chosen, since they achieved the best results with an F1 (micro) score of 0.8065 and an F1 (macro) score of 0.8012.
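The distinction between the two reported scores can be made concrete: micro F1 pools true/false positives and negatives across all annotation categories before computing a single score, while macro F1 averages the per-category F1 values, weighting each category equally. A minimal sketch with hypothetical per-category counts (not the paper's data):

```python
from collections import namedtuple

# Hypothetical per-category counts: true positives, false positives,
# false negatives (illustrative values only).
Counts = namedtuple("Counts", "tp fp fn")
per_class = {
    "material": Counts(tp=80, fp=10, fn=10),
    "lubricant": Counts(tp=20, fp=10, fn=10),
}

def f1(tp, fp, fn):
    """Harmonic mean of precision and recall; 0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Micro F1: pool the counts over all classes, then compute one F1.
tp = sum(c.tp for c in per_class.values())
fp = sum(c.fp for c in per_class.values())
fn = sum(c.fn for c in per_class.values())
micro_f1 = f1(tp, fp, fn)

# Macro F1: compute F1 per class, then average with equal class weights.
macro_f1 = sum(f1(c.tp, c.fp, c.fn) for c in per_class.values()) / len(per_class)
```

With skewed class sizes as above, micro F1 is dominated by the frequent category, whereas macro F1 exposes weak performance on rare categories; the closeness of the paper's two scores suggests fairly balanced performance across categories.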


**Figure 10.** Example of semantic annotation and knowledge object generation within the document annotation module and annotation categories for example sentences from [67].

The annotation step within the module recognizes entities of the tribological categories; an example is shown in Figure 10. The inputs are three different sentences (plain text), which are parsed, and entities of the different categories are annotated. In a second step, the annotations are aggregated into knowledge objects; for instance, the two recognized entities MXene and Ti3C2T*<sup>x</sup>* refer to the same knowledge object (Figure 10). Through this knowledge object generation, different terms used to describe the same entity within a text are aggregated into a single object. The generation of knowledge objects is mainly based on identifying acronyms and synonyms. The identified character strings are then compared. For character strings longer than four characters, a fuzzy comparison using the FuzzyWuzzy library (https://pypi.org/project/fuzzywuzzy/, accessed on 14 December 2021) is conducted, which calculates the Levenshtein distance between two character strings. The output data of the document annotation module are again streamlined in JSON format and contain the annotated text and table data as well as the aggregated knowledge objects.
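The fuzzy comparison step can be sketched as follows. This is a minimal pure-Python sketch of the Levenshtein distance the module relies on, together with a simple length-normalized similarity; note that FuzzyWuzzy's actual `fuzz.ratio` is computed from difflib-style matching blocks rather than this normalization, and the `similarity` helper and its threshold are illustrative assumptions:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insert, delete,
    substitute) needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # deletion
                curr[j - 1] + 1,              # insertion
                prev[j - 1] + (ca != cb),     # substitution (0 if equal)
            ))
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Length-normalized similarity in [0, 1] (illustrative stand-in
    for a FuzzyWuzzy-style score divided by 100)."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest
```

Two candidate strings whose similarity exceeds a chosen threshold (e.g. 0.9) would then be merged into the same knowledge object.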

## *3.3. Document Analysis*

The document analysis module (Figure 11) is a QA system that extracts answers to questions about tribological model tests from the text in order to create triples from the document. The QA system is built on the PyTorch framework (https://pytorch.org/, accessed on 14 December 2021) using a SciBERT model from the Hugging Face library (https://huggingface.co/, accessed on 14 December 2021).

**Figure 11.** Document analysis module.

The BERT model is fine-tuned with question-answer pairs. Question templates (Figure 12) are generated, which contain the questions for extracting knowledge objects from the text. These templates determine the structure of the information that should be extracted from the text; this means the question templates can be customized depending on the extraction task. The decision maker is an intermediate aggregation step containing multiple redundancies, which ensures higher reliability of an extracted answer. To this end, the question template contains the same question rephrased several times. Furthermore, the answer space is restricted by using regular expressions (Regex) to define an expected answer pattern and by specifying an entity type (tribological category) of the extracted answer. The final result is the ID of a knowledge object and its textual annotation, if a knowledge object can be assigned; otherwise, the textual passage itself is extracted as the answer to the question.
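The decision-maker aggregation described above can be sketched in a few lines. This is a hypothetical reduction of the step, assuming the QA model has already returned one answer span per rephrased question: answers are filtered against the template's expected Regex pattern, and the most frequent valid answer wins. The function name, the example pattern, and the sample answers are illustrative assumptions, not the paper's implementation:

```python
import re
from collections import Counter

def decide(answers, pattern):
    """Keep only answer spans that fully match the expected Regex
    pattern, then return the most frequent one (majority vote),
    or None if no answer is valid."""
    valid = [a.strip() for a in answers if pattern.fullmatch(a.strip())]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]

# Example: a testing-duration template expects a number plus a time unit.
duration_pattern = re.compile(r"\d+\s*(?:s|min|h)")

# The same question, rephrased three ways, may yield different spans:
answers = ["60 min", "60 min", "the test ran for an hour"]
best = decide(answers, duration_pattern)  # majority of valid spans
```

In the module, the surviving answer would additionally be checked against the expected entity type and, where possible, linked to an existing knowledge-object ID.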


**Figure 12.** Question template example for the extraction of the testing duration of an experiment.
