## *2.3. Ontologies, Knowledge Graphs and Semantic Annotation*

As Gruber [41] states, within an ontology, definitions associate the names of entities in a universe of discourse (e.g., classes, relations) with descriptions of what the names are meant to denote and with formal axioms that constrain their interpretation and use. The schema provided by an ontology can therefore be shared among different knowledge graphs, which hold the actual data. This is referred to as ontological commitment and guarantees consistency, even for incomplete knowledge, since there is an agreement to use a shared vocabulary [41]. Such commitments to a specific vocabulary (or terminology) are also made implicitly in natural language communication. In the domain of tribology, they exist, for instance, for the description of a tribological system (cf. Figure 1). Since tribological testing should enable reproducible and comparable results, experiments must be built upon a common methodology, which defines the system structure as well as the input and output parameters. Describing experiments and their results in a scientific publication using a common terminology is a first step towards knowledge formalization (Figure 3).

**Figure 3.** Different degrees of formalization from natural language text to logical constraints. Redrawn and adapted from [42].

Nevertheless, a challenge in sharing knowledge through natural language publications is vague or even insufficient description, since knowledge of the underlying domain theory is often assumed to be present in the human reader. An example is the following description from the materials section of an experimental study on Ti<sub>3</sub>C<sub>2</sub>T<sub>x</sub> nanosheets (MXenes) [43–45] investigated as a solid lubricant for machine elements [46]:

*"Commercially available thrust ball bearings 51201 according to ISO 104 [* ... *] consisting of shaft washer, housing washer and ball cage assembly were used as substrates (Figure 1a)."*

With background knowledge of the tribological domain, it becomes clear that the studied tribological system is a certain thrust ball bearing and, with the information given by the referenced figure, the coated parts can be identified by a human reader. However, within the textual description, the link to the underlying methodology is not stated explicitly. Thus, from a formalization perspective, the documentation of the experiment is incomplete and ambiguous. In other words, a semantic gap between textual descriptions in publications and general knowledge models with a higher degree of formalization (Figure 3) prevents machine-supported processing of existing tribological knowledge from publications. Semantic annotation bridges this gap by joining natural language with formal semantic models (e.g., an ontology) [47]. A semantic annotation of the example cited above, associated with ontological concepts from the tribAIn ontology [29], is shown in Figure 4. In this example, the string "thrust ball bearings 51201" is recognized as referring to the instance *tbb\_51201*, which is a tribological system ("*tbb\_51201 a tAI:TriboSystem*" in the triple notation of Figure 4). Furthermore, the components of the thrust ball bearing are mapped to the instances *sw\_51201* (shaft washer), *hw\_51201* (housing washer) and *as\_bc\_51201* (ball cage assembly) and are annotated as parts of the tribological system within the triple notation. Semantically annotating text snippets with instances of an ontology enriches the natural language text with machine-readable context. For example, the instance *tbb\_51201* may not only refer to the experimental testing described in the publication but may also be linked within the knowledge graph to information from the ISO 104 standard mentioned in the text snippet. The semantic annotation process therefore links mentions of entities from different sources to knowledge objects within a knowledge graph, which are further semantically defined by an ontological schema.

**Figure 4.** Example of a semantic annotation of a text excerpt from [46] with concepts from the tribAIn ontology [29] graphically visualized and in triple notation (Turtle format).
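To make the triple notation of Figure 4 more tangible, the following sketch builds the corresponding triples with the Python library rdflib and serializes them to Turtle. The namespace URI and the *hasPart* property are illustrative assumptions; only the concept *TriboSystem* and the instance names are taken from the example above.

```python
# Sketch of the annotation triples from Figure 4, built with rdflib.
# The namespace URI and the hasPart property are illustrative assumptions.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

TAI = Namespace("http://example.org/tribAIn#")  # placeholder namespace URI

g = Graph()
g.bind("tAI", TAI)

# "thrust ball bearings 51201" is recognized as the instance tbb_51201
g.add((TAI.tbb_51201, RDF.type, TAI.TriboSystem))

# shaft washer, housing washer and ball cage assembly as parts of the system
for part in ("sw_51201", "hw_51201", "as_bc_51201"):
    g.add((TAI.tbb_51201, TAI.hasPart, TAI[part]))

print(g.serialize(format="turtle"))
```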

Furthermore, some semantic annotation systems perform ontology population, which means not only annotating documents with respect to an existing ontology (resulting in semantic documents) but also creating new instances from the textual source [47]. For example, the ball bearing from the example above is instantiated as a new knowledge object within a knowledge graph. One advantage of building knowledge graphs from textual sources is the direct link between mentions of knowledge objects within a source and the capability of generating structured data from those mentions, even if the facts about a knowledge object originate from different sources. A schematic architecture of a semantic knowledge base, which consists of a domain ontology on the schema level as well as a knowledge graph that holds the data about knowledge objects, is shown in Figure 5. An example of structured information is given for the knowledge object "thrust ball bearings": once describing its use in a tribological test setup and once describing it as a rolling bearing with its specification.

**Figure 5.** Schematic architecture of a semantic knowledge base, consisting of schema-level ontologies and a knowledge graph containing knowledge objects as structured data and mentions of knowledge objects from semantic documents.
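Once such a knowledge graph is populated, the facts that different sources contribute about a knowledge object can be retrieved with a single query. The sketch below assumes a knowledge graph serialized as `knowledge_graph.ttl` and reuses the illustrative namespace and instance name from the annotation example; it is not a definitive interface of the tribAIn knowledge base.

```python
# Sketch of querying a populated knowledge graph for all facts about one
# knowledge object; file name, namespace and instance name are assumptions.
from rdflib import Graph

g = Graph()
g.parse("knowledge_graph.ttl", format="turtle")  # assumed serialized knowledge graph

query = """
PREFIX tAI: <http://example.org/tribAIn#>
SELECT ?property ?value
WHERE { tAI:tbb_51201 ?property ?value . }
"""

for prop, value in g.query(query):
    print(prop, value)  # e.g., its role in a test setup or its bearing specification
```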

Thus, information from various sources is linked to an object of the knowledge graph. Moreover, the original textual sources are themselves linked as nodes within the graph. Semantic annotation can be performed manually, semi-automatically or automatically. The semi-automatic approach is preferred, since manual annotation is time-consuming and fully automatic approaches can lead to unreliable information within the resulting knowledge graph [47].

## *2.4. Natural Language Processing*

A semi-automatic semantic annotation process is often conducted with methods from NLP. The main challenge of NLP is the representation of the contextual nuances of human language, since the same matter can be described using different wording and the same word can carry different meanings depending on the context. Enabling machines to understand and process natural language therefore demands a machine-readable model of language. However, Goldberg [48] describes a challenging paradox in this context: humans are excellent at producing and understanding language and are capable of expressing and interpreting highly elaborate and nuanced meaning; at the same time, humans struggle to formally understand and describe the rules that govern their language [48]. Rules in this context refer not only to syntax and grammar but also to contextual concerns. Consider, for example, the classic NLP task of classifying documents into one of the four categories metals, fluids, ceramics or polymers. Human readers categorize documents into these topics relatively easily, guided by the words used within a publication, but writing down the implicitly applied categorization rules is rather challenging [48]. Therefore, machine learning models are trained to learn vectorized text representations from examples, which are suitable input formats for NLP downstream tasks (e.g., document classification). The classic preprocessing steps for generating those text representations from a document corpus are summarized in Figure 6.
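A minimal sketch of such a document classification task is given below. It uses a simple bag-of-words baseline with scikit-learn rather than the learned embeddings discussed in the following; the category labels are taken from the example above, while the training snippets are invented placeholders.

```python
# Minimal sketch of the document-classification task described above,
# using a bag-of-words baseline; the training snippets are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "wear behaviour of steel alloys under dry sliding",   # metals
    "viscosity of synthetic ester base oils",             # fluids
    "fracture toughness of silicon nitride components",   # ceramics
    "friction of PTFE composites against steel",          # polymers
]
train_labels = ["metals", "fluids", "ceramics", "polymers"]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(train_texts, train_labels)

print(classifier.predict(["sliding tests on PEEK polymer coatings"]))
```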

Almost any analysis of natural language starts with splitting the documents into their parts (e.g., plain text, charts, figures), removing noise (e.g., references, punctuation) and normalizing word forms [49]. Subsequently, the plain text is further split into the minimal entities of textual representation, the tokens, on word or character level. Since ML models expect some kind of numerical representation as input, the tokens are replaced by their corresponding IDs [50]. If a text is split into tokens on word level, the question arises as to what counts as a word. To answer this question, morphology deals with word structures and the minimal units a word is built from, such as stems, prefixes and suffixes. These minimal units are important if a tokenizer has to deal with unknown words (i.e., words that were not in the training corpus) [48]. Tokenizers like WordPiece [51] represent words as subword vectors [49]; e.g., "nanosheets" can be separated into the subwords "nano" and "sheet" and the plural ending "-s". The tokens are then transformed into so-called embeddings, which are an input representation that ML or deep learning architectures can handle for NLP tasks. An embedding is a representation of the meaning of a word; embeddings are thus learned under the premise that words with the same meaning have similar vector representations [49]. A distinction is made between static embeddings and contextualized embeddings. One quite popular static word embedding package is Word2Vec [52,53]. A shortcoming of static embeddings is that polysemy is not properly handled, since one fixed representation is learned for each word in the vocabulary even if a word has different meanings in different contexts [49,54].

**Figure 6.** Preprocessing steps to generate embeddings from text as input to NLP downstream task. Redrawn and adapted from [50].
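The tokenization and ID-mapping steps can be illustrated with the WordPiece tokenizer of a pre-trained BERT model, as in the sketch below (using the Hugging Face transformers library); the exact subword split depends on the learned vocabulary, so the behavior described in the comments is indicative only.

```python
# Sketch of WordPiece tokenization and token-to-ID mapping using a pre-trained
# BERT tokenizer; the exact subword split depends on the learned vocabulary.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("MXene nanosheets as solid lubricant")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

# Out-of-vocabulary words are split into subword units; the '##' prefix marks
# a continuation of the preceding subword.
print(tokens)
# The numerical IDs are the input representation expected by the model.
print(token_ids)
```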

Therefore, contextualized (dynamic) embeddings provide different representations of a word depending on the other words within the sentence. State-of-the-art representatives are ELMo (Embeddings from Language Models) [55], GPT and GPT-2 (Generative Pre-Training) [56] and BERT (Bidirectional Encoder Representations from Transformers) [33], which are also referred to as pre-trained language models. BERT is a multi-layer bidirectional transformer encoder [33,57], which is provided in a base version with 12 layers and a large version with 24 layers. Most of the recent models for NLP tasks are pre-trained on language modeling (unsupervised) and fine-tuned (supervised) with task-dependent labeled data [58]. Thus, those models are trained to predict the probability of a word occurring in a given context [48]. BERT is pre-trained on a large amount of general-purpose text from the BooksCorpus and English Wikipedia, resulting in a training corpus of about 3300 M words [33]. Devlin et al. [33] differentiate BERT's pre-training from that of the other mentioned models: it consists of two unsupervised tasks, masked language modeling (MLM) and next sentence prediction (NSP) (see also [59] for further information on BERT's pre-training). When fine-tuning BERT for downstream tasks, such as Question Answering (QA) or Named Entity Recognition (NER), the same architecture is used apart from the output layer (see Figure 7). The input layer consists of the tokens (Tok 1 ... Tok n). The special token [CLS] marks the starting point of every input and [SEP] is a special separator token; question-answer pairs, for instance, can thus be separated within the input [33]. The input embeddings (E1 ... En) are computed through every layer, resulting in intermediate representations (Trm) and finally in the contextualized output representations (T1 ... Tn). For more information on Transformer architectures, the interested reader is referred to [60]. There are different extensions of the original BERT model that are specialized for certain downstream tasks or domain terminologies. The SciBERT model [61] is pre-trained on scientific papers, improving the performance on downstream tasks involving scientific vocabulary. BioBERT [32] is pre-trained on large-scale biomedical corpora and improves the performance of BERT especially in biomedical NER, relation extraction and QA. Furthermore, SpanBERT [62] is a pre-training approach focused on representing text spans instead of single tokens. Both pre-training tasks from the original BERT are adapted for predicting text spans instead of tokens, which is especially useful in relation extraction and QA.

**Figure 7.** BERT pre-training and fine-tuning procedures using the same architecture for both. Only the output layer differs depending on the downstream task (e.g., NER, QA). Redrawn and adapted from [33].
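To illustrate the input format sketched in Figure 7, the following snippet encodes a question-context pair with a pre-trained BERT tokenizer (Hugging Face transformers); the question and context sentences are taken freely from the bearing example above and are not the authors' actual fine-tuning data.

```python
# Sketch of how a question-context pair is prepared as BERT input for QA
# fine-tuning: [CLS] marks the start, [SEP] separates the two segments.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoding = tokenizer(
    "Which components were used as substrates?",  # question segment
    "Thrust ball bearings 51201 consisting of shaft washer, housing washer "
    "and ball cage assembly were used as substrates.",  # context segment
)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
print(encoding["token_type_ids"])  # 0 for question tokens, 1 for context tokens
```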
