Usage examples are represented as attestations:
```turtle
:fosilno_gorivo frac:attestation :attestation_1324567 .
:attestation_1324567 a frac:Attestation ;
   cito:hasCitedEntity <https://app.sketchengine.eu/#dashboard?corpname=user%2FAleksandraTomasevic%2Frudkor> ;
   rdfs:comment "Dokument 31, DK_Monitoring u zivotnoj sredini" ;
   frac:locus :locus_2415677 ;
   frac:quotation "Koncentracija zagađujućih supstanci, posebno onih koje se izdvajaju sagorevanjem fosilnih goriva, varira u odnosu na godišnje doba (leto, zima)." .
:locus_2415677 a :Occurrence ;
   nif:beginIndex 80 ;
   nif:endIndex 96 .
```
We have just started using VocBench, a web-based, multilingual, collaborative development platform for managing OntoLex-lemon lexicons, among other RDF datasets [59], to publish terminology as RDF data and thus meet the needs of Semantic Web and linked data environments. VocBench is an open-source web platform for collaborative development of datasets in compliance with Semantic Web standards; it offers a general-purpose collaborative environment for the development of any type of RDF dataset (with dedicated facilities for ontologies, thesauri and lexicons), including editing capabilities and management of a SPARQL endpoint [60]. The system interacts with standard technologies in the RDF/Linked Data world: it can surf linked open data on the Web, access SPARQL endpoints, resolve RDF descriptions through HTTP URIs, and import/export data through standard Graph Store APIs and the like.
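Once the terminology is published behind a SPARQL endpoint, attestations such as the one in the Turtle sample above can be queried programmatically. The sketch below builds such a query in Python; the endpoint URL and the base namespace are assumptions made for illustration, not the project's actual IRIs:

```python
from urllib.parse import urlencode

# Hypothetical endpoint URL; the actual Termi SPARQL endpoint may differ.
ENDPOINT = "https://termi.rgf.bg.ac.rs/sparql"

def attestation_query(term_local_name):
    """SPARQL query retrieving quotations attached to a term entry
    via frac:attestation/frac:quotation, as in the Turtle sample above."""
    return (
        "PREFIX frac: <http://www.w3.org/ns/lemon/frac#>\n"
        "PREFIX : <https://termi.rgf.bg.ac.rs/terms#>\n"  # assumed base namespace
        "SELECT ?quotation WHERE {\n"
        f"  :{term_local_name} frac:attestation ?att .\n"
        "  ?att frac:quotation ?quotation .\n"
        "}"
    )

def request_url(term_local_name):
    """GET URL that an HTTP client would resolve against the endpoint."""
    return ENDPOINT + "?" + urlencode({"query": attestation_query(term_local_name)})

url = request_url("fosilno_gorivo")
```

A client would then issue a plain HTTP GET on this URL and receive SPARQL results in JSON or XML, which is exactly the kind of machine-readable access the linked data publication enables.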

#### *4.3. The Web and Mobile App*

The application for the management of Serbian morphological dictionaries, including the evaluation of automatically extracted term candidates used in this approach, is Leximirka [61]. Figure 5 presents a web page with the term entry *'jalovina'*, where the user can see: (1) inflected forms with grammatical categories; (2) the inflectional class (*'N600'*) and dictionary (*'delasim.dic'*); (3) dictionary entries from other dictionaries (digitized and digitally born) grouped by dictionary type (descriptive, terminological, bilingual); (4) related entries (e.g., the relational adjective *'jalovinski'*), lexical variants and derived terms; (5) corpus frequencies; (6) corpus selection with links to concordances and frequency histograms for a simple lemma query or predefined syntactic patterns (in the figure, the pattern AN, where N is the headword *'jalovina'*); (7) one or more senses with semantic and domain markers.

An important feature of this system is the possibility to insert a formula into a definition, which is often necessary to define a concept precisely. Figure 6 presents a part of the screen with the LaTeX form of a definition and its preview on the same panel. The web application uses MathJax [62,63], a JavaScript display engine for mathematics that works in all browsers, while the mobile application uses KaTeX [64,65] for formula rendering.

**Figure 6.** Formula editing and preview in term entry.

The mobile application allows the user to search for a Serbian or English term: the query is submitted to the Termi API and a list of entries is retrieved, with a further possibility to request examples for selected entries. Figure 7 presents screenshots of the mobile and web applications.
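The request flow just described (a term search followed by a per-entry request for examples) can be sketched as a minimal client; the base URL and REST paths below are hypothetical, since the actual Termi API routes are not specified here:

```python
from urllib.parse import quote

# Hypothetical base URL and routes; the real Termi API may use different paths.
BASE = "https://termi.rgf.bg.ac.rs/api"

def search_url(term, lang="sr"):
    """URL of the query that returns the list of matching entries;
    lang selects Serbian ('sr') or English ('en') search."""
    return f"{BASE}/entries?lang={lang}&q={quote(term)}"

def examples_url(entry_id):
    """URL of the follow-up request for usage examples of one selected entry."""
    return f"{BASE}/entries/{entry_id}/examples"
```

The mobile and web front ends can then share this single API surface, which keeps the presentation layers thin.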

Besides search, browsing and the described export, the application can also be used to prepare a dataset for Lexonomy [66,67]. Figure 8 presents a panel for term entry editing, which is connected with Sketch Engine and enables retrieval of examples from a related corpus, in our case the corpus from the mining domain.


**Figure 7.** Term entry preview in the mobile and Termi web applications.


**Figure 8.** Term entry editing and preview in Lexonomy.

#### **5. Discussion**

The presented approach to the development of terminology for the raw materials domain, based on digitized and electronic dictionaries and on terminological and domain corpora, enables systematic development of terminology: it complements traditional terminological dictionaries with usage examples and provides a comprehensive picture of the use of terms in various dictionaries, textbooks, and professional and scientific literature. A terminology system that includes a relational terminology database and a SPARQL endpoint with linguistic linked open data, on the one hand, and a web and mobile application, on the other, provides a technological solution that enables data management and continuous updating, upgrading and expansion of the available data, while the various application forms (web and mobile) make the content more accessible to users.

Integration of terminology with the lexical database and morphological dictionaries, which enables support for a complex inflectional system, is important for all languages with rich morphology, such as Serbian. Integration with corpora, both standard and terminological, provides insight into the use of terms in modern language and in a specialized domain: it gives access to individual examples, but also to the frequency of use of different syntactic structures, supporting research into collocations of individual terms.

The approach is demonstrated on the example of mining, but the same approach and the developed software solutions can be applied to other areas, which is certainly one of the further directions of activity. It should also be noted that the approach can be applied to other languages, since it depends on the available data and not on the language itself.

The vast amount of digitized resources (22 dictionaries, a monolingual corpus with 4 million words and a bilingual corpus with 12,657 aligned sentences) represents the basis for numerous other research activities: the development of collocation dictionaries and the creation of possibly printed dictionaries of different volumes (including pocket and encyclopedic ones). Such a system will make it easier for students to translate from English using the correct Serbian terms, but also to write articles and translate into English for academic purposes.

Since the presented approach combines data reuse, automatic extraction and manual post-editing, a comparison of those aspects with similar solutions follows.

When it comes to the reuse of data, we followed the idea of the Sõnaveeb language portal of the Institute of the Estonian Language [10], which contains data from 70 dictionaries and termbases, comprising a total of 200,000 Estonian headwords with many new types of lexicographic information: collocations, etymology, multi-word expressions, and so forth. The number of lexicons in our case is much smaller, but at the moment we are focused on the mining domain and related terminology. Also, our system does not include etymology, but we plan to introduce it in the future. There is a difference in the software solution for mobile users: the Institute of the Estonian Language decided to produce a responsive web page that automatically adapts to the screen of different devices, whether desktop, laptop, tablet or smartphone, while we produce a mobile Android application akin to Oxford Dictionary or Merriam-Webster. Finally, the difference related to corpus use is that our system has a direct connection with corpora, both domain and general language, which allows users to retrieve concordances, collocations defined by syntactic patterns and graphical frequency presentations. Sõnaveeb is the result of several projects over a longer period, developed by a much bigger team, but we are following their ideas to continually improve our system.

An Integrated Approach to Biomedical Term Identification Systems [11] combines several sources of information and knowledge bases to provide biomedical term identification with a modular architecture, which includes medical term identification, literature retrieval and ontology browsing through the application of several NLP technologies. The similarity with our system lies in combining several terminological and lexical resources, as well as in the use of various NLP techniques, while the difference is that their system generates a conceptual graph that semantically relates all the terms found in the text, which is part of our plan for future research. On the other hand, our system is building a new resource that integrates a number of digitized and electronic resources.

The corpus-based approach for extracting domain-oriented and technical words, applied to improve the efficiency of corpus analysis of COVID-19 big textual data [7], is based on the elimination of function words and meaningless words. This widely accepted approach works well for information retrieval but is less successful for knowledge extraction, lexicographic and terminological purposes, so we rely on a combination of syntactic patterns [34,42,68] and statistical association measures for domain terms, log-likelihood [69] and C-value/NC-value [70], because such hybrid systems have proved to yield the best solutions [71].
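To illustrate the statistical side of this hybrid approach, the C-value measure [70] promotes longer candidates and penalizes candidates that mostly occur nested inside longer terms. A minimal sketch follows; the example terms and frequencies in the usage below are invented for illustration:

```python
import math

def c_value(term, freq, candidates):
    """C-value of a multi-word candidate `term` (a tuple of words).

    freq: dict mapping each candidate tuple to its corpus frequency.
    candidates: all extracted candidate tuples (used to find terms nesting `term`).
    """
    def contains(longer, shorter):
        # True when `shorter` appears as a contiguous subsequence of `longer`.
        n = len(shorter)
        return len(longer) > n and any(
            longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

    nesting = [t for t in candidates if contains(t, term)]
    if not nesting:
        # Non-nested candidate: length weight times raw frequency.
        return math.log2(len(term)) * freq[term]
    # Nested candidate: subtract the average frequency of the nesting terms.
    penalty = sum(freq[t] for t in nesting) / len(nesting)
    return math.log2(len(term)) * (freq[term] - penalty)

# Illustrative counts only: a bigram nested in a trigram candidate.
freq = {("otkopna", "metoda"): 12, ("podetazna", "otkopna", "metoda"): 5}
cands = list(freq)
score = c_value(("otkopna", "metoda"), freq, cands)  # log2(2) * (12 - 5) = 7.0
```

In practice, the candidates themselves come from the syntactic patterns mentioned above, and the scores are used to rank candidates for manual post-editing.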

Besides monolingual term extraction, we also followed a different approach when it comes to bilingual term extraction [72,73]. We first perform monolingual extraction of domain-specific terms using available terminology extractors, and then, given a source term and a parallel sentence pair in which it appears, a set of possible translations is obtained. There are different options: to use automatic translation trained on the same corpus using GIZA++ [40,43], to apply a word aligner [72], or to use log-likelihood comparison and phrase-based statistical machine translation models as in TermFinder [73]. We rely on previous research [27,39,40] that proved successful for bilingual term extraction in other domains where one language is Serbian.
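The log-likelihood comparison mentioned above scores the association between a source term and a candidate translation from their co-occurrence across aligned sentence pairs, summarized in a 2×2 contingency table. A minimal sketch of the G² statistic (the counts in the test cases are illustrative):

```python
import math

def log_likelihood(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) over a 2x2 contingency table:
    k11 = aligned pairs containing both the source term and the candidate,
    k12 = pairs with the source term but not the candidate,
    k21 = pairs with the candidate but not the source term,
    k22 = pairs containing neither.
    """
    total = k11 + k12 + k21 + k22
    g2 = 0.0
    for observed, row, col in (
        (k11, k11 + k12, k11 + k21),
        (k12, k11 + k12, k12 + k22),
        (k21, k21 + k22, k11 + k21),
        (k22, k21 + k22, k12 + k22),
    ):
        expected = row * col / total  # count expected under independence
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2
```

Candidates are then ranked by G²: a translation that co-occurs with the source term far more often than chance predicts gets a high score, while independent co-occurrence scores zero.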

Sketch Engine [9] has different types of extraction implemented for various languages, starting with keyword extraction, word sketches, usage examples and a thesaurus, but it is not fully adapted for Serbian, and its results are far less successful than those obtained in our research [40,68]. Sketch Engine offers tools that significantly speed up the process of dictionary building, especially the "OneClick Dictionary" process, which consists of generating a headword list, providing part-of-speech and usage labels, and generating candidates for example sentences, collocations, synonyms and thesaurus entries, definitions and/or translations [74]. The output is pushed into the Lexonomy dictionary writing system [66,67], from where lexicographers can communicate with Sketch Engine during the post-editing phase, browsing concordances from a corpus and retrieving selected examples directly into the interface form. The integration with a corpus is a rare and very useful possibility, but Lexonomy lacks hierarchy browsing, does not support mathematical formulae and has limited search capabilities.

#### **6. Conclusions**

The presented approach relies on the results of previous research in the field of NLP and terminology, but represents the first comprehensive solution for both building and using a terminology system that includes data, application and user interface layers covering different data and software technologies.

The automation of data publishing in the form of linked data, as one of the core pillars of the Semantic Web or the Web of Data, provides links between data sets that are understandable not only to humans, but also to machines, by sharing machine-readable interlinked data on the Web.

The next big challenge for the future is the automation of core lexicographic tasks related to semantics, such as finding definitions or identifying senses in two distinct processes: word-sense disambiguation (attributing the correct sense from a predefined set of senses) and word-sense induction (clustering of senses based on word context). Another challenge is the integration of the results into linked open data, especially word embeddings, collocations and similarity measures.
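As a toy illustration of the first task, a Lesk-style gloss-overlap heuristic attributes the sense whose definition shares the most words with the term's context; the English glosses for *'jalovina'* below are invented for illustration and do not reproduce the system's actual sense inventory:

```python
def lesk(context_words, senses):
    """Pick the sense whose gloss overlaps most with the context.

    context_words: iterable of words surrounding the target term.
    senses: dict mapping a sense label to its gloss (a string).
    """
    context = set(w.lower() for w in context_words)

    def overlap(gloss):
        # Number of distinct context words shared with the gloss.
        return len(context & set(gloss.lower().split()))

    return max(senses, key=lambda s: overlap(senses[s]))

# Illustrative glosses only; not the entry's real definitions.
senses = {
    "waste-rock": "material of no economic value removed during mining",
    "barren-soil": "soil that is infertile and cannot support crops",
}
best = lesk("rock removed from the mining pit".split(), senses)
```

A production system would of course use embeddings and richer context modelling, but the same principle (matching contexts against predefined senses) underlies word-sense disambiguation.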

In future research we will incorporate synonyms for lexical sememe prediction (a sememe being the smallest semantic unit for describing real-world concepts) using an attention-based model [75], which scores candidate sememes from synonyms by combining distances of words in the embedding vector space, and derives an attention-based strategy to dynamically balance two kinds of knowledge: a synonymous word set and word embedding vectors.

**Author Contributions:** Conceptualization, O.K. and R.S.; Data curation, A.T. and L.K.; Formal analysis, R.S. and M.Š.; Investigation, O.K.; Methodology, O.K., R.S. and I.B.; Validation, A.T. and L.K.; Software, O.K., M.Š. and I.B.; Writing—all authors. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Finnish Work Environment Fund and the Ministry of Education, Science and Technological Development of the Republic of Serbia within the European Science Program SAFERA (European Research on Industrial Safety towards Smart and Sustainable Growth), grant SafePotential, for the period 2019–2020. Access to Sketch Engine and Lexonomy is provided by the ELEXIS project funded by the European Union's Horizon 2020 research and innovation programme under grant number 731015. Linked data development is supported by the COST Action CA18209 NexusLinguarum "European network for Web-centred linguistic data science".

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are freely available for search at https://termi.rgf.bg.ac.rs/ (accessed on 15 March 2021) and on request from the corresponding author.

**Acknowledgments:** The authors thank Ivan Obradović for proofreading and constructive comments, Cvetana Krstev for use of the electronic dictionary of Serbian, Petar Popović for corpus management and Branislava Šandrih for feature extraction from usage examples.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

