*Article* **A Provable and Secure Patient Electronic Health Record Fair Exchange Scheme for Health Information Systems**

**Ming-Te Chen † and Tsung-Hung Lin \*,†**

Department of Computer Science and Information Engineering, National Chin-Yi University of Technology, Taichung 41170, Taiwan; mtchen@ncut.edu.tw

**\*** Correspondence: duke@ncut.edu.tw

† These authors contributed equally to this work.

**Abstract:** In recent years, several hospitals have begun using health information systems to maintain an electronic health record (EHR) for each patient. Traditionally, when a patient visits a new hospital for the first time, the hospital's help desk asks them to fill in their personal information on paper and verifies their identity on the spot. Meanwhile, the patient's existing records remain scattered across the health information systems of the hospitals they visited in the past, and these EHRs cannot be accessed or shared between hospitals. The patient therefore has to provide the same personal information again, which is time-consuming and impractical. In this paper, we propose a practical and provable patient EHR fair exchange scheme. In this scheme, each patient can securely delegate the information system of their current hospital to apply to a hospital certification authority (HCA) for migration evidence that can be used to transfer their EHR to another hospital. The delegated system can also establish a session key with other hospital systems for later data transmission, and each patient can protect their anonymity with the help of the HCA. We also provide formal security proofs for forward secrecy and functional comparisons with other schemes.

**Keywords:** electronic health records; fair exchange; forward secrecy

#### **1. Introduction**

In recent years, many research topics have arisen to make human life more convenient. An electronic health record (EHR) is an integrated personal medical record kept in a health information system. Many countries implement their own health information systems to help manage each patient's activities and keep track of their health. Consider a scenario in which a patient (let us call her Alice) plans to visit a new hospital and see a doctor. She may have to fill in her personal medical information yet again when she attends the new hospital. In addition, if her doctor needs to know her treatment history from other hospitals, how she can provide these records to her doctor securely must be considered. These problems are urgent. Our proposed scheme aims to make data access and migration both easy and secure: we propose a practical and provable patient EHR fair exchange scheme with session key establishment for health information systems. Patients can not only delegate the migration of their personal EHR from their current hospital's health information system to a desired hospital system but also protect their privacy. Our mechanism provides secure data storage and the secure transfer of authorized information to a designated location. This study has two limitations. First, we assume that each patient's EHR is well defined and appropriate for each healthcare facility. Under this assumption, each hospital can transmit EHRs securely in encrypted form by implementing our proposed scheme, without considering issues such as differing forms or file names or a lack of formatting details. Second, we assume that each facility will transfer or link the patient's EHR to other facilities through patient consent or under a national policy when the patient requires better care at those facilities.

**Citation:** Chen, M.-T.; Lin, T.-H. A Provable and Secure Patient Electronic Health Record Fair Exchange Scheme for Health Information Systems. *Appl. Sci.* **2021**, *11*, 2401. https://doi.org/10.3390/app11052401

Academic Editor: Federico Divina

Received: 25 January 2021 Accepted: 2 March 2021 Published: 8 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

To address these problems, we propose a high-level, practical, and provable patient EHR fair exchange scheme with key agreement for health information systems. A patient can not only delegate the health information system of the current hospital to migrate their personal EHR to the desired hospital system but also keep their privacy. Our mechanism provides data storage and the secure transfer of authorized information to designated locations. Determining what information can be authorized is beyond the scope of this study. For example, whether private data of COVID-19 patients, such as names, identities, and genetic sequences, can be transmitted to different hospitals is beyond this study's scope. The mechanism presented here can guarantee that data are transferred and stored safely and securely. Moreover, our scheme comes with a formal security proof of chosen-ciphertext security in the random oracle model.

This paper is organized as follows. Section 2 introduces related works. Section 3 presents our proposed scheme. Section 4 gives the security definitions. Section 5 describes our security analysis. Section 6 provides a security proof. Finally, Section 7 presents our conclusions.

#### **2. Related Works**

In this section, we survey related articles [1–4]. In [1], the author discussed only how EHRs are used and managed, along with an EHR format that follows the HL7 definition [5] and relies on well-known protocols such as TCP/IP, MIME, HTTP(S), and SOAP to encode each patient's EHR.

In [2], the authors discussed several security requirements, such as EHR storage security, malicious code prevention, protected access right management, and other aspects to protect the health information system. However, they did not provide a practical scheme that would allow a patient to migrate their EHR to a health information system. We can imagine a scenario where a hospital only adopts the above simple protocols to develop its own health information system without any security mechanism. Additionally, it is not feasible for each patient to perform their own EHR exchange under this scheme.

On the other hand, in [5], the authors suggested that each patient's health records (or files) could be stored portably on a flash disk. This idea is appealing but currently difficult to implement, because many security issues must be handled, including portable device security and access rights to patient medical files. Solving these issues requires additional security mechanisms, which are beyond the scope of this research.

In addition, various patient authentication schemes for e-health systems have been proposed [6–10]. The schemes in [6,7] are vulnerable to user impersonation attacks and do not offer session key establishment with a formal security proof. The schemes in [8–10] do not provide session key establishment with a forward secrecy proof. In [11,12], the authors each proposed a framework with patient-centric access rights in a blockchain environment. However, they did not provide a practical mechanism for each patient to perform EHR migration securely.

Additionally, many studies are now examining the importance of personal privacy and data authorization. For example, during the COVID-19 pandemic, many patients have been reluctant to disclose information about their infection, whereas government healthcare units want to track these patients' movements. Improving these mechanisms is the main motivation and purpose of our study. Therefore, in this paper, we focus on providing a secure, simple, and complete mechanism for authorizing data transfer during personal information migration, and we demonstrate through an information security authentication model that our approach is secure and effective in practice.

Hence, we summarize and list here seven kinds of security attacks that can arise when a patient attempts to migrate their personal data through a traditional authentication model:


Our contribution is an efficient, provable, and practical patient EHR fair exchange scheme that lets each patient migrate their personal EHR securely from one hospital to another and provides solutions to the above seven problems. We designed a secure patient EHR exchange protocol that can be integrated into the e-health information system of each hospital. The proposed scheme can also guarantee convenience, rapidity, and integrity. With it, a patient can not only delegate the current hospital's health information system to migrate their personal EHR to the desired hospital system but also keep their privacy. Additionally, our scheme comes with a formal security proof and requires only lightweight computation from both authenticating parties.

#### **3. The Proposed Scheme**

Our proposed scheme contains three stages: the migration registration phase, the EHR migration phase, and the data recovery phase.

#### *3.1. Preliminary*

In this subsection, we provide some definitions used in our proposed scheme.


#### *3.2. The Migration Registration Phase*

Before starting this phase, a patient (*U*) forwards (*UIDU*, *IDU*) to a hospital certification authority (*HCA*) over a secure channel to register for migration certification. After receiving the identity pair (*UIDU*, *IDU*), the *HCA* keeps this link information, generates a certificate *Certi* with *IDi* for EHR migration, and forwards *Certi* to the patient *U*.

When *U* performs this phase with the server *V* of the current hospital, *U* sends a patient migration registration request to *V*. After receiving this request, *V* forwards it to the *HCA* to help *U* obtain the permission signature from the *HCA*. *V* first prepares two hash functions, *H*1 and *H*2, where $H_1 : Z_n^{\ast} \to \{0, 1\}^l$ and $H_2 : Z_n^{\ast} \to \{0, 1\}^l$.


• *V* receives this message tuple, where one part is (*SHCA*, *H*1((*rV* + 1)|*rHCA*)) and the other is ((*rV* + 1) ⊕ *rHCA*, *Date*, *IDU*, *SU*). It can verify the signature *SHCA* with the above parameters and recover *rHCA* = (*rV* + 1) ⊕ ((*rV* + 1) ⊕ *rHCA*). When the above messages are valid, *V* computes *H*1(*rHCA* + 1) and sends it back to the *HCA*. In addition, it generates a signature *SigV*(*SHCA*, *SU*, *AgreeV*−>*W*) as the receipt *SV* and finishes this phase after forwarding *SV* and *SHCA*. Figure 1 illustrates this phase.

**Figure 1.** The migration registration phase.
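The nonce handling in the last step of this phase can be sketched in a few lines. This is only an illustration under stated assumptions: SHA-256 stands in for *H*1, nonces are 128 bits wide, "|" is read as byte concatenation, and the function names `hca_respond` and `v_recover` are hypothetical. Verification of the signature *SHCA* is omitted.

```python
import hashlib
import secrets

NONCE_BYTES = 16  # assumed nonce width; the scheme does not fix one


def h1(data: bytes) -> bytes:
    """Stand-in for H1 (SHA-256 is an assumption, not the paper's choice)."""
    return hashlib.sha256(b"H1|" + data).digest()


def to_bytes(x: int) -> bytes:
    # One extra byte leaves room for the carry produced by "+ 1".
    return x.to_bytes(NONCE_BYTES + 1, "big")


def hca_respond(r_v: int, r_hca: int):
    """HCA side: mask r_HCA with (r_V + 1) and bind both values with H1."""
    masked = (r_v + 1) ^ r_hca
    tag = h1(to_bytes(r_v + 1) + to_bytes(r_hca))
    return masked, tag


def v_recover(r_v: int, masked: int, tag: bytes):
    """V side: unmask r_HCA, check the tag, and build the reply H1(r_HCA + 1)."""
    r_hca = (r_v + 1) ^ masked  # (r_V + 1) xor ((r_V + 1) xor r_HCA)
    if h1(to_bytes(r_v + 1) + to_bytes(r_hca)) != tag:
        return None  # message rejected
    return r_hca, h1(to_bytes(r_hca + 1))
```

A tampered masked value changes the recovered nonce, so the hash check fails and *V* rejects the message.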

#### *3.3. The EHR Migration Phase*

In this phase, the server *V* acts according to the delegated agreement file *AgreeV*−>*W* with the signature *SU*, and it prepares the messages as follows.


**Figure 2.** The EHR migration phase.

#### *3.4. The Data Recovery Phase*

In this phase, if network packets or EHR data of the patient *U* are lost after the EHR migration phase has been performed, the e-health system of the hospital *W* cannot obtain *U*'s complete personal EHR, so *U* can ask the *HCA* to deal with this situation.


#### **4. Security Assumptions**

#### *4.1. Secure Digital Signature*

In this scheme, we define a secure digital signature. Let *Sig*(·) be a signature generation function that takes a message *m* and a signer's secret key *ski* as input and outputs a signature *Si*. We assume that this signature function is based on the RSA factoring hard problem or the discrete logarithm problem. A signature *Si* can then be given, together with the signer's public key *pki*, to the verification function *Ver*(·). If the output is 1, the signature *Si* is valid and was signed by the signer *i*. In this scheme, we assume that the signature building block relies on the RSA problem. Let $\mathcal{F}^{\ast}$ denote an attacker. If $\mathcal{F}^{\ast}$ can produce a forged (*l* + 1)-th signature, denoted $S'_{i,j+1}$, of some user *i* ∈ {*U*, *V*, *HCA*} after at most *l* signature queries, and this signature passes the verification $Ver(pk_i, S'_{i,j+1})$ with non-negligible probability *ε*, then $\mathcal{F}^{\ast}$ can be used to break the RSA factoring problem. Thus,

$$\Pr\left[S'_{i,j+1} \leftarrow \mathcal{F}^{\ast}\left(\text{Sig}(sk_i, \cdot), \text{Ver}(pk_i, \cdot), i \in \{U, V, HCA\}\right) \,\middle|\, \text{Ver}(pk_i, S'_{i,j+1}) = 1\right] \geq \varepsilon. \tag{1}$$
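The *Sig*/*Ver* interface described above can be made concrete with a toy hash-then-sign RSA sketch. Everything here is an assumption made for illustration: the two Mersenne primes, the public exponent, and the use of SHA-256 are placeholders, the parameters are far too small for real use, and `ver` takes the message explicitly, whereas the text writes *Ver*(*pki*, *Si*).

```python
import hashlib

# Toy parameters (two known Mersenne primes); illustrative only, not secure.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537                           # public key pk
D = pow(E, -1, (P - 1) * (Q - 1))   # secret key sk (needs Python 3.8+)


def digest(m: bytes) -> int:
    """Hash the message and reduce it into the RSA modulus."""
    return int.from_bytes(hashlib.sha256(m).digest(), "big") % N


def sig(sk: int, m: bytes) -> int:
    """Sig(sk, m): sign the digest of m with the secret exponent."""
    return pow(digest(m), sk, N)


def ver(pk: int, m: bytes, s: int) -> int:
    """Ver(pk, m, s): output 1 if s is a valid signature on m, else 0."""
    return 1 if pow(s, pk, N) == digest(m) else 0
```

In this sketch a forgery would be a pair (*m*, *s*) accepted by `ver` although `sig` was never called on *m*.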

#### *4.2. Unforgeability*

In this scheme, we use the secure digital signature scheme defined above. First, we define an attacker $\mathcal{F}^{\ast}$ whose ability is to forge a signature that is verified successfully by the *Ver*(·) verification function with non-negligible probability $\varepsilon'$. We also define a simulator $\mathcal{D}$ that uses $\mathcal{F}^{\ast}$'s ability to break the underlying hard problem (such as the RSA factoring problem) of the above secure signature scheme. After $\mathcal{D}$ is given the environment parameters *G*(·), it starts the protocol simulation with $\mathcal{F}^{\ast}$. $\mathcal{F}^{\ast}$ can make signature queries to $\mathcal{D}$, and $\mathcal{D}$ returns a signature of some user *i* on the input *m* received from $\mathcal{F}^{\ast}$. After this simulation, suppose that $\mathcal{F}^{\ast}$ generates a forged signature $S'_{i,j+1}$ whose verification result is valid. We then have

$$\Pr\left[\mathcal{D}^{\mathcal{F}^{\ast} \longrightarrow S'_{i,j+1}} \,\middle|\, \text{Use } S'_{i,j+1} \text{ to solve the RSA factoring problem}\right] \ge \varepsilon \tag{2}$$

$$\Pr\left[S'_{i,j+1} \longleftarrow \mathcal{F}^{\ast}\left(\text{Sig}(sk_i, \cdot), \text{Ver}(pk_i, \cdot), i \in \{U, V, HCA\}\right) \,\middle|\, \text{Ver}(pk_i, S'_{i,j+1}) = 1\right] \ge \varepsilon'.$$

In fact, if no attacker $\mathcal{F}^{\ast}$ can make a forged signature pass the verification with non-negligible probability $\varepsilon'$, then we cannot use $\mathcal{F}^{\ast}$ to solve the RSA factoring problem with non-negligible probability *ε*.

**Lemma 1** (Unforgeability)**.** *Let Sig be a secure digital signature function equipped with two secure hash functions, $H_1$ and $H_2$, which can be replaced with two random oracle functions $RO_1$ and $RO_2$. We say that our proposed EHR scheme is unforgeable (Unf) if it satisfies the following condition. In other words, if Sig is $(t', \varepsilon')$-unforgeable, then*

$$Adv^{\text{Unf}}_{\mathcal{F}^{\ast}, \text{Sig}^{H_1, H_2, RO_1, RO_2}}(\theta, t') \le \frac{1}{2 \cdot I^3 \cdot q_s} + \varepsilon', \tag{3}$$

*where $t'$ is the maximum total experiment time, including the adversary's execution time, I is an upper bound on the number of parties, $q_s$ is the maximum number of signature oracle queries, and $\varepsilon'$ is taken over the coin flips of our EHR scheme.*

#### *4.3. Indistinguishability*

We define an attacker $\mathcal{A}$ in the experiment **EXP** of our symmetric encryption/decryption functions (**SE**), which is a game controlled by the simulator $\mathcal{S}$. We also define two pseudorandom hash functions, $\omega_1$ and $\omega_2$, which satisfy a property we call "indistinguishability" (**Ind**): the attacker $\mathcal{A}$ can make hash queries to $\omega_1$ and $\omega_2$ on a message $M'$ of its choice, and these functions behave like our real hash functions ($H_1$ and $H_2$), where *i* ∈ {*U*, *V*}. The simulator can also switch between these function pairs to respond to each query made by $\mathcal{A}$ during the simulation rounds of the above experiment. Finally, the simulator $\mathcal{S}$ is given a challenge target message *M* chosen by $\mathcal{A}$.

At this time, $\mathcal{S}$ flips a coin *b*. If *b* = 0, $\mathcal{S}$ randomly chooses ($\omega_1$, $\omega_2$) to generate the hashed value of *M* and returns it to $\mathcal{A}$. Otherwise, $\mathcal{S}$ forwards *M* to ($H_1$, $H_2$) to obtain the hash value. $\mathcal{A}$'s goal is to guess correctly, with non-negligible probability, whether the hashed value comes from ($\omega_1$, $\omega_2$) or ($H_1$, $H_2$).
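The coin-flip experiment just described can be sketched as follows. This is a simplified illustration with a single function pair collapsed into one function; the class and function names are hypothetical, and SHA-256 is only a placeholder for the real hash. In the random oracle model the real hash would itself be an oracle that the adversary can only query, so a publicly computable placeholder is used here purely to show the control flow.

```python
import hashlib
import secrets


class LazyRandomFunction:
    """Random function sampled lazily: a fresh random output for each new
    input, and the same output again when an input is repeated."""

    def __init__(self, out_bytes: int = 32):
        self.table = {}
        self.out_bytes = out_bytes

    def __call__(self, m: bytes) -> bytes:
        if m not in self.table:
            self.table[m] = secrets.token_bytes(self.out_bytes)
        return self.table[m]


def real_hash(m: bytes) -> bytes:
    """Placeholder for the real hash function (an assumption)."""
    return hashlib.sha256(m).digest()


def experiment(adversary, message: bytes) -> bool:
    """Run one round; return True when the adversary's guess equals b."""
    b = secrets.randbits(1)            # simulator's coin flip
    omega = LazyRandomFunction()
    value = omega(message) if b == 0 else real_hash(message)
    return adversary(message, value) == b
```

An adversary that ignores its input and guesses a constant wins with probability exactly one half, which is the baseline the advantage is measured against.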

**Lemma 2** (Indistinguishability)**.** *Our symmetric encryption/decryption functions satisfy the indistinguishability property if no attacker $\mathcal{A}$ can guess the source of the hashed value of the chosen message M with probability more than $\frac{1}{2}$ plus a negligible $\varepsilon^{\ast}$ within the polynomial time bound $t^{\ast}$. That is,*

$$\left|\Pr\left[b' \longleftarrow \mathcal{A}^{(\omega_1, \omega_2, H_1, H_2)}(M) \,\middle|\, b = b'\right] - \frac{1}{2}\right| \le \varepsilon^{\ast}.$$

*Therefore, we conclude that*

$$Adv^{\text{Ind}}_{\mathcal{A}, \mathcal{SE}}(\theta, t^{\ast}) \le \frac{1}{2} + \varepsilon^{\ast}.$$

#### *4.4. Indistinguishable-Chosen Cipher-Text Attack (Ind-CCA)*

In this scheme, we define our proposed asymmetric encryption/decryption function (**ASE**), which satisfies the semantic security in the following definitions.

First, we define an attacker $\mathcal{A}$ that can make encryption and decryption queries in our scheme. The attacker $\mathcal{A}$ can make an encryption query on a chosen message, which we denote by $M'$. The attacker $\mathcal{A}$ can also make a decryption query to the decryption oracle, whose task is to decrypt the cipher-text sent by $\mathcal{A}$. Next, we define *Game*, the simulation of our proposed scheme, which is equipped with several oracles that answer the adversary according to the attacker's input messages. One of them is the encryption oracle $AE_{pk_T}(\cdot, \theta)$ with the security parameter *θ*, which generates the cipher-text for the received input $M_b$, where *b* ∈ {0, 1}. We also model the decryption oracle, which receives a cipher-text *C* from the attacker $\mathcal{A}$ and returns the decrypted message *M* to $\mathcal{A}$. In the following, we consider two situations involving $\mathcal{A}$.

**Phase 1**: In this phase, the attacker $\mathcal{A}$ can make decryption and encryption queries on a chosen message (call it $M'$). That is, if $\mathcal{A}$ makes an encryption query on the input message $M'$, then $C' \longleftarrow AE_{pk_T}(M', \theta)$ is returned to $\mathcal{A}$. $\mathcal{A}$ can also make a decryption query on the cipher-text $C'$; the simulator then forwards $C'$ to the decryption oracle and returns the resulting message $M'$ to $\mathcal{A}$. Additionally, $\mathcal{A}$ can make other kinds of queries, such as hash queries to the hash oracles.

**Challenge**: After $\mathcal{A}$ has trained on the above encryption/decryption queries many times, it chooses a challenge message pair $(M_0^{\ast}, M_1^{\ast})$ and sends it to the simulator. The simulator tosses a coin *b* after it receives this message pair and computes the challenge cipher-text $C^{\ast} \longleftarrow AE_{pk_T}(M_b^{\ast}, \theta)$; that is, $M_1^{\ast}$ is encrypted if *b* = 1 and $M_0^{\ast}$ otherwise. After $\mathcal{A}$ has received the cipher-text of the chosen target messages $(M_0^{\ast}, M_1^{\ast})$, the only restriction is that $\mathcal{A}$ cannot query the decryption oracle on the cipher-text $C^{\ast}$. Such a query would make the simulation fail because the simulator cannot answer it for $C^{\ast}$. Apart from this query, $\mathcal{A}$ can make other kinds of queries on different messages.

**Lemma 3.** *In this lemma, we model the above actions as the game simulation steps, which we played with the attacker* A*.*

```
Game^{Ind-CCA-b}_{A,ASE}(θ)
Phase 1.
  T ∈ {U, V}, {M_0, M_1} ←− A^{ASE_{pk_T}(·,θ), ASD_{sk_T}(·,θ), H_1(·), H_2(·)}
Challenge Phase.
  b ∈ {0, 1}, C* ←− ASE_{pk_T}(M*_b, θ),
  b' ←− A^{ASE_{pk_T}(·,θ)}(C*, M*_0, M*_1)
Return b'.
```
The advantage function of the adversary $\mathcal{A}^{\text{Ind-CCA}}_{ASE}(\cdot, \theta)$ is defined as

$$Adv^{\text{Ind-CCA}}_{\mathcal{A},ASE}(\theta) = \left|\Pr\left[Game^{\text{Ind-CCA-1}}_{\mathcal{A},ASE}(\theta) = 1\right] - \Pr\left[Game^{\text{Ind-CCA-0}}_{\mathcal{A},ASE}(\theta) = 1\right]\right| < \frac{1}{2}\left|\Pr\left[Game^{\text{Ind-CCA-1}}_{\mathcal{A},ASE}(\theta) = 1\right]\right| \le \varepsilon'.$$
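The mechanics of this game (training queries, a coin flip, a challenge cipher-text, and a decryption oracle that refuses the challenge) can be sketched in code. The sketch below is not the paper's *ASE*: it deliberately uses toy textbook RSA, which is deterministic and therefore not Ind-CCA secure, so that a simple adversary visibly wins. All names and parameters are hypothetical.

```python
import secrets

# Toy textbook RSA (deterministic, hence NOT Ind-CCA secure); illustrative only.
P, Q, E = 2**61 - 1, 2**89 - 1, 65537
N = P * Q
D = pow(E, -1, (P - 1) * (Q - 1))


def encrypt(m: int) -> int:
    return pow(m, E, N)


def decrypt(c: int) -> int:
    return pow(c, D, N)


def ind_cca_game(adversary) -> bool:
    """Run one game; return True when the guess b' equals the coin b."""
    challenge = {}

    def dec_oracle(c: int) -> int:
        # The only restriction: the challenge cipher-text may not be decrypted.
        if c == challenge.get("c"):
            raise ValueError("decryption of the challenge cipher-text is not allowed")
        return decrypt(c)

    m0, m1 = adversary.choose(encrypt, dec_oracle)   # Phase 1
    b = secrets.randbits(1)                          # challenger's coin
    challenge["c"] = encrypt((m0, m1)[b])            # C* = Enc(M_b)
    return adversary.guess(challenge["c"], encrypt, dec_oracle) == b


class ReencryptAdversary:
    """Wins against any deterministic scheme by re-encrypting M_0 itself."""

    def choose(self, enc, dec):
        self.m0, self.m1 = 2, 3
        return self.m0, self.m1

    def guess(self, c_star, enc, dec):
        return 0 if enc(self.m0) == c_star else 1
```

Because the toy scheme is deterministic, this adversary wins every run, which is exactly the behavior an Ind-CCA secure scheme must rule out.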

#### *4.5. Partner Function*

In this definition, we define the partner function. We assume that there is an instance $\Pi_i^k$ whose actions are the same as those of player *i* in the *k*-th session, where *i*, *j* ∈ {*U*, *V*} and *k* ∈ *N*, with *N* the total number of players. Let the partner function return the instance of player *j* (call it $\Pi_j^{k'}$) in the $k'$-th session, where *i*, *j* ∈ {*U*, *V*} and $k'$ ∈ *N*. The instances $\Pi_i^k$ and $\Pi_j^{k'}$ each believe that the other side is the real player *i*, *j* ∈ {*U*, *V*} in the sessions *k*, $k'$ ∈ *N*, respectively. We say that the two instances $\Pi_i^k$ and $\Pi_j^{k'}$ are partnered if the following statements are true:


#### *4.6. Freshness*

In this definition, we define freshness. We say that an instance $\Pi_i^k$ is "fresh" if it satisfies the following conditions.


#### *4.7. Forward Secrecy (FS)*

Our proposed two-factor patient authentication scheme provides forward secrecy (FS) if $\mathcal{A}$ cannot compromise past information even after sending *Corrupt*(*i*) (or *Corrupt*(*j*)) to the player *i* (or *j*), where *i*, *j* ∈ {*U*, *V*}.

**Theorem 1.** *Assume that ASE is an indistinguishable-CCA (Ind-CCA) secure asymmetric encryption/decryption scheme equipped with two secure hash functions, $H_1$ and $H_2$, which can be replaced with two random oracle (RO) functions, respectively. We also assume that our proposed patient electronic health record exchange scheme (PEHRES), which is forward secure (FS) and unforgeable (Unf), satisfies the following condition. In other words, if our proposed scheme is secure, then*

$$\begin{split} Adv^{\text{FS,Unf,Ind-CCA}}_{\text{PEHRES}}(\theta, t) &\le \frac{1}{2}\left(I^{2} q_h q_e q_s \left(Adv^{\text{Ind-CCA}}_{ASE,\mathcal{D},\mathcal{C}^{L}_{HCA}}(\theta, t')\right) + 1\right) \\ &\quad + \frac{1}{2}\left(I^{2} q_h q_e \left(Adv^{\text{Ind-CCA}}_{ASE,\mathcal{D},\mathcal{C}^{L}_{V}}(\theta, t')\right) + 1\right) \\ &\quad + \frac{1}{2}\left((I q_h)^{2} Adv^{\text{Ind}}_{\mathcal{A},\mathcal{SE}}(\theta, t^{\ast}) + 1\right) + (I^{3} q_s)\, Adv^{\text{Unf}}_{S,\mathcal{S},\mathcal{F}}(\theta, t^{\ast}) + \varepsilon, \quad t \le t' + t^{\ast}, \end{split} \tag{4}$$

*where t is the total execution time, $t'$ is the maximum total experiment time, including the adversary's execution time, $t^{\ast}$ is the maximum total time to guess the real session key, I is an upper bound on the number of parties, $q_e$ and $q_s$ are upper bounds on the numbers of encryption queries and decryption oracle queries, respectively, $q_h$ is an upper bound on the number of $H_1$ and $H_2$ queries in the experiment, and ε is a negligible advantage.*

#### **5. Security Analysis**

In this section, we provide security analysis and functional analysis of our proposed scheme.

#### *5.1. Replay Attack Resistance*

In the EHR migration phase, we adopt the random values $r'_U$, $r^{\ast}_V$, and $r^{\ast}_W$ as our authentication challenge numbers. We assume that an attacker can capture authentication messages during the protocol communication and may replay these captured messages to the server *W* to impersonate the patient *U*. Before communicating with the server *W*, the server *V* checks whether the message was used in some earlier session, that is, whether any of $r'_U$, $r^{\ast}_V$, and $r^{\ast}_W$ was used before. If one of them was used, *V* closes the session and records the event as a replay attack.
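The freshness check described above amounts to remembering which challenge numbers have already been seen. A minimal sketch follows; the class name `ReplayGuard`, the in-memory set, and the session identifiers are assumptions for illustration, and a deployed system would also need persistence and expiry.

```python
class ReplayGuard:
    """Tracks challenge numbers already used and logs suspected replays."""

    def __init__(self):
        self.seen = set()
        self.replay_log = []

    def accept(self, session_id: str, nonces: tuple) -> bool:
        """Return True if every nonce is fresh; otherwise log and reject."""
        if any(n in self.seen for n in nonces):
            self.replay_log.append(session_id)  # record the replay attempt
            return False                        # close the session
        self.seen.update(nonces)
        return True
```

For example, a session that reuses even one of the three values is rejected, while its other, previously unseen values are not marked as used.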

#### *5.2. Resist User Impersonation Attack*

In the proposed scheme, the adversary cannot replay any authentication message without the user *U*'s biometric information *BioU*, and it cannot successfully guess the random number $r'_U$ to impersonate the server *V*. Additionally, the adversary has only a negligible probability of forging the patient's signature presented to the server *V*: the server *V* checks the signature *SU* to authenticate the patient *U*'s identity in the migration registration phase, and the adversary cannot forge *U*'s signature *SU* with non-negligible probability under the RSA factoring problem. Therefore, our scheme can resist user impersonation attacks.

#### *5.3. Provide Mutual Authentication*

In the EHR migration phase, a patient *U* can delegate the server *V* to perform the EHR migration exchange with the system of the desired hospital *W*. Server *V* performs a challenge-response exchange with the server of *W*, and after successful authentication both agree on a session key for later use. During the authentication rounds, *V* and *W* check the freshness of the random numbers ($r'_U$, $r^{\ast}_V$, and $r^{\ast}_W$). If one of them is replayed, *V* or *W* detects it and rejects the session with the other party. Finally, it closes this phase and records that a replay attack occurred in the EHR migration phase.

#### *5.4. Provide Data Security*

In the EHR migration phase, all random numbers are generated by the two parties and are discarded once the authentication between them succeeds. In addition, $r'_U$, $r^{\ast}_V$, and $r^{\ast}_W$ are not only verified by the two parties *V* and *W* but also serve as response messages that confirm their respective identities. Hence, the adversary has only a negligible probability of replacing any of these messages and still passing the authentication process. In the data recovery phase, $r'_U$ is used to encrypt the patient's EHR, and the adversary has only a negligible probability of obtaining a patient's EHR, under the assumption that the symmetric encryption/decryption function is indistinguishable for the adversary within a polynomial time bound.

#### *5.5. Session Key Establishment*

In the EHR migration phase, the server *V* and the server *W* also agree on a common session key after they successfully perform challenge-response authentication with each other. This session key can be used not only for later communication but also for symmetric encryption/decryption. In the appendix, we provide a formal security proof of the session key.

#### *5.6. Forward Secrecy Proof*

In the EHR migration phase, a patient can delegate the server *V* to authenticate with the desired server of hospital *W*. They can then build the session key after successful authentication. In fact, they can use this session key to communicate with each other to transfer the patient *U*'s EHRs or update the patient *U*'s EHR. With this property, the system can reduce the communication bits and improve the efficiency of data transmission. In the appendix, we also provide a formal secrecy proof of the session key.

#### *5.7. EHR Fair Exchange*

In the EHR migration phase, if *W* does not receive *U*'s EHR from *V*, or if *U*'s EHR is corrupted, then the patient *U* can send a data recovery request to the *HCA* and ask it to resolve the situation by providing the above signatures and *V*'s receipt. If the signatures are valid, the *HCA* performs the data recovery phase and forwards the patient's encrypted EHR to the system of the hospital *W*. In this way, the server of hospital *W* can still obtain the patient *U*'s EHR with the help of the *HCA*.
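The decision the *HCA* makes on a recovery request can be sketched as a simple gate: release the encrypted EHR only when every piece of evidence verifies. The function name, the evidence keys, and the `verify` and `fetch_encrypted_ehr` callbacks are all hypothetical; how the *HCA* actually obtains the encrypted record is defined by the data recovery phase and is not modeled here.

```python
from typing import Callable, Dict, Optional


def hca_handle_recovery(
    evidence: Dict[str, bytes],
    verify: Callable[[str, bytes], bool],
    fetch_encrypted_ehr: Callable[[], bytes],
) -> Optional[bytes]:
    """Return the encrypted EHR for W if all evidence is valid, else None."""
    required = ("S_U", "S_HCA", "S_V")   # patient, HCA, and receipt signatures
    for name in required:
        signature = evidence.get(name)
        if signature is None or not verify(name, signature):
            return None                  # missing or invalid evidence: refuse
    return fetch_encrypted_ehr()         # forward only the encrypted record
```

The fairness property rests on this gate: a patient holding valid signatures and *V*'s receipt can always obtain delivery, while incomplete or forged evidence yields nothing.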

#### *5.8. Offline Trusted Third Party*

In the proposed scheme, we assume that there is an *HCA* that generates the patient's EHR migration signature with a delegation document and performs data recovery. We assume that the on-line device of the *HCA* generates the signature after verifying the requesting party's signature in the migration registration phase. Only when a request arrives in the data recovery phase does the *HCA* come on-line and resolve the situation, after verifying the requesting party's evidence, including the registration signatures and the related signatures. Under this setting, our trusted third party does not stay on-line all the time and appears only when it is needed. Additionally, only the *HCA* knows the link information (*UIDU*, *IDU*) of the patient *U*. Therefore, the patient's real identity is not disclosed during the EHR migration transaction.

From the above security analysis properties, we take [10] as a reference and make comparisons with schemes from [6–10]. In the following, we provide some security analysis definitions for security comparison (Table 1).


**Table 1.** Security comparison.

A1: Replay Attack Resistance; A2: Resist User Impersonation Attack; A3: Provide Mutual Authentication; A4: Provide Data Security; A5: Session Key Establishment; A6: Forward Secrecy Proof; A7: EHR Fair Exchange; A8: Offline Trusted Third Party.

#### *5.9. Efficiency Comparisons*

In this section, we evaluate our proposed scheme's efficiency. First, we assume that our scheme's parameter *p* is 1024 bits long for security. Let *H* be the computation time of one hashing operation, *Exp* the computation time of one modular exponentiation with a 1024-bit modulus, *M* the computation time of one modular multiplication with a 1024-bit modulus, *ECM* the computation time of multiplying a point by a number over an elliptic curve, and *ECP* the computation time of a bilinear pairing operation on two elements over an elliptic curve [13–15]. We also let *Sig*, *ASE*, *ADE*, *SE*, and *SD* be the signature operation time, the asymmetric encryption time, the asymmetric decryption time, the symmetric encryption time, and the symmetric decryption time, respectively. We assume that our proposed scheme can be implemented on an elliptic curve over a 163-bit field, which has the same security level as a 1024-bit public key cryptosystem such as RSA or the Diffie-Hellman cryptosystem. We also assume that *Exp* = 8.24*ECM* on an ARM CPU running at 200 MHz [15]. We also use the following relations: *Exp* ≈ 240*M* = 600*H* ≈ 3*ECP* and *ECA* ≈ 5*M* [16–22].

Based on [23], a public key encryption/decryption operation time in an elliptic curve is approximately 1*ECA* and 1*ECM* + 1*ECA*, respectively. Therefore, our proposed scheme total computation time cost is about 9*H* + 3*Sig* + 14 ⊕ +2*ASE* + 1*ADE* ≈ 60.075*M* + 14⊕. Due to the different properties of the above schemes, we omitted the efficiency comparisons and found some currently survey papers [11,24] that have the same functional properties as our proposed scheme.
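The unit relations quoted above can be turned into a small calculator that expresses each operation in modular multiplications (*M*). The following Python sketch is illustrative only: the variable and function names are hypothetical, the signature cost is left as an input because the text gives no *Sig*-to-*M* conversion, and the 14 XOR operations are excluded. It therefore does not attempt to reproduce the reported 60.075*M* total.

```python
from fractions import Fraction

# Unit relations quoted in the text, all expressed in modular multiplications (M).
EXP = Fraction(240)            # Exp ~ 240 M
H = EXP / 600                  # 240 M = 600 H  ->  H = 0.4 M
ECP = EXP / 3                  # Exp ~ 3 ECP    ->  ECP = 80 M
ECM = EXP / Fraction("8.24")   # Exp = 8.24 ECM
ECA = Fraction(5)              # ECA ~ 5 M

# Public key encryption/decryption costs as stated from [23].
ASE = ECA
ADE = ECM + ECA


def scheme_cost_in_m(sig_cost_in_m):
    """Cost of 9H + 3Sig + 2ASE + 1ADE in units of M (XORs excluded).

    sig_cost_in_m is a hypothetical input: the text does not state
    how one signature operation converts to M.
    """
    return 9 * H + 3 * Fraction(sig_cost_in_m) + 2 * ASE + 1 * ADE


if __name__ == "__main__":
    print("H =", float(H), "M; ECP =", float(ECP), "M; ECM =", round(float(ECM), 3), "M")
    # Non-signature part of the total only (Sig cost set to zero).
    print("cost without signatures =", round(float(scheme_cost_in_m(0)), 3), "M")
```

Using exact fractions avoids rounding drift when several approximate relations are chained; the values are converted to floats only for display.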

In [11], the authors proposed a dynamic consent model of health data sharing using blockchain technology. They combine the consent representation models DUO and ADA-M [24] to let patients control their EHR sharing by matching request queries against full access rights. Their method is designed for building an EHR platform, but it is not a practical mechanism, with a formal security proof, for patients exchanging their EHRs in a blockchain environment. In [12], the authors proposed a patient-centric EHR access right framework model based on blockchain technology. We think that this is a good idea for building health information exchange modules with blockchain in the future, but it does not currently offer a practical solution for EHR migration, even in a blockchain environment. Our proposed scheme is built from functional blocks such as signature functions and other authentication functions. In future work, our proposed scheme could add a smart contract function to generate a verifiable patient EHR block in a blockchain network. Hence, our proposed scheme could be used in both blockchain and non-blockchain environments.

In the efficiency evaluation of our scheme, we used a desktop running Ubuntu 20.04 with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz and 15 GB of memory. The simulation experiment was carried out in the Go language using the standard "crypto/elliptic" library. We simulated every phase 20 times; the results are shown in Figures 4–6.
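The measurement procedure described above (run each phase 20 times and record the elapsed time) can be sketched as follows. This is a minimal Python illustration rather than the authors' Go code: `time_phase` and `placeholder_phase` are hypothetical names, and the placeholder workload (repeated SHA-256 hashing) merely stands in for a real protocol phase.

```python
import hashlib
import statistics
import time


def time_phase(phase, runs=20):
    """Run `phase` `runs` times and return the per-run wall-clock times in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        phase()
        samples.append(time.perf_counter() - start)
    return samples


def placeholder_phase():
    """Hypothetical stand-in for one protocol phase: 1000 chained SHA-256 hashes."""
    digest = b"ehr-migration"
    for _ in range(1000):
        digest = hashlib.sha256(digest).digest()
    return digest


if __name__ == "__main__":
    samples = time_phase(placeholder_phase)
    print(f"runs={len(samples)} mean={statistics.mean(samples):.6f}s "
          f"min={min(samples):.6f}s max={max(samples):.6f}s")
```

Reporting the minimum and maximum alongside the mean gives a rough sense of run-to-run variation, which a single averaged figure would hide.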

In the future, we will discuss the forged HCA problem [25] and other applications such as neural network environments for COVID-19 patients [26] exchanging their EHRs. We hope to have a good solution to the above problems.

**Figure 4.** The migration registration phase simulation.

**Figure 5.** The EHR migration phase simulation.

**Figure 6.** The data recovery phase simulation.

#### **6. Security Proof**

In this section, we describe the adversary and its success probability. We model the *Game*, the simulation steps of our scheme, and the related oracle responses.

An adversary (call it A) can control all communication messages in this scheme and can obtain related information by querying oracles. A *Game* is the simulation of our proposed scheme; it is equipped with several kinds of oracles, which reply to the adversary's queries. There is also a simulator (call it S) that controls the simulation and uses A's ability to break the hard problem defined in the security definition.

Let *Game* be a "game", the simulation of our scheme, where the adversary A can ask queries to the oracles, and the oracles can answer back to the adversary. The following are query types that an adversary can make in the game.


**Proof of Theorem 1.** First, we assume that there is an adversary A that attempts to attack our patient EHR exchange scheme (*PEHRES*) in the forward secure sense. We then let *dis* be the event in which A can distinguish at least one ciphertext in *PEHRES* with non-negligible probability. At the same time, we let *forge* be the event in which the adversary D can forge the signature of our *PEHRES* with non-negligible probability. We assume that

$$\Pr\_{\mathcal{A}}[b = b'] \le \Pr\_{\mathcal{A}}[b = b' \land \overline{dis} \land \overline{forge}] + \Pr\_{\mathcal{A}}[dis] + \Pr\_{\mathcal{A}}[forge],$$

where *b* and *b*′ are coin flips chosen by the simulator and the attacker A, respectively. We also assume that

$$\Pr\_{\mathcal{F}^\*}[forge] \le \Pr\_{\mathcal{F}^\*}[\mathcal{F}^\* \to S\_{U}^\* | Ver(S\_{U}^\*) = 1] + \Pr\_{\mathcal{F}^\*}[\mathcal{F}^\* \to S\_{HCA}^\* | Ver(S\_{HCA}^\*) = 1],$$

where *S*<sup>∗</sup><sub>*U*</sub> and *S*<sup>∗</sup><sub>*HCA*</sub> are signatures forged by the attacker F<sup>∗</sup>. We then use the following three lemmas to complete this security proof.

**Lemma 4.** *We assume that there is no event such that the attacker* A *can distinguish the ciphertext C*∗ *with non-negligible probability*

$$\Pr\_{\mathcal{A}}[dis] \le \frac{1}{2} (I^2 q\_h q\_e q\_s (Adv\_{ASE, \mathcal{D}, C\_{HCA}^\*}^{\text{Ind-CCA}}(\theta, t')) + 1) + \frac{1}{2} (I^2 q\_h q\_e (Adv\_{ASE, \mathcal{D}, C\_V^\*}^{\text{Ind-CCA}}(\theta, t')) + 1),$$

*in the polynomial time bound t*′ *under the above Ind-CCA security definition with q<sub>h</sub> hash queries, at most q<sub>e</sub> encryption queries, and at most q<sub>s</sub> decryption queries, respectively.*

**Proof of Lemma 4.** We assume that *Pr*[*dis*] is a non-negligible probability in the simulation game. We can then construct an attacker D whose task is to distinguish the ciphertext under the Ind-CCA encryption/decryption scheme. There is also an attacker F whose goal is to break the encryption/decryption *SE* of our proposed scheme. Next, we construct D as the simulator that simulates the attacking environment in which F can mount its attack. First, D simulates an encryption oracle *SE*<sub>*pk<sub>i</sub>*</sub>(·, *θ*), where *i* ∈ {*U*, *V*}, and generates the ciphertext *C* for the attacker F on the plaintexts (*M*′) chosen by the attacker D in the selected instance Π<sup>*k*</sup><sub>*i*<sup>∗</sup></sub>, where the partner of Π<sup>*k*</sup><sub>*i*<sup>∗</sup></sub> is *p*<sub>*j*<sup>∗</sup></sub>. In addition, D also simulates the decryption oracle to answer the decryption queries issued by the attacker D. We consider the following steps. D prepares all hash functions, including *H*<sub>1</sub> and *H*<sub>2</sub>, two collision-resistant hash functions. It also generates the instances *i*<sup>∗</sup>, *j*<sup>∗</sup> ← [1, ..., *I* − 1] of each player *i*, where *i* ∈ {*U*, *V*}. It can make the above two hash queries *l*<sup>∗</sup> times, where *l*<sup>∗</sup> ← [1, ..., *q<sub>h</sub>*].

#### **Hash query**

In this hash query phase, the simulator also responds to all kinds of hash queries in each stage.


#### **Phase 1**


#### **Challenge**


Finally, F has a set with instances of players *i*<sup>∗</sup> and *j*<sup>∗</sup> with *q<sub>h</sub>* total hash queries, at most *q<sub>e</sub>* encryption queries, and *q<sub>s</sub>* decryption queries. At this time, D does not fail in the simulation environment with F's correct guessing, where *b* = *b*′ has non-negligible probability. The following equation will then hold:

$$\begin{aligned} Adv\_{ASE,\mathcal{D},C\_{HCA}^\*}^{\text{Ind-CCA}}(\theta,t') &\le \frac{1}{I^{2}q\_{h}q\_{e}q\_{s}}(Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1] - Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-0}}(\theta) = 1]) \\ &= \frac{1}{I^{2}q\_{h}q\_{e}q\_{s}}(Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1] - (1 - Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1])) \\ &= \frac{1}{I^{2}q\_{h}q\_{e}q\_{s}}(2(Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1]) - 1). \end{aligned} \tag{5}$$

In the simulation game for the ciphertext *C*<sup>∗</sup><sub>*V*</sub>, the simulation is the same as above. Therefore, we omit it and conclude that

$$Adv\_{ASE, \mathcal{D}, C\_V^\*}^{\text{Ind-CCA}}(\theta, t') \le \frac{1}{I^2 q\_h q\_e} (2 (Pr[Game\_{ASE, \mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1]) - 1). \tag{6}$$

We can then summarize the total probability as follows:

$$\begin{aligned} Pr\_{\mathcal{A}}[dis] &\le Pr[Game\_{ASE,\mathcal{F}}^{\text{Ind-CCA-1}}(\theta) = 1] \\ &\le \frac{1}{2} (I^2 q\_h q\_e q\_s (Adv\_{ASE,\mathcal{D},C\_{HCA}^\*}^{\text{Ind-CCA}}(\theta, t')) + 1) + \frac{1}{2} (I^2 q\_h q\_e (Adv\_{ASE,\mathcal{D},C\_V^\*}^{\text{Ind-CCA}}(\theta, t')) + 1). \end{aligned} \tag{7}$$
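The factor 2·*Pr*[·] − 1 appearing in Equations (5) and (6) has the same shape as the textbook relation between the winning probability and the distinguishing advantage in a coin-flip game with a uniform challenge bit. The following Python sketch illustrates that textbook relation only; it is not part of the proposed scheme, and the function names and example probabilities are hypothetical.

```python
from fractions import Fraction


def win_probability(p_out1_given_b1, p_out1_given_b0):
    """Pr[b = b'] when the challenger's coin b is uniform.

    The adversary outputs b' = 1 with the given conditional probabilities.
    """
    p1 = Fraction(p_out1_given_b1)
    p0 = Fraction(p_out1_given_b0)
    return Fraction(1, 2) * p1 + Fraction(1, 2) * (1 - p0)


def advantage(p_out1_given_b1, p_out1_given_b0):
    """Distinguishing advantage: Pr[b' = 1 | b = 1] - Pr[b' = 1 | b = 0]."""
    return Fraction(p_out1_given_b1) - Fraction(p_out1_given_b0)


if __name__ == "__main__":
    # A blind guesser wins half the time and has zero advantage.
    print(win_probability("1/2", "1/2"), advantage("1/2", "1/2"))
    # A hypothetical adversary with a small bias toward the right answer.
    p1, p0 = "3/5", "2/5"
    print(win_probability(p1, p0), advantage(p1, p0))
    # In both cases the advantage equals 2 * Pr[b = b'] - 1.
    print(advantage(p1, p0) == 2 * win_probability(p1, p0) - 1)
```

Exact fractions are used so that the identity can be checked with equality rather than with a floating-point tolerance.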

**Lemma 5.** *Before we prove this lemma, we assume that there is no attacker* A *that can guess the real session key in the event that the ciphertext C*∗ *generated by the symmetric encryption (SE) functions cannot be distinguished by* A *correctly with non-negligible probability. We then have*

$$\Pr\_{\mathcal{A}}[b = b' \land \overline{\text{dis}} \land \overline{\text{forge}}] \le \frac{1}{2} ((Iq\_h)^2 \text{Adv}\_{\mathcal{A}, \text{SE}}^{\text{Ind}}(\theta, t^\*) + 1),$$

*in the polynomial time t*<sup>∗</sup> *under the random oracle (RO) assumption with total qh hash queries.*

**Proof of Lemma 5.** In this proof, we construct another simulator C that also simulates the attacking environment in which A mounts its attack. If A can guess the real session key successfully with non-negligible probability, then we can use A to break the random oracle assumption.


#### **Hash Query**

In this hash query phase, the simulator can answer all kinds of hash queries in each stage, as follows:


Finally, if C answers the *Test* query for Π<sup>*t*<sub>1</sub></sup><sub>*i*<sup>∗</sup></sub> and Π<sup>*t*<sub>2</sub></sup><sub>*j*<sup>∗</sup></sub> by using (*Z*<sup>∗</sup><sub>*n*</sub>, *H*<sub>1</sub>, *H*<sub>2</sub>, *ω*<sub>1</sub>, *ω*<sub>2</sub>), and A does not fail in guessing *b*′, then A answers the session key depending on its coin flip *b*′. We then have

$$\begin{aligned} Adv\_{\mathcal{C},\mathcal{A},SE}^{H\_{1},H\_{2},\omega\_{1},\omega\_{2}}(\theta,t) &= \Pr[\mathcal{C}(Z\_{n}^{\*},H\_{1},H\_{2},\omega\_{1},\omega\_{2}) = 1 \mid ssk\_{i,j} = H\_{1}((r\_{W}^{\*}+1) | (r\_{V}^{\*}+1))] \\ &\quad - \Pr[\mathcal{C}(Z\_{n}^{\*},H\_{1},H\_{2},\omega\_{1},\omega\_{2}) = 1 \mid ssk\_{i,j} \longleftarrow \{0,1\}^{\*}, t \in Z\_{q}^{\*}] \\ &\le \frac{1}{(Iq\_h)^2} (\Pr[\mathcal{A}(\cdot) = 1 \mid ssk\_{i^\*,j^\*} \text{ is real in } Test \text{ query}] - \Pr[\mathcal{A}(\cdot) = 1 \mid ssk\_{i^\*,j^\*} \text{ is random in } Test \text{ query}]) \\ &\le \frac{1}{(Iq\_h)^2} (2\Pr\_{\mathcal{A}}[b=b' \land \overline{dis} \land \overline{forge}]-1). \end{aligned} \tag{8}$$

Finally, we could conclude that

$$\Pr\_{\mathcal{A}}[b = b' \land \overline{dis} \land \overline{forge}] \le \frac{1}{2} ((Iq\_h)^2 Adv\_{\mathcal{C}, \mathcal{A}, SE}^{H\_1, H\_2, \omega\_1, \omega\_2}(\theta, t) + 1). \tag{9}$$

**Lemma 6.** *Before we prove Lemma 1, we assume that there is no event such that the attacker* F<sup>∗</sup> *can forge the signature SU of patient U with non-negligible probability*

$$\Pr\_{\mathcal{F}}[forge] \le (I^3 q\_s (Adv\_{Sig, \mathcal{S}, \mathcal{F}}^{Unf}(\theta, t^\*))),$$

*in the polynomial time bound t* <sup>∗</sup> *under the above Ind-CCA security definition with qh hash queries, at most qe encryption queries, and at most qs decryption queries, respectively.*

**Proof of Lemma 6.** Here we prove **Lemma** 1 (Unforgeability) stated above. We define **Game** as the simulation game that runs the proposed protocol under the control of the simulator S, as follows.

$$\begin{array}{l} \mathbf{Game}\_{\mathcal{A},Sig}^{Unf}(\theta,t) \\ \mathbf{Phase\ 1.} \\ \mathcal{F} \leftarrow \{M\}^{l} \\ S\_{i} \leftarrow \mathcal{F}^{Sig(sk\_{i},M\_{1}),RO\_{1},RO\_{2},H\_{1},H\_{2}}(\theta,t) \\ \mathbf{Challenge\ Phase.} \\ i \in \{U,HCA\},\ M^{\*} \leftarrow \mathcal{F}(M), \\ \text{Loop } j=1 \text{ to } l \\ \quad S\_{i,j}^{'} \leftarrow \mathcal{F}^{Sig(sk\_{i})}(\theta,t)(M^{\*}) \\ \quad \text{If } (Ver(S\_{i,j+1}^{'}) = 1 \text{ and } S\_{i,j+1}^{'} \notin S\_{i,j}) \\ \quad\quad \text{Break} \\ \quad\quad \text{Return } S\_{i,j+1}^{'} \\ \quad \text{else if } (j \ldots \end{array}$$

We first define the simulator S as the party that is given an instance of the RSA factoring problem, and we assume there is an attacker F whose goal is to forge a valid signature on the *Sig* function block.

The simulator S first chooses the security parameter *l* with the message space *M<sup>l</sup>*. S also selects two collision-resistant hash functions that map from *Z*<sup>∗</sup><sub>*n*</sub> → {0, 1} and two hash oracles *RO*<sub>1</sub> and *RO*<sub>2</sub>, respectively. After setting up the system parameters, S simulates each phase of the proposed EHR scheme. In the migration registration phase, the attacker F can impersonate the patient *U* and request *U*'s signature *S<sub>U</sub>* on a desired message. When S receives this request, it takes the message as input and outputs the signature *S<sub>U</sub>* with the help of the above secure digital signature function *Sig*. It then returns *S<sub>U</sub>* to F. F can also go on to ask for the signature *S<sub>HCA</sub>* of the hospital certification center *HCA* on the received signature of patient *P*. S receives the message tuple (*S<sub>U</sub>*, *Date*, *ID<sub>U</sub>*) and returns the signature *S<sub>HCA</sub>* to F. The same applies when F asks for the signature of *V*: S returns *S<sub>V</sub>* to F.

In these phases, F makes signature requests as described above. S then starts the **Challenge phase** and forwards the message *M<sub>i</sub>*, where *i* ∈ {*U*, *V*, *HCA*}. After *l* signature queries, F can forge *l* + 1 signatures *S*′<sub>*i*,*j*</sub>, where *S*′<sub>*i*,*j*+1</sub> ∉ *S*<sub>*i*,*j*</sub> ← *Sig*(*sk<sub>i</sub>*, *M*<sup>∗</sup>), *i* ∈ {*U*, *V*, *HCA*}, and *j* = 1 ∼ *l*. Finally, this forged signature also passes the verification *Ver*(*S*′<sub>*i*,*j*+1</sub>) successfully. We can then use F's ability to find a solution to the RSA factoring problem. Thus, we have

$$\begin{split} Adv\_{Sig,\mathcal{S},\mathcal{F}}^{Unf}(\theta,t^{\*}) &\geq |\Pr[S'\_{i,j+1} \gets \mathcal{F}\_{Sig}^{Unf}(M^{\*}), Ver(S'\_{i,j+1}) = 1]| \\ &= \frac{1}{I^{3}q\_{s}} (\Pr[S'\_{i,j+1} \gets \mathcal{F}\_{Sig}^{Unf}(M^{\*}), Ver(S'\_{i,j+1}) = 1]). \end{split} \tag{10}$$

Finally, we could conclude that

$$\Pr\_{\mathcal{F}}[forge] \le (I^3 q\_s) Adv\_{Sig, \mathcal{S}, \mathcal{F}}^{Unf}(\theta, t^\*).$$

After summarizing the above three lemmas, we can conclude that

$$\Pr\_{\mathcal{A}}[b = b'] \le \frac{1}{2} (I^2 q\_h q\_e q\_s (Adv\_{ASE, \mathcal{D}, C\_{HCA}^\*}^{\text{Ind-CCA}}(\theta, t')) + 1) + \frac{1}{2} (I^2 q\_h q\_e (Adv\_{ASE, \mathcal{D}, C\_V^\*}^{\text{Ind-CCA}}(\theta, t')) + 1) + \frac{1}{2} ((Iq\_h)^2 Adv\_{\mathcal{A}, SE}^{\text{Ind}}(\theta, t^\*) + 1) + (I^3 q\_s) Adv\_{Sig, \mathcal{S}, \mathcal{F}}^{Unf}(\theta, t^\*).$$

#### **7. Conclusions**

We propose a practical and provable patient EHR fair exchange scheme with key agreement for e-health information systems. Our scheme not only offers a solution to the seven problems described in Section 2 that arise when a patient attempts to migrate their personal data to another hospital, but also lets the patient maintain their anonymity during the data migration transaction. In addition, Table 1 shows a security and functional comparison with other related papers. The comparison indicates that our proposed scheme provides convenience, rapidity, and integrity.

Our mechanism provides secure data storage and the secure transfer of authorized information to designated locations. What information can be authorized, for example, whether COVID-19 patient privacy concerning patients' names, identities, and genetic sequences can be transmitted to different hospitals, is beyond this study's scope. This study guarantees secure data transfer and storage. Our scheme also provides a formal security proof in the random oracle model under chosen-ciphertext security. Our approach focuses on the security and privacy protection of patient EHRs rather than on the design of electronic health systems. It not only serves as a high-level functional module for integrity but also provides an efficient and contactless data transfer method that allows for medical data aggregation and protects patient anonymity, especially relevant in the context of the global COVID-19 pandemic. In the future, we will extend our scheme to be applicable for COVID-19 patient EHR exchange in a neural network environment.

**Author Contributions:** Conceptualization, M.-T.C.; Formal analysis, M.-T.C. and T.-H.L.; Methodology, M.-T.C. and T.-H.L.; Writing—original draft, M.-T.C.; Writing—review & editing, T.-H.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Acknowledgments:** This study was supported in part by grants from the Ministry of Science and Technology of the Republic of China (Grant No. MOST 109-2221-E-167-028-MY2).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **A Data Driven Approach for Raw Material Terminology**

**Olivera Kitanović 1,\*,†, Ranka Stanković 1,†, Aleksandra Tomašević 1,†, Mihailo Škorić 1,†, Ivan Babić 2,† and Ljiljana Kolonja 1,†**


**Abstract:** The research presented in this paper aims at creating a bilingual (sr-en), easily searchable, hypertext, born-digital, corpus-based terminological database of raw material terminology for dictionary production. The approach is based on linking dictionaries related to the raw material domain, both digitally born and printed, into a lexicon structure, aligning terminology from different dictionaries as much as possible. This paper presents the main features of this approach, data used for compilation of the terminological database, the procedure by which it has been generated and a mobile application for its use. Available (terminological) resources will be presented—paper dictionaries and digital resources related to the raw material domain, as well as general lexica morphological dictionaries. Resource preparation started with dictionary (retro)digitisation and corpora enlargement, followed by adding new Serbian terms to general lexica dictionaries, as well as adding bilingual terms. Dictionary development is relying on corpus analysis, details of which are also presented. Usage examples, collocations and concordances play an important role in raw material terminology, and have also been included in this research. Some important related issues discussed are collocation extraction methods, the use of domain labels, lexical and semantic relations, definitions and subentries.

**Keywords:** raw material; mining; terminology; dictionary; terminology application; mobile application; digitization; lexical data; corpus data; linguistic linked open data

### **1. Introduction**

During the last decade, lexicography entered a new era due both to the rapid development of advanced computational methods and to the availability of a previously unseen abundance of language data in different modalities. These developments have opened new opportunities for producing modern Serbian monolingual and bilingual dictionaries, which will overcome the shortcomings of existing ones, characterized by obsolescence of macrostructure, microstructure and data presentation, frequent inaccuracy of translation, visual and typographic monotony, and a neglect of the needs of potential users [1]. These new, modern dictionaries will enable potential users, including students, translators, teachers, researchers and other interested parties, to find all information on formal and contextual properties of words and their interrelationships in one place. In addition to new human readable monolingual and bilingual dictionaries, machine readable dictionaries of both kinds are also needed. In this situation, a comprehensive approach that combines all available resources and can be used for producing various types of dictionaries, especially in specialized and terminological domains, seems to be the optimal solution.

According to the findings of the Elexis project [2], the main positive changes in lexicography in the last 10–15 years are mostly related to digitisation and automation of lexicographic work, online publishing (moving from paper to online) and, with the beginning of the corpus era, access to corpora supported by (semi)automatic extraction of terms. Automatic data extraction comprises data that is automatically obtained from corpora of authentic language use, which is then subjected to lexicographers' post-processing or included, as is, in the published dictionary, but marked as automatically derived from corpus data. It should be noted that data derived from existing lexical databases and dictionaries should be considered as reuse of data. One related issue is the processing and representation of terminological phrases, or multiword expressions (MWEs), ranging from compound nouns (e.g., nickname) to complex phrasal verbs (e.g., give up) and idiomatic expressions (e.g., break the ice), which has remained a challenge over the past 20+ years [3]. In our research we focused on semantically transparent terminological phrases, as well as terminological phrases that result in a meaning shift. Some frequent syntactic patterns and translation options will be discussed. In our approach we will use a combination of reuse of data, automatic extraction and manual postediting.

**Citation:** Kitanović, O.; Stanković, R.; Tomašević, A.; Škorić, M.; Babić, I.; Kolonja, L. A Data Driven Approach for Raw Material Terminology. *Appl. Sci.* **2021**, *11*, 2892. https://doi.org/10.3390/app11072892

Academic Editor: Chuan-Ming Liu

Received: 25 February 2021 Accepted: 16 March 2021 Published: 24 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The advantage of using online platforms, which offer the possibility of regular updates and a more effective collaboration via the internet, as well as the use of mobile devices were highlighted in literature [4]. The impact of mobile devices as a distribution method is immense, and a mobile-first approach is now instrumental. The general shift towards (mobile) life online brought a clear realization that "printed lexicography"—in general terms—is a thing of the past, and this also turned its business side upside down [5].

Wide adoption of mobile devices has created new ways of learning through interaction and communication and they are becoming integrated in the lives of today's students, enhancing mobility of the learning process. Thus, for example, Language for Specific Purposes (LSP) dictionaries are now being produced at the university level using mobile LSP lexicography. One such dictionary called MobiLex was produced at the Stellenbosch University in South Africa to enhance teaching and learning of historical terms, with favorable pedagogical consequences regarding the learning of such terms. Trends and developments in technology offer the possibility of changing the face of lexicographical support in a mobile environment, from a pedagogical perspective [6].

Big data analysis methods have opened new possibilities for analyzing corpora, which contain large amounts of textual data. Thus, for example, Chen et al. [7] propose a novel statistic-based corpus machine processing approach to refine big textual data, to be used for ESP (English for Specific Purposes). The approach is based on establishing a function word list and embedding it into the program, in order to refine the word list and keyword list. The aim is to enhance the efficiency of corpora processing, starting from preparatory work, followed by generating raw data, optimizing the process, and ending by generating refined data. COVID-19 news reports are used as a simulation example of big textual data and applied to verify the efficacy of the machine optimizing process.

Electronic lexicography offers important possibilities in comparison to the traditional approach. Examples of usage may be extracted from original texts and linked to dictionary entries. There are practically no limitations to the amount of data that can be added, including multimedial data, which results in better quality data. Various search options and different possibilities of database organization contribute to the efficiency of access. Dictionaries can be easily customized for specific needs of users' groups. Electronic lexicography also enables hybridization, by breaking limits between different types of language resources—for example, dictionaries, encyclopedias, term banks, lexical databases, translation tools and the like. Finally, active user involvement is possible, by enabling collaborative or community-based input to dictionaries [8].

This paper presents a data driven approach aimed at using opportunities offered by electronic lexicography, as well as various available techniques of Natural Language Processing (NLP), to develop a semi-automatic pipeline for dictionary production. The approach is focused on raw material terminology, with an emphasis on terminology related to the mining industry, as a case study, the main goal being to cover Serbian and bilingual English-Serbian terminology in the raw material domain, within a system that can be used for developing web and mobile dictionary applications. In developing this system, a data driven approach is adopted, relying on available textual, lexical and terminological resources, both in printed and electronic form. Within the development of this system, printed resources, the paper dictionaries covering raw material terminology, were subjected to systematic extensive digitisation.

In this approach, besides compiling a comprehensive multilingual lexical database of raw material terminology, lexicographic methods for automatic knowledge extraction are used, including corpus data analysis, automatic data extraction, editing and publishing extracted data in (online) dictionaries. Using extracted lexicographically relevant data (lemma lists, example sentences, collocations) as complementary resources in electronic dictionaries is known as the one-click dictionary or push-pull dictionary model, which is used, for example, in the Sketch-engine [9] for several languages, but has not yet been used for Serbian.

A similar approach to the one outlined in this paper was applied in development of the Sõnaveeb language portal of the Institute of the Estonian Language, which contains data from a number of dictionaries and termbases, with a total of 200,000 Estonian headwords with collocations, etymology, multi-word expressions, etc. The main issues to be resolved in their approach were the consistency of information, deduplication, parsing data fields containing more than one data element, moving from annotating form (e.g., italics) to annotating content (e.g., a citation) [10].

López-Úbeda et al. [11] present another interesting approach, which also combines different NLP techniques to develop a system for identification of biomedical terms in textual documents written in Spanish. The approach was applied for recognizing biomedical entities in various types of texts, including different knowledge resources (MedLine Encyclopedia, International Classification of Diseases, Unified Medical Language System, etc.). Although the tool developed within their approach has been developed for Spanish, the authors plan to expand its usability by incorporating multilingual support in the future, thus enabling it to be extrapolated to other languages.

The web and mobile applications for raw material terminology developed as a result of our approach are primarily intended for students and engineers involved in the raw material industry, as an aid in mastering terminology. They offer both English-Serbian and Serbian-English terminology, developed, inter alia, using a corpus comprising a variety of literature from the field of raw materials. Existing terminological dictionaries and general language dictionaries served as control dictionaries (listed in the bibliography and described in Sections 2.1 and 3.1). The developed dictionaries are not comprehensive, but rather contain basic terminology from various raw material subdomains (areas), needed to make reading professional literature easier, to support academic writing and to improve communication among professionals in the raw material industry. In addition to core raw material terms, some technical and academic vocabulary is also introduced, that is, words that often appear in professional literature.

The developed dictionaries are not prescriptive, as they do not prescribe how the terminology "should" be systematized, but rather record the terms in use. Therefore, they feature synonyms and also record technical jargon and localisms next to standard terminology. For example, *'rotorni bager'*, namely, *'bucket wheel excavator'*, is recorded on the Serbian side together with *'glodar'*, a jargon term, literally translated as *'gnawer'*. The publication of the dictionaries as a mobile app is especially important in view of the fact that the job of an engineer dealing with raw materials usually involves frequent field work and staying in the field for prolonged periods.

Section 2 gives an overview of available resources: paper and electronic dictionaries, as well as corpora used. Section 3 outlines preparation of resources, which includes digitization of paper dictionaries, enlargement of corpora, adding domain terms to general purpose morphological e-dictionaries and extraction of bilingual lists. The process of terminology compilation, from the perspective of monolingual and bilingual extraction, as well as the web and mobile form of the dictionary are given in Section 4. The last section offers a discussion, concluding remarks and outline of future plans for improvements and application in other areas.

#### **2. Available (Terminological) Resources**

Our approach relies heavily on available resources, both in paper and electronic form, such as traditional, paper dictionaries used in raw material industry, termbases covering raw material terminology, corpora of texts from the raw material domain as well as general-purpose electronic dictionaries of Serbian. This section offers an overview of these resources.

#### *2.1. Paper Dictionaries for Raw Material Domain*

The Bureau of Mines (U.S. Department of the Interior) had pioneered efforts in mining terminology, beginning in 1918 with Fay's "Glossary of the Mining and Minerals Industry", and continuing by the 1968 publication of "A Dictionary of Mining, Minerals, and Related Terms" (DMMRT). In this 5-year project, more than 100 bureau personnel (engineers, scientists, and editors) were involved in the technical review and publication production process of the dictionary, with 28,750 terms explained by 37,180 sense definitions [12]. This dictionary has been used for several decades at the University of Belgrade Faculty of Mining and Geology (UBFMG), and it is the main dictionary covering mining terminology in English in our approach. Online version of dictionary is published on The Edumine platform that provides professional development training for people in the mining industry [13].

A multilingual "Mining dictionary: Serbo-Croatian: English: French: German: Russian" (MD), containing 16,500 terms related to underground and surface excavation, preparation of mineral raw materials, as well as rock and soil mechanics in five languages, was published in 1970 [14]. This dictionary also contains terms from the fields of geology, metallurgy, electrical engineering, mathematics with computational methods, and civil engineering, to the extent they are related to mining. Each term entry has a Serbian headword, sometimes followed by synonyms, which is aligned with translations in four languages: English, French, German, and Russian. The interconnection of all five languages is given by additional indexes. Term entries have neither definitions nor usage examples. Since the dictionary is almost 50 years old, many terms are outdated, while some new terms are missing. This dictionary was our main source for extracting terminological equivalents in Serbian and English.

The first terminological "English-Croatian-Serbian Petroleum Dictionary" for the field of petroleum engineering [15] was followed, after 30 years, by the "English-Croatian encyclopedic dictionary of oil and gas exploration and production" [16], which is used both in Croatia and Serbia. With 12,200 definitions and 7100 terms, it contains a comprehensive vocabulary of both scientific and professional terms used by scientists, experts and students in the area of exploration and production of oil and gas, but also petroleum geology, geophysics, development deposits, drilling and equipping wells, ecology and other disciplines.

There is also a small bilingual dictionary of mineral processing [17] with 2415 translation pairs, in both directions, English to Serbian and Serbian to English, but also without definitions. Finally, a glossary of mineral processing terms with 1400 definitions in Serbian is used at the UBFMG, although it was not officially published [18].

All these dictionaries, and a number of other dictionaries, a total of 22, have been digitized for the purpose of our approach.

#### *2.2. Digital Resources in Raw Material Domain*

The development of digital resources for raw material terminology has been an ongoing activity at the UBFMG for several years now. It started with research related to the development of an ontology of mining equipment [19], in line with other research aimed at development of bilingual lexical resources [20]. The focus was then turned to development of termbases for the general field of mining engineering, and their transformation from their initial custom in-house scheme into the TermBase eXchange (TBX) Standard [21]. Another terminological resource, mostly handcrafted, was also developed to support knowledge management in specific subfields of mining engineering, such as mining equipment, mine safety and geostatistics [22]. A thesaurus of mining terminology is available online, but it is not systematically updated. Moreover the application has no new features, and it is not responsive. A modest experiment was made with developing students' vocabulary related to raw materials through flashcards and L1 in the CLIL Classroom [23], but it was not finalized with publicly available online resources.

Three digital resources already developed at UBFMG were included in our approach, two termbases, Termi [24], and GeoliSSTerm [25], and one ontology, Rudonto [26]. Termi supports development of terminological dictionaries in various fields (mathematics, computer science, raw material, library science, computational linguistics, power engineering, etc.) [27,28], and it has been selected as the most suitable resource to be used for the comprehensive multilingual lexical database of raw material terminology, while the remaining two resources have been incorporated in the dictionary production pipeline.

For systematic development of raw material terminology, textual resources, namely, bilingual libraries and corpora are also needed. Thus, articles from the scientific journal Underground Mining, published both in Serbian and English, stored in the bilingual digital library Bibliša, as one of the collections of aligned English-Serbian bi-texts [29,30], were also used in our approach.

A monolingual corpus from the mining domain was developed as part of a project related to managing mining project documentation using human language technology [31] and used within this research in the web and mobile applications.

#### *2.3. General Purpose Morphological Dictionaries*

Serbian has an extensive system of inflection and a complex agreement system that makes extraction of terminology more complicated, and thus the use of general purpose morphological dictionaries is indispensable for every lexicographic task [32].

An important lexical resource used for morphological analysis and extraction is the set of comprehensive electronic morphological dictionaries for Serbian (SrpMD) of simple- and multi-word units, covering general lexica, proper names, encyclopedic knowledge and terminology from a number of domains [33], with nearly 200,000 lexical entries. SrpMD entries include both a lemma and inflected forms supplied with grammatical information, semantic markers, domain information and relations of several types: derivational, lexical variation, and component relations (between single words and terminological phrases).

For example, the lexical entry *'rudar'* (miner, person engaged in mining, a worker in a mine) contains information related to part of speech: *'N'* (noun), morphological class *'N2'*, semantic tag *'+Hum'* (human), domain *'DOM = mining'*. Its inflected forms are: *'rudar'* (ms1v), *'rudara'* (mp2v:ms2v:ms4v:mw2v:mw4v), *'rudare'* (mp4v:ms5v), *'rudari'* (mp1v:mp5v), *'rudarima'* (mp3v:mp6v:mp7v), *'rudarom'* (ms6v), *'rudaru'* (ms3v:ms7v), where the brackets show grammatical information: *'m'* for masculine, *'s'* for singular, *'p'* for plural, *'1–7'* for cases, *'v'* for animate.

The entry *'rudar'* is also related to the relational adjective *'rudarski'*, and appears as a component of several terminological phrases, for example, rudar na okresivanju (ripper), rudar na uglju (collier), rudar-podgrađivač (timberman), and so forth.
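The grammatical codes attached to the inflected forms of *'rudar'* can be unpacked mechanically. The following is a minimal sketch, not the SrpMD implementation: the function names are hypothetical, and only the values glossed in the text are mapped, so any other value (such as the number value *'w'* in *mw2v*) is passed through unchanged.

```python
# Illustrative decoder for grammatical codes such as "ms1v".
# Only the values glossed in the text are mapped; anything else
# (for example the number value "w") is returned as-is.
GENDER = {"m": "masculine"}
NUMBER = {"s": "singular", "p": "plural"}
ANIMACY = {"v": "animate"}


def decode(code: str) -> dict:
    """Split a four-character code into gender, number, case and animacy."""
    gender, number, case, animacy = code[0], code[1], code[2], code[3]
    return {
        "gender": GENDER.get(gender, gender),
        "number": NUMBER.get(number, number),
        "case": int(case),
        "animacy": ANIMACY.get(animacy, animacy),
    }


def decode_form(codes: str) -> list[dict]:
    """Decode the colon-separated codes attached to one inflected form."""
    return [decode(c) for c in codes.split(":")]


# A few forms of 'rudar' with the codes listed in the text
forms = {
    "rudar": "ms1v",
    "rudare": "mp4v:ms5v",
    "rudarima": "mp3v:mp6v:mp7v",
}
for form, codes in forms.items():
    print(form, decode_form(codes))
```

A single surface form can carry several readings, which is why each form maps to a list rather than to one analysis.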

Over the past years, more entries related to raw material were added to SrpMD, which initially contained more than 3000 simple-word entries and 2000 multi-word entries from the raw material domain. The number of their morphological forms recorded in this resource is significantly larger. The simple-word forms pertaining to raw material terminology that have been processed and included in SrpMD [34] enabled further extraction of related terminological phrases according to the methodology described in [19]. Namely, for extraction to be effective, it is very important that the domain is relatively well covered with simple domain-specific words.

#### **3. Resource Preparation**

Preparation of resources is aimed at expanding and enriching available digital resources. These activities are not to be understood as one-time only activities, as each of them can be repeated periodically, when new opportunities for resource enrichment appear.

#### *3.1. Dictionary (Retro)Digitisation*

In order to expand and enrich the available digital resources, a number of paper dictionaries were digitised in the preparatory phase. After scanning, OCR and transformation to MS Word, with preservation of formats (bold, italic), manual correction was performed. The Word documents were then parsed by a procedure that was fine-tuned for each dictionary according to its structure. Parsed data were finally transformed to structured formats, Excel and XML, before being imported into the internal relational database. The procedure will be illustrated on one multilingual dictionary (MD) and one monolingual dictionary (DMMRT).

The digitisation and parsing of MD produced 16,491 term entries (examples of term entries are given in Figure 1), where Serbian terms were aligned with one or more English term equivalents (the remaining 3 languages were also stored in the database, but they were not used in this approach).


**Figure 1.** Examples of scanned Mining dictionary entries.

The majority of dictionary entries (15,016) contained only one Serbian term, but there were 1355 entries with two terms, and 120 with 3–5 terms, resulting in a total of 18,092 Serbian terms, of which 16,916 are distinct. As to the English part of the dictionary, there were 13,163 entries with one term, 2553 with two terms and 775 with 3–8 terms, resulting in a total of 20,878 English terms, of which 17,774 are distinct.

Raw material terminology, akin to general technical terminology, contains a large number of multi-component terms. In the dataset obtained from the dictionary, 23% of English entries are single-word terms, 50% are two-component terms, 18% have three components and the remaining 9% have four or more. As for Serbian entries, 22% are one-component terms, 47% have two components, 17% have three, and the remaining 14% have four or more. The majority of English multi-component terms are noun compounds. These linguistic constructions are most often composed of two nouns, for example, *'coal waste'* (*'jalovina'*), *'waste dump'* (*'odlagalište jalovine'*), *'gas pressure'* (*'pritisak gasa'*). However, they can also contain three, four or more nouns, for example, *'gas protection apparatus'* (*'lična zaštitna sredstva od gasova'*), *'mud circulation pressure hose'* (*'isplačno crevo'*).

Given the frequency of multi-component terms, an analysis of translational equivalents in English and Serbian was performed in terms of the number of their components. It was found that in 20% of cases both translational equivalents have one component, in 31% of cases both have two components, in 15% of cases the Serbian term has one component more than the English term, while in 13% of cases the English term has one component more, in 5% of cases the Serbian term has two components more, and in 3% of cases English has two components more. All other cases cover the remaining 13% of cases.
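The component-count comparison above can be sketched in a few lines. This is only an illustration under a simplifying assumption: components are counted as whitespace-separated tokens, which may differ from the counting used in the actual analysis, and the function names are hypothetical. The pairs are examples quoted earlier in this section.

```python
from collections import Counter


def component_count(term: str) -> int:
    """Count components as whitespace-separated tokens (a simplification)."""
    return len(term.split())


def difference_profile(pairs):
    """Tally Serbian-minus-English component differences over term pairs."""
    return Counter(component_count(sr) - component_count(en) for en, sr in pairs)


# (English, Serbian) pairs quoted in the text
pairs = [
    ("coal waste", "jalovina"),
    ("waste dump", "odlagalište jalovine"),
    ("gas pressure", "pritisak gasa"),
    ("mud circulation pressure hose", "isplačno crevo"),
]
profile = difference_profile(pairs)
print(profile)
```

A key of 0 means both equivalents have the same number of components, while a negative key means the English term is longer.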

Entries in DMMRT have one or more senses for each term, each described by a definition and labeled by small letters *a*, *b*, *c*, ..., *u*. Each individual sense can be related to one or more other terms in the dictionary, and it can be followed by its bibliographic source. Digitization of DMMRT yielded 28,757 terms with a total of 37,188 sense definitions, where 24,115 terms have only one sense, 2942 have 2, 890 have 3, 641 have 4–6, 139 have 7–10, and 34 have 11–21. The most polysemous word is *'head'* with 21 senses, followed by *'drift'* and *'bottom'* with 20 senses. Types of relations between entries can be: See (4090), See also (3983), CF (compare, 1824), Ant (antonym, 20), Etymol. (etymology, 130), Syn: or syn. (synonym, 2532), Abbrev. (abbreviation, 77), etc. Figure 2 presents the entry *'accessory plate'* with five senses, marked by letters a–e. Two senses (a and e) are related to other dictionary terms (a to *'quartz wedge'* by CF, and e to three synonyms and two other terms by CF), and two senses (b and c) are followed by their source (Pryor).

**Figure 2.** An example of scanned entry from DMMRT.

As to the components of the terms in DMMRT, 37% of the total terms are single-word terms, 50% are two-component terms, 10% have three components and the remaining 3% have 4–7 components. Comparison with the English part of MD shows a similar pattern, as the percentage of two-component terms is equal, while the share of one-component terms in MD is 14 percentage points lower (23% versus 37%).

An additional 19 dictionaries from the raw material and related domains were digitized, parsed and stored in the database, adding 63,571 new entries. Five monolingual English dictionaries from the mining domain produced 5933 entries, three bilingual mining English-Serbian dictionaries produced 24,049 entries, three monolingual English dictionaries covering terminology from the mine safety domain contributed 655 entries, and an English-Serbian dictionary of terminology in the field of waste management yielded 1968 entries. Dictionaries from related domains were also included, namely four English dictionaries producing 21,448 entries and three bilingual dictionaries producing 9518 entries.

One of the observations, even before this research started, was that several terms in paper dictionaries are not in use anymore. That observation initiated frequency calculation of Serbian terms in the mining corpus. Frequency in the corpus and the number of dictionaries that attest a term were the main criteria for the post-editing priority of the term.

Entries from all digitized dictionaries were stored in the same database, but in different structures, which correspond to their original data schema, and with reference to the original source. All of the structures can, in general, be mapped to the union of the structures of the two dictionaries presented in more detail, MD and DMMRT. Thus, a terminological entry in the common database can consist of a headword (list), rarely part-of-speech, equivalent(s) in other language(s), usually one, but sometimes more, labeled senses that include definitions, occasionally synonyms and abbreviations, links to other entries, bibliography, rarely specific domain.

#### *3.2. Corpora Enlargement*

The monolingual corpus of texts from the mining domain and related research work, which comprised 172 documents (in Serbian) with 2.7 million words in first release [31], was subsequently enlarged with 63 documents. The current version has 4.1 million words, covering project documentation (26%), legislation (11%), doctoral dissertations (31%), textbooks and other mining literature (32%).

The bilingual corpus of texts aligned on the sentence level was produced from the bilingual digital library Bibliša. The initial set of 55 documents containing 4831 aligned Serbian-English sentences [29] was enlarged with 44 new documents containing 12,657 aligned sentences from the raw material and energy domains.

The crucial linguistic preprocessing steps within corpora enlargement are part-of-speech tagging and lemmatization. Part-of-speech tagging represents an automatic text annotation process in which words or tokens are marked by part-of-speech tags, which typically correspond to the main syntactic categories in a language (e.g., noun, verb). Lemmatization is the process by which inflected forms of a lexeme are grouped together under a base dictionary form. The Serbian corpus and the Serbian part of the bilingual corpus are tagged and lemmatized using a customised tagger [35], while the English part of the bilingual corpus is tagged by TreeTagger [36,37].

Texts included in corpora are also processed using electronic dictionaries and local grammars. It is important to note that text processing and related mining vocabulary expansion is an iterative process. Namely, among other tasks, corpora are used for extraction of mining terminology, definitions and usage examples by applying different methods and tools.

#### *3.3. Adding New Serbian Terms to General Lexica Dictionaries*

Terminology from digitized dictionaries of raw material terminology in Serbian was checked by SrpMD and the corpus from the mining domain, for possible adding to SrpMD. We will illustrate this procedure by the results obtained from MD. The Serbian part of MD that contains headwords was transformed into a text, which was then analysed by SrpMD. Out of 12,655 different single words found in the text produced from the dictionary, 9758 were recognized by SrpMD. Among the 2897 (23%) that were not recognised, there were some acronyms (e.g., *'pH'*, *'RR'*, *'LD'*, *'TV'*), names (e.g., *'Western'*, *'Bets'*, *'Reni'*), archaisms (e.g., *'abanje'* instead of *'habanje'* (wear and tear), *'bolcn'* instead of *'zavrtanj'* (screw), etc.), as well as some OCR errors (despite manual check-up). Based on this analysis, a set of candidates for new entries into SrpMD were prepared (e.g., *'degazacija'* (degassing), *'eksploatabilan'* (exploitable), *'sabirnik'* (busbar), etc.). Each candidate was further checked against the mining corpus, and if the result (basically, its frequency) was satisfactory, it was added to the SrpMD.
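The candidate-selection step described above amounts to a lookup followed by a frequency check. The sketch below is a simplified illustration, not the procedure actually used: the function name, the toy data and the `min_freq` value are assumptions (the text does not state what frequency counts as satisfactory), and a real check would match lemmas rather than raw word forms.

```python
from collections import Counter


def select_candidates(dictionary_words, known_words, corpus_tokens, min_freq):
    """Return unrecognised dictionary words that are frequent enough in the corpus."""
    corpus_freq = Counter(corpus_tokens)
    unrecognised = sorted(set(dictionary_words) - set(known_words))
    return [(w, corpus_freq[w]) for w in unrecognised if corpus_freq[w] >= min_freq]


# Toy data: the counts below are invented for illustration only.
dictionary_words = ["rudar", "degazacija", "sabirnik", "bolcn"]
known_words = {"rudar"}  # stand-in for a lookup in SrpMD
corpus_tokens = ["degazacija"] * 5 + ["sabirnik"] * 2 + ["rudar"] * 9
candidates = select_candidates(dictionary_words, known_words, corpus_tokens, min_freq=2)
print(candidates)
```

In this toy run the archaism *'bolcn'* is dropped because it never occurs in the corpus, which mirrors the role corpus frequency plays in filtering outdated dictionary terms.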

The same procedure was applied to other dictionaries with Serbian entries. While the comprehensive terminological dictionaries (such as MD) contained a lot of simple words that were missing in SrpMD, smaller dictionaries, as expected, included frequently used terms that were mostly already in SrpMD. Thus, for example, in Electropedia 13% of words were not recognized by SrpMD, while in the Serbian part of the English-Serbian dictionary of terminology in the field of waste management 6% of words were not recognized. In all other dictionaries the percentage of unrecognized words was between 3% and 5%, but whether they would be included into SrpMD depended on their frequency in the mining corpus.

Besides the digitized dictionaries, the Serbian corpus and the Serbian part of the bilingual corpus from the mining domain were yet another source of new raw material domain terms that did not exist in SrpMD. Extraction of simple words was relatively simple, namely, words that were not recognized by SrpMD were scrutinized, and if frequent enough, they became candidates for being added to SrpMD. Besides, less than 4% of words in the monolingual mining corpus were unrecognised by SrpMD, where approximately 1.3% out of these 4% were proper candidates to be added to SrpMD, the remaining unrecognized words being variables from equations (0.7%), acronyms (1%), low frequency (hapax and typos—0.5%), foreign names and words (0.5%).

However, when it comes to terms in the form of terminological phrases, their extraction from corpora becomes much more complicated. Automatic extraction of term candidates for Serbian relies on a procedure presented in [30,34]. Essentially, it is based on detecting words in corpora that follow one of the 23 specific syntactic patterns, most frequent for noun terms (AN adjective-noun, NNg noun-noun in genitive case, AAN, ... ). The first step in this task is to recognise and extract Serbian terminological phrases from the corpus using syntactic patterns, and calculate their frequency. Frequency was the main parameter for determining the rank of a terminological phrase as a candidate for processing for SrpMD. However, other measures of association, such as T-Score, Keyness, Log-likelihood, were also used, as described in detail in [30]. The task then proceeds by lemmatization of candidate terminological phrases, disambiguation for terminological phrases where more lemmas can be produced, and ends by production of the final lemma, which enables production of all inflected forms for each terminological phrase.
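To make the pattern-based step concrete, the following sketch matches two of the patterns named above (AN and NNg) over a tagged token sequence. It is a toy illustration only: the tag names, the token layout and the function names are assumptions, and the actual procedure in [30,34] covers 23 patterns and additional constraints (such as agreement) that are not modelled here.

```python
from collections import Counter

# Each token is (word form, POS tag, case); this tag set is an assumption.
Token = tuple[str, str, str]


def matches_an(a: Token, b: Token) -> bool:
    """AN: an adjective followed by a noun."""
    return a[1] == "A" and b[1] == "N"


def matches_nng(a: Token, b: Token) -> bool:
    """NNg: a noun followed by a noun in the genitive case."""
    return a[1] == "N" and b[1] == "N" and b[2] == "gen"


PATTERNS = {"AN": matches_an, "NNg": matches_nng}


def extract_bigrams(tokens: list[Token]) -> Counter:
    """Count two-word candidates that match any pattern in PATTERNS."""
    found = Counter()
    for a, b in zip(tokens, tokens[1:]):
        for name, test in PATTERNS.items():
            if test(a, b):
                found[(name, f"{a[0]} {b[0]}")] += 1
    return found


tokens = [
    ("kvalitet", "N", "nom"), ("uglja", "N", "gen"),
    ("i", "CONJ", "-"),
    ("fosilno", "A", "nom"), ("gorivo", "N", "nom"),
]
candidates = extract_bigrams(tokens)
print(candidates)
```

Longer patterns such as AAN would need a sliding window of three or more tokens, which this two-token sketch does not attempt.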

As in the case of single terms, frequency for terminological phrases was also calculated for each single-word component of the phrase, but for its lemma, not for the exact inflected form. Having in mind free word order in terminological phrases, we were looking for a measure looser than exact match. For each terminological phrase the following information is stored: minimum, average and maximum frequency of its components, and the number of "known" components (words recognized by SrpMD). Frequency in the corpus and the number of dictionaries that attest a term are the main criteria for the post-editing priority of the term.
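The per-phrase information listed above can be computed as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, the lemma frequencies are invented, and the set of known lemmas stands in for a lookup in SrpMD.

```python
def component_stats(phrase_lemmas, lemma_freq, known_lemmas):
    """Summarise a phrase by the corpus frequencies of its component lemmas."""
    freqs = [lemma_freq.get(lemma, 0) for lemma in phrase_lemmas]
    return {
        "min": min(freqs),
        "avg": sum(freqs) / len(freqs),
        "max": max(freqs),
        "known": sum(1 for lemma in phrase_lemmas if lemma in known_lemmas),
    }


# Invented frequencies, for illustration only
lemma_freq = {"kvalitet": 1500, "ugalj": 4200}
known_lemmas = {"kvalitet", "ugalj"}  # stand-in for SrpMD
stats = component_stats(["kvalitet", "ugalj"], lemma_freq, known_lemmas)
print(stats)
```

Because the statistics are computed over lemmas of the components rather than over the phrase as a string, they are insensitive to word order and inflection.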

For this paper, extraction of Serbian terminological phrases was performed with a frequency threshold of 10, and 12,632 candidate phrases were produced in lemmatized form. Frequency of each terminological phrase was calculated as the sum of frequencies of all its inflected forms. For example, *'kvalitet uglja'* (coal quality) has a frequency of 1110 as a sum of frequencies of its forms: *'kvalitet uglja'* (172), *'kvaliteta uglja'* (587), *'kvalitetom uglja'* (284), *'kvalitetu uglja'* (53), *'kvalitete uglja'* (8), *'kvaliteti uglja'* (2), *'kvalitetima uglja'* (4). Six most productive patterns, which produced 92% of candidates, are listed with examples and their frequencies:


Evaluation follows, where the following is checked: is the extracted candidate a terminological phrase, which domain (mining, technical, etc.) and possibly subdomain it belongs to. If the domain or subdomain are identified, the appropriate semantic markers are assigned to the terminological phrase. After the evaluation process, all correctly evaluated terminological phrases were prepared for insertion into the terminological database Termi.
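The frequency of a terminological phrase, defined earlier in this section as the sum over its inflected forms, can be reproduced with the counts reported for *'kvalitet uglja'*. The sketch below is illustrative: the function name and the form-to-lemma mapping are assumptions, and it assumes that the threshold of 10 is applied to the summed frequency.

```python
from collections import defaultdict


def lemma_frequencies(form_counts, form_to_lemma, threshold=10):
    """Sum inflected-form counts per lemma and keep lemmas at or above threshold."""
    totals = defaultdict(int)
    for form, count in form_counts.items():
        totals[form_to_lemma[form]] += count
    return {lemma: n for lemma, n in totals.items() if n >= threshold}


# Form counts for 'kvalitet uglja' as reported in the text
form_counts = {
    "kvalitet uglja": 172,
    "kvaliteta uglja": 587,
    "kvalitetom uglja": 284,
    "kvalitetu uglja": 53,
    "kvalitete uglja": 8,
    "kvaliteti uglja": 2,
    "kvalitetima uglja": 4,
}
# Hypothetical mapping: in practice it comes from phrase lemmatization.
form_to_lemma = {form: "kvalitet uglja" for form in form_counts}
totals = lemma_frequencies(form_counts, form_to_lemma)
print(totals)
```

Summing over forms matters here: three of the seven forms fall below the threshold on their own, yet all of them contribute to the total of 1110.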

#### *3.4. Adding Bilingual Terms*

Bilingual lists of terms were considered a valuable resource in our approach, and they were generated from two sources, namely, by retrieval from the bilingual MD and by extraction from the aligned bilingual corpus.

Term entries from MD were parsed and only those that were confirmed by the mining corpus (monolingual or bilingual) were selected. As mentioned before, one term entry can comprise more terms (single or multi word) and confirmation for each term was looked for.

A total of 10,059 term entries from MD were retrieved, with sets of English terms aligned with sets of Serbian terms. The majority of them were subsequently marked by domain (24 different), subdomain (15) and semantic markers (35) as mentioned in Section 3.1. All markers used are subsets of the markers (data category values) in SrpMD.

Bilingual terminology was extracted from the aligned bilingual domain corpus described in Section 3.2 using terminology extractors for Serbian and English, and Bilte [38], a tool for chunk alignment [39,40]. The method combines the approach with existing domain terminology lexicons with term extraction tools. For English, FlexiTerm [41] was used with threshold 3 and TermSuite [42] with threshold 4, based on the experience from other domains and the fact that they use different linguistic filtering. A total of 8456 term candidates for English were selected. For Serbian, the same shallow parser was used as in the case of monolingual extraction (Section 3.3), as well as the same calculation of termhood, a frequency-based measure, which qualified 7825 candidates as terms.

Monolingual lists of extracted terms were further expanded by terms retrieved from digitized dictionaries yielding 94,539 English terms and 48,096 Serbian terms. Some terms were found in both datasets: extracted from text and retrieved from dictionaries, namely, a total of 2285 English and 308 Serbian terms.

The GIZA++ [43] and Moses toolkit [44] for statistical machine translation (SMT) were used for word alignment. Aligned chunks, presented in the so-called phrase table, are obtained as output from Moses, together with their phrase translation scores. After pruning the phrase table with the threshold probability of 0.85, the remaining chunks were lemmatized and further filtered to select those in which both parts of the pair contain a candidate term from the raw material domain. More details about options and the procedure are available in [40]. The output of this phase contained 8202 Serbian-English pairs as term candidates whose English part was confirmed and 3605 where both language parts were confirmed. In the first step, candidates that were found in digitized dictionaries, or were already assessed as terms, were automatically confirmed, while the remaining candidate pairs had to be inspected manually, which yielded a list of 2737 term pairs. General terms, such as *'red' (row)*, *'kompozicija' (composition)*, *'din' (dinar)*, *'minimalan' (minimum)*, *'izvor informacija' (source of the information)*, etc., were excluded, as well as those wrongly aligned, such as *'naftovod' (pipeline oil)*, *'mreža' (telephone network)*, *'deponija' (deposit)*, *'oblik poklopca' (shape of the cover)*, etc. A wider set of terms will be evaluated in the near future.
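The pruning step can be sketched as a filter over phrase-table lines. The example below assumes the plain-text Moses phrase-table layout, in which fields are separated by `|||` and the third field holds the scores. Which of the scores was compared against 0.85 is not stated in the text, so `score_index` is an assumption, as are the function name and the sample lines.

```python
def prune_phrase_table(lines, threshold=0.85, score_index=2):
    """Keep phrase pairs whose selected score is at least `threshold`.

    Assumes lines of the form 'source ||| target ||| scores ||| ...'.
    The choice of score (score_index) is an assumption of this sketch.
    """
    kept = []
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 3:
            continue  # skip malformed lines
        scores = [float(s) for s in fields[2].split()]
        if scores[score_index] >= threshold:
            kept.append((fields[0], fields[1], scores[score_index]))
    return kept


# Invented sample lines, for illustration only
sample = [
    "kvalitet uglja ||| coal quality ||| 0.90 0.80 0.92 0.75 ||| 0-1 1-0 ||| 10 10 9",
    "mreža ||| telephone network ||| 0.20 0.10 0.30 0.05 ||| 0-1 ||| 50 5 2",
]
pairs = prune_phrase_table(sample)
print(pairs)
```

The surviving pairs would then go through lemmatization and the term-candidate filter described above, which this sketch leaves out.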

For evaluation of bilingual candidates, besides frequencies for single terms, we have also used a heuristic for evaluating terminological phrases based on the following observations. The last noun in English noun compounds, which represent the majority of English terminological phrases, as a rule, is the head word carrying the basic meaning, while the preceding nouns are narrowing this meaning, that is, behaving like adjectives. The meaning of a noun compound in English thus flows from right to left, but the Serbian translational equivalent cannot be formed analogously, namely, by a sequence of corresponding Serbian nouns. Thus, within the analysis, the most frequent constructions used as Serbian translational equivalents for English noun+noun compound were determined:


This heuristic was used to select the most promising candidates among the extracted bilingual terminological phrases.

As in the case of multilingual terms and terminological phrases, after the evaluation process, all correctly evaluated bilingual terms were prepared for insertion into the terminological database Termi. So far, more than 3000 term-to-term pairs were inserted. In this process they were merged to form synonymous sets (synsets) by using information from existing dictionaries and simple rules, such as: if two English terms are translated by the same Serbian term they are candidates for synonyms.
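The simple rule quoted above (two English terms translated by the same Serbian term are synonym candidates) can be expressed as a grouping step. This is a minimal sketch with a hypothetical function name and invented pairs, and it only proposes candidates, which would still need confirmation.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(pairs):
    """Propose English synonym candidates that share a Serbian translation."""
    by_serbian = defaultdict(set)
    for english, serbian in pairs:
        by_serbian[serbian].add(english)
    candidates = set()
    for english_terms in by_serbian.values():
        for a, b in combinations(sorted(english_terms), 2):
            candidates.add((a, b))
    return sorted(candidates)


# Illustrative (English, Serbian) pairs; not taken from Termi
pairs = [
    ("coal waste", "jalovina"),
    ("tailings", "jalovina"),
    ("gas pressure", "pritisak gasa"),
]
result = synonym_candidates(pairs)
print(result)
```

The same grouping applied in the other direction would propose Serbian synonym candidates for a shared English term.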

#### **4. Terminology Aggregation and Presentation**

#### *4.1. Data Integration Procedure: The Pipeline*

The main goal of our approach is to merge and link all available terms in the raw material domain into one lexicon structure, within the terminological database Termi and as linguistic linked data available via a SPARQL endpoint, in the first place by aligning as many term entries as possible from dictionaries and other resources covering raw material domain terminology. Besides the aim of aggregating terms from different resources, one of the reasons for alignment of terms from multiple dictionaries (paper and electronic) was to assess term usage, which determines its importance for raw material terminology. On the other hand, alignment of terms with SrpMD was necessary, since these dictionaries are a base resource for lemmatization and multiword term extraction. Since SrpMD are already in the lexical database Leximirka [32], developed and managed by the same research team, this type of alignment was possible.

Figure 3 presents an outline of the pipeline for termbase population, which starts with collecting and preparing research papers, project documentation, and textbooks in Serbian for the monolingual corpus and aligning English-Serbian texts for the bilingual parallel corpus. Also, paper dictionaries, both monolingual and bilingual are digitized, parsed and stored in an auxiliary database as structured data in XML format.

**Figure 3.** The pipeline for terminology compilation (termbase population).

Compiled resources also comprise monolingual lists derived from all available resources, interlinked with their source entries, for example Serbian list from Serbian monolingual dictionaries and Serbian part of bilingual dictionaries. Translation equivalents are retrieved from bilingual dictionaries and within the word alignment phase (more in Section 4.2), keeping again information about the original dictionary source.

Extracted terms were also subject to a labeling procedure, which we will illustrate here on the example of MD. Out of 16,491 entries obtained from MD, 12,018 (73%) were manually classified and markers for domain and subdomain, as well as semantic labels, were assigned to them. The remaining 4473 (27%) unclassified entries included words from general lexica and some rarely used terms. The classified entries are mostly from the mining domain, more precisely, there are 4793 (40%) entries common for different areas of mining. The basic vocabulary from related domains is also included, for example, 2398 (20%) entries related to geology, hydrogeology and geography, 860 (7%) entries related to transport, rock mechanics, surveying, environment protection, safety, construction and electrical engineering, while 3082 (26%) entries belong to the general technical terminology. There are also entries from basic science, for example, 885 (7%) terms related to biology, chemistry, mathematics, informatics and physics.

Among entries from the mining domain, those related to a specific subdiscipline of mining were identified by mining experts, and marked by a subdomain marker, as for example, entries related to mineral processing (251), transport (243), or underground mining (469). Additional semantic labels were also assigned, for example, material (699), device (536), machine (384), mineral (313), facility (288), instrument (279), etc.

The part-of-speech was semi-automatically assigned, where only 40 entries were marked as adjectives, 250 as verbs, and all others as nouns.

Lexical entry alignment with DMMRT is performed using terms on the English side of the MD. Since one English term can have several senses, such alignments are marked for manual filtering. An indicator is used for status: automatic relation or manually evaluated.

A terminological dictionary must accompany each entry with a scientifically and lexicographically correct definition [45]. There are very few such dictionaries in the Serbian language, as most of the published Serbian terminological dictionaries are only translational (bilingual or multilingual). An ongoing activity is the adaptation of English definitions, which are the most comprehensive in DMMRT, to Serbian, in the post-editing phase, where priority is given to the most frequent terms, both in the corpora and in the dictionaries.

Finally, candidates are harmonised and assembled to the microstructure of the lexical database Termi, which consists of a headword, synonyms, abbreviations, definition, for each language, bibliographic source and possibility to include illustration and other external content. Term entries in Termi are organised into a hierarchical structure, and additional relations between entries are envisaged, but still not implemented. Automatic hierarchical positioning was based on subdomain and semantic markers, but it is subject to repositioning in the post-editing phase.

Information integration beyond the level of individual dictionaries and across the language resource community has become an important concern, and the most promising technology to achieve this goal is to adopt the Linked (Open) Data (LOD) paradigm for publishing lexical resources, that is, to use URIs for unambiguously identifying lexical entries, their components and their relations in the web of data—to make lexical datasets accessible via http(s), to publish them in accordance with W3C-standards such as RDF and SPARQL, and to provide links between lexical data sets and with other LOD resources [46].

In our research we were also aiming at compatibility with the Linked Data approach, using its set of design principles for sharing machine-readable interlinked data on the Web. This vision of globally accessible and linked data on the internet is based on RDF standards of the semantic web, using RDF serialisation for data representation. To that end, our approach envisages export of lexical database data in RDF that is compliant with the *OntoLex Lemon Lexicography Module* [47], lexicog [48], as an extension of Lexicon Model for Ontologies (lemon) [49,50]. This is also in line with activities within NexusLinguarum COST action [51], which promotes synergies across Europe between linguists, computer scientists, terminologists, language professionals, and other stakeholders in industry and society, in order to investigate and extend the area of linguistic data science. An example of RDF export is presented in Figure 4 followed by the Turtle RDF Syntax [52] to illustrate the use of the model.

**Figure 4.** The graph for the translation of lexical entries: *'fossil fuel'*-*'fosilno gorivo'*.

```
# Prefix declarations for the namespaces used below. The empty prefix is a
# placeholder namespace, not the project's actual base URI; the lexinfo
# version (2.0) is likewise an assumption.
@prefix : <http://example.org/termi#> .
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix vartrans: <http://www.w3.org/ns/lemon/vartrans#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix dct: <http://purl.org/dc/terms/> .
:fossil_fuel a ontolex:LexicalEntry;
   dct:language <http://lexvo.org/id/iso639-1/en> ;
   lexinfo:partOfSpeech lexinfo:noun;
   ontolex:lexicalForm :fossil_fuel-form;
   ontolex:sense :fossil_fuel_sense.
:fossil_fuel-form a ontolex:Form;
   ontolex:writtenRep "fossil fuel"@en.
:fossil_fuel_sense skos:definition "coal, oil, gas, oil sands or oil shale"@en;
   ontolex:reference <https://dbpedia.org/page/Fossil_fuel>;
   ontolex:reference <https://www.wikidata.org/wiki/Q12748>;
   ontolex:reference <http://eurovoc.europa.eu/6045>.
:fosilno_gorivo a ontolex:LexicalEntry;
   dct:language <http://id.loc.gov/vocabulary/iso639-1/sr> ;
   lexinfo:partOfSpeech lexinfo:noun;
   ontolex:lexicalForm :fosilno_gorivo-form;
   ontolex:sense :fosilno_gorivo_sense.
:fosilno_gorivo-form a ontolex:Form;
   ontolex:writtenRep "fosilno gorivo"@sr.
:fosilno_gorivo_sense skos:definition "ugalj, nafta, gas, naftni pesak ili uljni škriljci"@sr;
   ontolex:reference <https://www.wikidata.org/wiki/Q12748>.
:trans_fossil_fuel_sense-fosilno_gorivo_sense a vartrans:Translation;
       vartrans:source :fossil_fuel_sense;
       vartrans:target :fosilno_gorivo_sense;
       vartrans:category
           <http://purl.org/net/translation-categories#directEquivalent>.
```
Further details related to the above example, namely the novel module for frequency, attestation and corpus information (FrAC) [53], are described in the next section.
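The export step behind the Turtle listing can be illustrated with a minimal Python sketch. It is not the exporter used for Termi: the function names `turtle_literal` and `entry_to_turtle`, their parameters, and the flat record layout are assumptions made for illustration only. The sketch emits the same triple pattern as the listing (entry, form, sense with a definition) and escapes quotes and line breaks so that each literal stays on a single line. Prefix declarations are omitted, as in the listing.

```python
def turtle_literal(text, lang):
    """Return a single-line, language-tagged Turtle string literal."""
    escaped = (
        text.replace("\\", "\\\\")   # escape backslashes first
            .replace('"', '\\"')      # then double quotes
            .replace("\n", "\\n")     # keep the literal on one line
    )
    return f'"{escaped}"@{lang}'


def entry_to_turtle(local_id, lang, lang_iri, written_rep, definition, pos="noun"):
    """Serialise one lexical entry following the pattern of the listing.

    All argument names are hypothetical; prefixes (ontolex:, lexinfo:, dct:,
    skos: and the default ':') are assumed to be declared elsewhere.
    """
    lines = [
        f":{local_id} a ontolex:LexicalEntry;",
        f"   dct:language <{lang_iri}> ;",
        f"   lexinfo:partOfSpeech lexinfo:{pos};",
        f"   ontolex:lexicalForm :{local_id}-form;",
        f"   ontolex:sense :{local_id}_sense.",
        f":{local_id}-form a ontolex:Form;",
        f"   ontolex:writtenRep {turtle_literal(written_rep, lang)}.",
        f":{local_id}_sense skos:definition {turtle_literal(definition, lang)}.",
    ]
    return "\n".join(lines)


turtle = entry_to_turtle(
    "fossil_fuel",
    "en",
    "http://lexvo.org/id/iso639-1/en",
    "fossil fuel",
    "coal, oil, gas, oil sands or oil shale",
)
print(turtle)
```

Generating the text with string templates keeps the sketch dependency-free; a production export would more likely build the graph with an RDF library and let it handle serialisation.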

#### *4.2. Dictionary Examples and Frequencies*

None of the dictionaries we have used contain examples of term usage. Our intention was to select actual terms that can be found in domain texts and to link usage samples to both monolingual and bilingual term entries. Previous (and current) practice in Serbian lexicography has relied on retrieving example candidates and definitions manually from different online sources and printed material (over a number of years), but it is evident that a more systematic, corpus-evidence-based approach is needed.

A method for the selection of good examples for Serbian terms was developed based on feature extraction web services and knowledge retrieved from the SASA Dictionary as the Gold Standard for Good Dictionary Examples (GDEX) for Serbian [54]. The method is based on a detailed analysis of various lexical and syntactic characteristics of examples in published dictionaries. The initial set of functions was inspired by a similar approach for other languages. The distribution of the characteristics of examples from this corpus is compared with the distribution of the same characteristics in sample sentences extracted from a corpus that contains different texts. The approach was adapted to work for English as well and to be applied to bilingual aligned sentences. For ranking, we have used a weighted score derived from lexical features (e.g., sentence length, number of all non-space characters, digits, unusual characters, commas, full stops, punctuation, number of all tokens, average token length, maximum token length, sentences between 15 and 40 tokens, ... ), word-based features (e.g., number of words, capitalised words, ... ) and other features (e.g., average frequency in corpus, number of stop words, proper names, pronouns). New features were introduced for bilingual examples, for example, the difference in sentence length measured in words, so that examples in which the sentence is short in one language and long in the other are avoided. An example containing terms as key words in context in English and Serbian, sentence examples and calculated features is:

109867|7.2011.60.8|7.2011.60.8\_n44|Fossil fuel|Fosilno gorivo|Carbon emissions from sources other than fossil fuel combustion are now incorporated in the National Footprint Accounts.|Emisije ugljenika iz drugih izvora, ne samo iz sagorevanja fosilnih goriva sada su ubeležene u Izveštaje o nacionalnoj stopi emisije zagadenja.|120|104|0|37|0|1|1|True|18|5.778|12|True|True|True|True|17| 6.0588|12|2|3|0.0|7|145|124|0|52|1|1|3|False|23|5.392|12|True|True|True|True| 20|5.5|11|1|1|10955.428|7
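To make the feature-based ranking more concrete, the following Python sketch computes a handful of the lexical features named above and combines them into a weighted score. It is an illustration under stated assumptions, not the method's implementation: the tokenisation rule, the function names, and in particular the weights are hypothetical, since the actual weights are not given here. For the English sentence of the row above, the sketch yields 120 characters, 104 non-space characters, 18 tokens and an average token length of 5.778, which agrees with the corresponding fields of that row.

```python
import re

# Assumed tokenisation: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")
WORD_RE = re.compile(r"\w+")


def lexical_features(sentence):
    """Compute a small subset of the lexical and word-based features."""
    tokens = TOKEN_RE.findall(sentence)
    words = WORD_RE.findall(sentence)
    return {
        "length_chars": len(sentence),
        "non_space_chars": sum(1 for c in sentence if not c.isspace()),
        "n_tokens": len(tokens),
        "avg_token_len": round(sum(map(len, tokens)) / len(tokens), 3),
        "max_token_len": max(map(len, tokens)),
        "n_words": len(words),
        "in_range": 15 <= len(tokens) <= 40,  # sentence of 15 to 40 tokens
    }


def word_length_difference(source, target):
    """Bilingual feature: absolute difference in sentence length in words."""
    return abs(len(WORD_RE.findall(source)) - len(WORD_RE.findall(target)))


def weighted_score(features, weights):
    """Linear combination of the selected features (weights are hypothetical)."""
    return sum(w * float(features[name]) for name, w in weights.items())


english = ("Carbon emissions from sources other than fossil fuel combustion "
           "are now incorporated in the National Footprint Accounts.")
feats = lexical_features(english)

# Hypothetical weights: reward sentences in the preferred length range,
# mildly penalise long average token length.
example_weights = {"in_range": 1.0, "avg_token_len": -0.1}
score = weighted_score(feats, example_weights)
print(feats, score)
```

A linear combination keeps the contribution of each feature easy to inspect; the bilingual length difference can be added to the same score with a negative weight to penalise unbalanced sentence pairs.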

For entries with no examples in the bilingual corpus, monolingual examples were extracted from the Serbian mining corpus. Apart from offering preselected examples, it is important to enable the user to browse the concordances for a lemma, as well as syntactic patterns, as presented in the next section in Figure 5.


**Figure 5.** The Leximirka app for lexical database management.

Relative frequency (normalized per million tokens) is assigned to terms both in the mining corpus (as the domain-specific corpus) and in the corpus of standard Serbian (as the reference corpus), in order to calculate the so-called keyness score, which is expected to represent the extent of the frequency difference.
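The two quantities can be sketched in a few lines of Python. The exact keyness formula is not stated here, so the sketch assumes the "simple maths" ratio documented for Sketch Engine, in which a smoothing constant is added to both per-million frequencies before dividing; the function names, the default constant and the counts in the usage lines are illustrative assumptions.

```python
def relative_frequency(count, corpus_size):
    """Frequency normalised per million tokens."""
    return count * 1_000_000 / corpus_size


def keyness(fpm_focus, fpm_reference, smoothing=1.0):
    """Assumed 'simple maths' keyness: ratio of smoothed per-million frequencies.

    The smoothing constant avoids division by zero for terms that are absent
    from the reference corpus; larger values favour more frequent terms.
    """
    return (fpm_focus + smoothing) / (fpm_reference + smoothing)


# Illustrative counts only, not taken from the corpora described here.
fpm_domain = relative_frequency(50, 1_000_000)       # domain (mining) corpus
fpm_reference = relative_frequency(40, 10_000_000)   # reference corpus
print(keyness(fpm_domain, fpm_reference))
```

A score well above 1 indicates that the term is markedly more frequent in the domain corpus than in the reference corpus.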

Frequency information is a crucial component in human language technology, so the FrAC module includes terminology to capture such information, in order to facilitate sharing and utilising this valuable information [53]. The Sketch Engine API [55,56] is used for the calculation of frequencies, for word-sketch retrieval with collocations, and for the thesaurus with association measures for related words (Statistics used in the Sketch Engine [57,58]). The Python script, prepared in the form of a Jupyter notebook, was published on GitHub [57]. Current work of the OntoLex group is focused on modelling word embeddings, collocations and similar words, and we will add this feature when it becomes stable. An example of an OntoLex-Lemon frequency and attestation snippet is:

```