*Article* **Improved Large Covariance Matrix Estimation Based on Efficient Convex Combination and Its Application in Portfolio Optimization**

**Yan Zhang <sup>1</sup>, Jiyuan Tao <sup>2</sup>, Zhixiang Yin <sup>1</sup> and Guoqiang Wang <sup>1,\*</sup>**


**Abstract:** The estimation of the covariance matrix is an important topic in the field of multivariate statistical analysis. In this paper, we propose a new estimator, which is a convex combination of the linear shrinkage estimation and the rotation-invariant estimator under the Frobenius norm. We first obtain the optimal parameters by using grid search and cross-validation, and then, we use these optimal parameters to demonstrate the effectiveness and robustness of the proposed estimation in numerical simulations. Finally, in the empirical research, we apply the covariance matrix estimation to portfolio optimization. Compared with existing estimators, the proposed estimator achieves better performance and lower out-of-sample risk in portfolio optimization.

**Keywords:** covariance matrix estimation; shrinkage transformations; rotation-invariant estimator; portfolio optimization

**MSC:** 90C25; 62P05; 62P20

**Citation:** Zhang, Y.; Tao, J.; Yin, Z.; Wang, G. Improved Large Covariance Matrix Estimation Based on Efficient Convex Combination and Its Application in Portfolio Optimization. *Mathematics* **2022**, *10*, 4282. https:// doi.org/10.3390/math10224282

Academic Editors: Xiang Li, Shuo Zhang and Wei Zhang

Received: 24 October 2022 Accepted: 13 November 2022 Published: 16 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

With the development of information technology, covariance matrix estimation plays a crucial role in multivariate statistical analysis, and it is widely used in many fields, such as finance, wireless communications, biology, chemometrics, social networks, health sciences, etc. [1–5]. Financial data, in particular, are not characterized by multivariate normality and stationarity, and the sample covariance matrix estimated from them is highly noisy [6]. As an essential input to many financial models, it is vital to remove the sample noise to improve the estimation accuracy of the covariance matrix in asset allocation and risk management [4,7,8]. It is known that the sample covariance matrix is no longer a good estimator of the population covariance matrix when the dimension of the matrix is close to or larger than the sample size; in fact, the sample covariance matrix becomes singular for high-dimensional data. The so-called "high-dimensional" settings mainly include those with a large matrix order and a high data dimension [1–3]. So far, some popular ways to obtain a good estimator are shrinkage estimation methods without prior information, sparse estimation methods with prior information, the factor model [9,10], the rank model [11], etc.

The shrinkage method, proposed by Stein [12], estimates the covariance matrix without prior information. The essential idea of this method is to pull extreme eigenvalues of the sample covariance matrix toward the mean of the eigenvalues by shrinking them when the dimension of the matrix is close to the sample size. Ledoit and Wolf showed that shrinkage estimation methods improve over the sample covariance matrix; specifically, they showed that shrinkage estimation provides a good remedy for the overfitting of the sample covariance matrix [8,13,14].

Since the linear shrinkage method is a first-order approximation to a nonlinear problem, it is no longer suitable for improving the sample covariance matrix as the dimension of the matrix becomes high. Thus, Ledoit and Wolf proposed the nonlinear shrinkage method [15], which performs better under high-dimensional asymptotics. Recently, they proposed optimal nonlinear shrinkage estimators [16], which are decision-theoretically optimal within a class of nonlinear shrinkage estimators. For more details on shrinkage methods, refer to [2,3,17,18].

Sparse estimation with prior information is another approach to estimating the covariance matrix, which estimates the sparse matrix directly and its inverse indirectly. For direct estimation, Bickel and Levina [19] showed that the estimation can be obtained by thresholding methods under the hypothesis that the true covariance matrix is sparse. Without assuming a particular sparsity pattern, Rothman et al. [20] proposed a new class of generalized thresholding estimators that obtain a sparse estimation by inducing sparsity through a norm penalty. Theoretically, these methods are shown to be superior, and the generalized thresholding estimators are consistent over a large class of approximately sparse covariance matrices. However, the resulting estimators are not always positive-definite. To guarantee the positive definiteness of the covariance matrix estimation, Rothman et al. [21] built a convex optimization model based on the quadratic loss function under the Frobenius norm (*F*-norm) and studied the estimation of the high-dimensional covariance matrix. Subsequently, convex optimization models with penalty functions such as the *L*<sup>1</sup> penalty were proposed [22,23], and some nonconvex penalty functions were used to achieve both sparsity and positive definiteness [24,25]. However, since the changes of the variance and covariance over time are not considered, these methods suffer from the curse of dimensionality and large-noise problems. For more details about optimization algorithms and inverse matrix estimation methods, refer to the literature [7,26–31] and the references therein.

In addition, Bun et al. [32] introduced the rotation-invariant estimation, in which the estimator of the population correlation matrix is assumed to share the same eigenvectors as the sample covariance matrix itself. Experimental results demonstrated that the rotation-invariant estimator is more suitable for dealing with large-dimensional datasets than the eigenvalue clipping methods and improves significantly over the sample covariance matrix as the data size grows, but it does not perform well on small samples. In a recent paper [33], Deshmukh et al. combined the shrinkage transformation with eigenvalue clipping to obtain a covariance matrix estimator as a convex combination with optimal parameters. This estimator can achieve less out-of-sample risk in portfolio optimization for small datasets.

The research in this paper was mainly motivated by [32,33], and the novelties of this study are as follows:


The rest of this paper is organized as follows: Section 2 describes the related work of covariance matrix estimation. Section 3 introduces our new proposed estimator and its application. Section 4 implements the numerical simulation and empirical research. Section 5 gives the conclusions.

#### **2. Preliminaries**

#### *2.1. The Rotation-Invariant Estimator*

First, we briefly introduce the basic idea of the rotation-invariant estimator. For more details, we refer to [32].

Let *r* = (*r*<sub>1</sub>, *r*<sub>2</sub>, ..., *r<sub>N</sub>*) denote a *T* × *N* matrix of *T* independent and identically distributed (iid) observations on a system of *N* random variables, where *N* and *T* denote the number of variables and the number of observations, respectively. Let *µ<sub>i</sub>* denote the mean of the *i*th variable, *y<sub>i</sub>* = *r<sub>i</sub>* − *µ<sub>i</sub>* the demeaned *i*th column, and *Y* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y<sub>N</sub>*). In this case, the sample covariance matrix is given by

$$\Sigma_{SCM} = \left(\sigma_{ij}\right) = \frac{1}{T-1} Y^{'} Y. \tag{1}$$
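As a sanity check, the estimator in Eq. (1) can be computed in a few lines of NumPy (an illustrative sketch, not the authors' code; rows are treated as observations and the normalization is by *T* − 1):

```python
import numpy as np

def sample_covariance(R):
    """Sample covariance matrix, Eq. (1): rows of R are the T observations,
    columns are the N variables; normalization is by T - 1."""
    T = R.shape[0]
    Y = R - R.mean(axis=0)      # demean each column (variable)
    return Y.T @ Y / (T - 1)    # N x N estimator

rng = np.random.default_rng(0)
R = rng.standard_normal((500, 100))   # T = 500 observations, N = 100 variables
S = sample_covariance(R)
```

This agrees with `np.cov(R, rowvar=False)`, which uses the same *T* − 1 normalization.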

Let *N* and *T* be asymptotic in the high-dimensional regime, i.e.,

$$N \asymp T. \tag{2}$$

In addition, the concentration ratio is given by

$$
c = \frac{N}{T}. \tag{3}
$$

The construction steps of the rotation-invariant estimator are as follows:

• **Step 1**: Calculate the Stieltjes transform of the empirical spectral measure of *S*<sup>1</sup> from

$$s(z) = \frac{1}{T} \operatorname{Tr}\left(S_1 - zI\right)^{-1}, \tag{4}$$

where *z* denotes the spectral parameter and

$$S_1 = \frac{1}{T-1} Y Y^{'} \tag{5}$$

is the *T* × *T* companion matrix of the demeaned data *Y*.

The function (4) contains all the information about the eigenvalues of the matrix *S*1, which has the same nonzero eigenvalues as Σ*SCM*.

• **Step 2**: Update (4) based on the nonzero eigenvalues of *Y*′*Y* and *YY*′, i.e.,

$$s(z) = \frac{1}{T}\left( \sum_{i=1}^{N} \frac{1}{\lambda_i - z} - \frac{T-N}{z} \right), \tag{6}$$

where *y<sub>i</sub>* = *r<sub>i</sub>* − *µ<sub>i</sub>*, *Y* = (*y*<sub>1</sub>, *y*<sub>2</sub>, ..., *y<sub>N</sub>*), and *λ<sub>i</sub>* denotes the *i*th eigenvalue of the sample covariance matrix Σ*SCM*.

• **Step 3**: Calculate the quantity *δ̂<sub>i</sub>* for the *i*th eigenvalue *λ<sub>i</sub>* from

$$
\widehat{\delta}_i = \frac{1}{\lambda_i \left| s(\lambda_i + i\eta) \right|^2}, \tag{7}
$$

where *s*(·) is the empirical Stieltjes transform from (6) and the parameter *η* = *T*<sup>−1/2</sup>.

• **Step 4**: Output the resulting covariance matrix estimator Σ*RIE* from

$$
\Sigma_{RIE} = U_N \hat{D}_N U_N^{'}, \tag{8}
$$

where

$$
\hat{D}\_N = \text{Diag}(\hat{\lambda}\_1, \hat{\lambda}\_2, \dots, \hat{\lambda}\_N), \tag{9}
$$

*U<sub>N</sub>* is an orthogonal matrix whose columns [*u*<sub>1</sub>, *u*<sub>2</sub>, ..., *u<sub>N</sub>*] are the corresponding eigenvectors of Σ*SCM*, and the eigenvalues of the rotation-invariant estimator are defined by

$$
\widehat{\lambda}_i = \frac{\sum_{j=1}^{N} \lambda_j}{\sum_{j=1}^{N} \widehat{\delta}_j}\, \widehat{\delta}_i. \tag{10}
$$

One can easily verify that

$$\sum\_{i=1}^{N} \widehat{\lambda\_i} = \sum\_{i=1}^{N} \lambda\_i. \tag{11}$$

This implies that the estimator Σ*RIE* has the same trace as the sample covariance matrix. More literature reviews on rotation-invariant estimators are presented in Table 1.
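Steps 1–4 above can be sketched directly in NumPy (an illustrative transcription, not the authors' code, under the assumption *N* < *T* so that all sample eigenvalues are positive; the bandwidth *η* = *T*<sup>−1/2</sup> follows Step 3):

```python
import numpy as np

def rie_estimator(R, eta=None):
    """Rotation-invariant estimator, Steps 1-4, assuming N < T.
    R is a T x N matrix with observations in rows."""
    T, N = R.shape
    Y = R - R.mean(axis=0)
    S = Y.T @ Y / (T - 1)                  # sample covariance (N x N)
    lam, U = np.linalg.eigh(S)             # eigenvalues and eigenvectors

    eta = T ** -0.5 if eta is None else eta  # Step 3 bandwidth

    # Step 2: Stieltjes transform of the T x T companion matrix, whose
    # spectrum is {lam_1, ..., lam_N} plus T - N zero eigenvalues, Eq. (6).
    def s(z):
        return (np.sum(1.0 / (lam - z)) - (T - N) / z) / T

    # Step 3: cleaned spectral quantities delta_i, Eq. (7)
    delta = np.array([1.0 / (l * abs(s(l + 1j * eta)) ** 2) for l in lam])

    # Step 4: rescale so that the trace is preserved, Eqs. (10)-(11)
    lam_hat = delta * lam.sum() / delta.sum()
    return U @ np.diag(lam_hat) @ U.T

rng = np.random.default_rng(1)
R = rng.standard_normal((200, 50))         # T = 200, N = 50
Sigma_RIE = rie_estimator(R)
```

The rescaling in the last step reproduces the trace-preservation property of Eq. (11).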



#### *2.2. Improved Covariance Estimator Based on Eigenvalue Clipping*

Deshmukh et al. [33] introduced an improved estimation based on eigenvalue clipping, which selects the optimal parameters of a convex combination of the sample covariance matrix Σ*SCM*, the shrinkage target Σ*F*

$$
\Sigma_F = (f_{ij}), \tag{12}
$$

with

$$f_{ij} = \begin{cases} \dfrac{2\sqrt{\sigma_{ii}\sigma_{jj}}}{N(N-1)} \displaystyle\sum_{k=1}^{N-1} \sum_{l=k+1}^{N} \frac{\sigma_{kl}}{\sqrt{\sigma_{kk}\sigma_{ll}}}, & i \neq j, \\ \sigma_{ii}, & i = j, \end{cases} \tag{13}$$

and the matrix Σ*MP* obtained by applying eigenvalue clipping.
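The target of Eqs. (12) and (13) is the constant-correlation target: every off-diagonal correlation is replaced by the average sample correlation, scaled back to a covariance, while the diagonal keeps the sample variances. A minimal NumPy sketch (illustrative, not the authors' code):

```python
import numpy as np

def constant_correlation_target(S):
    """Shrinkage target of Eqs. (12)-(13): off-diagonal entries use the
    average sample correlation; the diagonal keeps the sample variances.
    S is the N x N sample covariance matrix."""
    N = S.shape[0]
    sd = np.sqrt(np.diag(S))                 # standard deviations
    corr = S / np.outer(sd, sd)              # sample correlation matrix
    # average of the N(N-1)/2 distinct off-diagonal correlations
    r_bar = (corr.sum() - N) / (N * (N - 1))
    F = r_bar * np.outer(sd, sd)             # constant-correlation part
    np.fill_diagonal(F, np.diag(S))          # f_ii = sigma_ii
    return F
```

For example, for S = [[4, 1], [1, 9]] the single off-diagonal correlation is 1/6, so the target has f₁₂ = (1/6)·2·3 = 1 and keeps the variances 4 and 9 on the diagonal.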

Let the entries of *y<sub>i</sub>* = *r<sub>i</sub>* − *µ<sub>i</sub>* be independent, identically distributed random variables with finite variance *σ*<sup>2</sup>. The Marchenko–Pastur density *ρ*<sub>Σ*SCM*</sub>(*λ*) of the eigenvalues of Σ*SCM* is defined by

$$
\rho_{\Sigma_{SCM}}(\lambda) = \frac{1}{N} \frac{dn(\lambda)}{d\lambda}, \tag{14}
$$

where *n*(*λ*) is the number of eigenvalues of the sample covariance matrix Σ*SCM* less than *λ*. In the limit *N* → ∞, *T* → ∞ with the ratio *c* = *N*/*T* ≤ 1 held fixed, the density (14) becomes

$$\rho\_{\Sigma\_{\text{SCM}}}(\lambda) = \frac{1}{2\pi c\sigma^2} \frac{\sqrt{(\lambda\_{\text{max}} - \lambda)(\lambda - \lambda\_{\text{min}})}}{\lambda},\tag{15}$$

where

$$
\lambda\_{\max} = \sigma^2 (1 + c + 2\sqrt{c}), \quad \lambda\_{\min} = \sigma^2 (1 + c - 2\sqrt{c}). \tag{16}
$$

The interval [*λmin*, *λmax*] represents the MP law bounds. In this case, the covariance matrix can be cleaned by rescaling the eigenvectors of Σ*SCM* with the new eigenvalues; Σ*MP* denotes the estimator obtained by this method.
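Eigenvalue clipping under the MP bounds of Eq. (16) can be sketched as follows (an illustrative sketch with two stated assumptions: *σ*² is proxied by the average eigenvalue, and the bulk eigenvalues are flattened to their mean so that the trace is preserved):

```python
import numpy as np

def mp_clipping(S, T):
    """Eigenvalue clipping: eigenvalues of S inside the MP bounds
    [lam_min, lam_max] of Eq. (16) are treated as noise and replaced by
    their average; eigenvalues outside the bounds are kept."""
    N = S.shape[0]
    c = N / T                                    # concentration ratio, Eq. (3)
    lam, U = np.linalg.eigh(S)
    sigma2 = lam.mean()                          # proxy for sigma^2 (assumption)
    lam_max = sigma2 * (1 + c + 2 * np.sqrt(c))  # Eq. (16)
    lam_min = sigma2 * (1 + c - 2 * np.sqrt(c))
    noise = (lam >= lam_min) & (lam <= lam_max)
    lam_clip = lam.copy()
    if noise.any():
        lam_clip[noise] = lam[noise].mean()      # flatten the noise bulk
    return U @ np.diag(lam_clip) @ U.T
```

Because the bulk is replaced by its own mean, the trace of the cleaned matrix equals the trace of the sample covariance matrix.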

Let Σ be the population covariance matrix; the optimal parameters of the convex combination can be found from the following optimization problem [33]:

$$\min_{\theta, \phi} \quad \left\| \Sigma - \Sigma_{est} \right\|_F \tag{17}$$

$$\text{s.t.} \begin{cases} \Sigma_{est} = \phi(\theta \Sigma_F + (1 - \theta)\Sigma_{MP}) + (1 - \phi)\Sigma_{SCM}, \\ 0 \le \theta \le 1, \; 0 \le \phi \le 1, \end{cases} \tag{18}$$

where *θ* and *φ* are the parameters of the convex combination.

Usually, the eigenvectors of the sample covariance matrix deviate from those of the population covariance matrix under large-dimensional asymptotics, and correcting the deviation of the sample eigenvalues can improve the performance of a large covariance matrix estimator. Although parameter optimization allows the estimation to adapt to changing sampling-noise conditions, it outperforms the other estimations only for small-dimensional problems.

#### **3. Proposed Estimator and Application in Portfolio Optimization**

#### *3.1. Proposed Estimator*

To further improve the performance for large covariance matrices, in this paper we replace the eigenvalues falling inside the Marchenko–Pastur (MP) law bounds with those of the rotation-invariant estimator Σ*RIE* and apply the linear shrinkage estimation to shrink the eigenvalues falling outside the MP law bounds. Our new estimation is presented below.

$$\min_{\theta, \phi} \quad \left\| \Sigma - \Sigma_{est} \right\|_F \tag{19}$$

$$\text{s.t.} \begin{cases} \Sigma_{est} = \phi(\theta \Sigma_F + (1 - \theta)\Sigma_{RIE}) + (1 - \phi)\Sigma_{SCM}, \\ 0 \le \theta \le 1, \; 0 \le \phi \le 1. \end{cases} \tag{20}$$

Thus, the estimation of the covariance matrix is given by

$$
\Sigma^* = \phi^*(\theta^* \Sigma_F + (1 - \theta^*)\Sigma_{RIE}) + (1 - \phi^*)\Sigma_{SCM}, \tag{21}
$$

where *θ*<sup>∗</sup> and *φ*<sup>∗</sup> are the optimal parameters of the optimization problem given by (19) and (20).

It is well known that financial data are heavy-tailed and non-normal [7,33]. However, the existing covariance matrix estimation methods generally require the assumption of normality [32,38]. To overcome this drawback, we propose a new estimator, which is a convex combination of the linear shrinkage estimation and the rotation-invariant estimator under the Frobenius norm. One advantage of the new estimation is that it removes the noise caused by both the bulk eigenvalues and the extreme eigenvalues in the financial data. Furthermore, we used five-fold cross-validation (*k* = 5) in the simulation and empirical research to improve the accuracy of the covariance matrix estimation.

The detailed steps of our new estimation are as follows.


$$\Delta_k^{(\theta_i, \phi_j)} = \left\| \Sigma_{est}^{(\theta_i, \phi_j)} - \Sigma \right\|_F,$$

for *θ<sup>i</sup>* , *φ<sup>j</sup>* , *i* = 1, 2, ..., *M*, *j* = 1, ..., *P*, and let *k* = *k* + 1.


$$
\Delta_{sum}^{(\theta_i, \phi_j)} = \sum_{k=1}^{5} \Delta_k^{(\theta_i, \phi_j)}.
$$

• **Step 6**: Output the proposed estimation Σ<sup>∗</sup> from (21).
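The five-fold grid search described in the steps above can be sketched as follows (illustrative, not the authors' code; `estimators` is a hypothetical helper that returns (Σ_F, Σ_RIE, Σ_SCM) for a training block, and the population Σ is known because the data are simulated):

```python
import numpy as np
from itertools import product

def cv_grid_search(R, Sigma, estimators, k=5, grid=None):
    """Five-fold grid search for (theta*, phi*) in (19)-(20).
    R: T x N return matrix; Sigma: population covariance (known in
    simulation); estimators: callable mapping a training block to the
    triple (Sigma_F, Sigma_RIE, Sigma_SCM)."""
    grid = np.linspace(0, 1, 11) if grid is None else grid
    idx = np.arange(R.shape[0])
    folds = np.array_split(idx, k)
    err = {}
    for theta, phi in product(grid, grid):
        total = 0.0
        for fold in folds:
            train = np.delete(idx, fold)             # leave one fold out
            F, RIE, SCM = estimators(R[train])
            est = phi * (theta * F + (1 - theta) * RIE) + (1 - phi) * SCM
            total += np.linalg.norm(est - Sigma, 'fro')   # Delta_k
        err[(theta, phi)] = total                          # Delta_sum
    return min(err, key=err.get)                           # (theta*, phi*)

# toy check: with identity targets and Sigma = I, phi* = 1 is optimal
rng = np.random.default_rng(3)
R = rng.standard_normal((50, 5))
Sigma = np.eye(5)
est_fn = lambda X: (np.eye(5), np.eye(5), np.cov(X, rowvar=False))
theta_star, phi_star = cv_grid_search(R, Sigma, est_fn)
```

In the toy check both targets already equal Σ, so the search pushes all weight onto them (φ* = 1); in the actual simulation the three matrices come from the constructions of Sections 2 and 3.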

#### *3.2. Minimum Variance Portfolio Optimization*

According to Markowitz's theory [39], we included an additional return constraint in the portfolio because even a risk-averse investor would expect a minimal positive return. The classic portfolio optimization model that satisfies the minimum expected return is defined by

$$\min_{\mathbf{x}} \quad \mathbf{x}^{'} \Sigma \mathbf{x} \tag{22}$$

$$\text{s.t.} \begin{cases} \mathbf{1}^{'}\mathbf{x} = 1, \\ r^{'}\mathbf{x} \ge r_{min}, \\ \mathbf{x} \ge \mathbf{0}, \end{cases}$$

where *x*, *r*, and *rmin* represent the weight vector of the portfolio, the daily return vector, and the minimum daily expected return, respectively. Portfolio selection is widely used in the financial field, and this model is a convex quadratic programming problem [39].
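Since model (22) is a convex quadratic program, it can be handled by any QP solver; below is a minimal sketch using SciPy's SLSQP routine (an illustration, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import minimize

def min_variance_weights(Sigma, r, r_min):
    """Minimum-variance model (22): min x' Sigma x  subject to
    1'x = 1, r'x >= r_min, x >= 0."""
    N = Sigma.shape[0]
    constraints = [
        {'type': 'eq',   'fun': lambda x: np.sum(x) - 1.0},  # budget
        {'type': 'ineq', 'fun': lambda x: r @ x - r_min},    # return floor
    ]
    res = minimize(lambda x: x @ Sigma @ x,                  # portfolio variance
                   np.full(N, 1.0 / N),                      # equal-weight start
                   method='SLSQP',
                   bounds=[(0.0, 1.0)] * N,                  # no short selling
                   constraints=constraints)
    return res.x

# toy example: two uncorrelated assets with variances 1 and 2
Sigma = np.diag([1.0, 2.0])
x = min_variance_weights(Sigma, r=np.array([0.001, 0.002]), r_min=0.0)
```

For the toy example the analytic optimum is x = (2/3, 1/3): minimizing w² + 2(1 − w)² gives w = 2/3.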

In portfolio optimization, the weight of each asset is closely related to the covariance matrix; an accurate covariance matrix yields a more reasonable weight distribution and a better portfolio. Due to the heavy-tailed nature of financial data and the availability of only limited samples [8], many studies have concentrated on the global minimum variance (GMV) portfolio. To improve the performance of the sample covariance matrix in portfolio optimization, DeMiguel [40] added an additional constraint regularizing the asset-weight vector in the minimum variance portfolio and showed that the resulting portfolio achieves a smaller variance and a higher Sharpe ratio than other portfolios. Furthermore, Ledoit and Wolf [18] applied their estimation to portfolio optimization to overcome the dimension and noise problems of a high-dimensional covariance matrix, with results better than the linear shrinkage estimation [13]. Moreover, because financial market information influences covariance matrix estimation, the time-varying covariance matrix or correlation matrix also has practical significance in portfolio optimization. For more details, refer to [41] and the references therein.

In this paper, we divided the sample data into in-sample data and out-of-sample data, which were used for the estimation and the prediction of the covariance matrix, respectively. To measure the out-of-sample performance of the covariance matrix estimation in portfolio optimization, we used the out-of-sample risk, the average return, and the Sharpe ratio as the measurement criteria. The average return was annualized by multiplying it by 252 (252 trading days per year), and the standard deviation was annualized by multiplying it by √252. The out-of-sample performance of the portfolio model was evaluated through the following procedure.


We compute the out-of-sample risk:

$$
\sigma_{out} = \mathrm{var}\big((x^*)^{'} r_{out}\big),
$$

the average return:

$$r_{ave} = E\big((x^*)^{'} r_{out}\big),$$

and the Sharpe Ratio:

$$SR = \frac{r_{ave} - r_f}{\sigma_{out}},$$

where *x*<sup>∗</sup> and *r<sub>f</sub>* represent the optimal weight vector and the risk-free interest rate, respectively.
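The three criteria can be computed as below (an illustrative sketch; following the annualization described above, the Sharpe ratio denominator uses the annualized standard deviation of the daily portfolio returns):

```python
import numpy as np

def out_of_sample_metrics(x_star, R_out, r_f=0.0175):
    """Annualized out-of-sample metrics for optimal weights x_star on the
    out-of-sample return matrix R_out (T2 x N), using 252 trading days."""
    p = R_out @ x_star                        # daily portfolio returns
    r_ave = p.mean() * 252                    # annualized average return
    sigma_out = p.std(ddof=1) * np.sqrt(252)  # annualized std. deviation
    sr = (r_ave - r_f) / sigma_out            # Sharpe ratio
    return sigma_out, r_ave, sr
```

The default `r_f=0.0175` mirrors the 1.75% risk-free rate used in the empirical research below.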

#### **4. Numerical Simulation and Empirical Research**

#### *4.1. Numerical Simulation*

In the simulation, we used the simulated data of Engle et al. [41]; the mean returns ranged between −0.0031 and 0.0036. We divided the dataset into in-sample data and out-of-sample data, each with sample size *T* = 500. In pursuit of accuracy, we implemented five-fold cross-validation, with the *F*-norm distance between the estimator and the population covariance matrix as the parameter selection criterion. We set three dimensions for the return series: *N* = 100, 200, and 400. The maximum concentration ratio is

$$
c = \frac{N}{T} = \frac{400}{500} = 0.80. \tag{23}
$$

To measure the performance of the estimators, we compared the error between each estimator and the population covariance matrix. The six estimators are shown in Table 2.

**Table 2.** The estimators for comparison.


In the five-fold cross-validation, we obtained the error between the proposed estimation and the population covariance matrix for the different parameters *θ* and *φ* under three asset dimensions. In Figures 1–3, the horizontal and longitudinal axes represent the different values of *θ* and *φ*, respectively, and the vertical axis represents the sum of the error. It is obvious that there is a pair of optimal parameters minimizing the error between the proposed estimator and the population covariance matrix over all *θ* and *φ*. The results are shown in Table 3. To some degree, this ensures the effectiveness of the proposed estimation.

Table 4 shows that the *F*-norm error of our new estimation is the smallest among the six estimations. Under this premise, we calculated the portfolio variance in the minimum variance portfolio satisfying the minimum expected return of 0.0015. The results are shown in Table 5. Figures 4–6 show the mean return of the out-of-sample data from the 1st asset to the 400th asset. We mark individual points on the graphs; the horizontal and longitudinal axes represent the order of the asset among the total assets and the mean return of that asset, respectively. We can see that the mean returns of the marked points are relatively high. Generally speaking, higher asset returns come with relatively large investment risks. To aid the following description, we divided the returns into three asset risk grades: high (*rmin* ≥ 0.001), middle (0.0005 ≤ *rmin* < 0.001), and low (*rmin* < 0.0005). In Table 5, it is obvious that the variance of Σ*Identity* is the largest in all asset dimensions. In the case of *N* = 200, the asset weights of the portfolio model corresponding to Σ*Identity* are distributed on the 15th, 51st, 71st, 75th, and 138th assets in Figure 7, with 87% in high-risk assets and 13% in medium-risk assets. In the case of *N* = 400, however, the asset weights of the portfolio model corresponding to Σ*Identity* are distributed over twenty assets, with 60% in high-risk assets, 39.5% in medium-risk assets, and only 0.5% in low-risk assets; high-risk and medium-risk assets account for the vast majority of the 20 assets. Instead, the portfolio model corresponding to our new estimator Σ<sup>∗</sup> distributed 69% of the weight on high-risk assets and 26.12% on medium-risk assets to achieve the 0.0015 expected return, with the remaining weight placed on low-risk assets to reduce investment risk. The corresponding results are shown in Figures 8 and 9.
Overall, a reasonable distribution of asset weights across high-, medium-, and low-risk assets can appropriately decrease investment risks. As the number of assets increased, the performance of our new estimator Σ<sup>∗</sup> became better, and the proposed estimation in the minimum variance portfolio was more dispersed in its allocation of the assets.


**Table 3.** The optimal parameters *θ* and *φ* in convex combination and the sum of the corresponding error.

**Table 4.** The error between each estimator and the population covariance matrix under the optimal parameter.


**Table 5.** The variance comparison of six estimations in the minimum variance portfolio.


\* denotes the unit is 10−<sup>4</sup> .

**Figure 1.** The sum of the error of five-fold cross-validation between the proposed estimation and the population covariance matrix under the different *θ* and *φ* for *N* = 100.

**Figure 2.** The sum of the error of the five-fold cross-validation between the proposed estimation and the population covariance matrix under the different *θ* and *φ* for *N* = 200.

**Figure 3.** The sum of the error of the five-fold cross-validation between the proposed estimation and the population covariance matrix under the different *θ* and *φ* for *N* = 400.

**Figure 4.** The mean return of the out-of-sample data for *N* = 100.

**Figure 5.** The mean return of the out-of-sample data ranging from the 101st asset to the 200th asset.

**Figure 6.** The mean return of the out-of-sample data ranging from the 201st asset to the 400th asset.

**Figure 7.** The assets' weights of each estimation under the out-of-sample data for *N* = 200.

**Figure 8.** The assets' weights of each estimation under the out-of-sample data for *N* = 400.

**Figure 9.** The assets' weights of the identity matrix under the out-of-sample data for *N* = 400.

#### *4.2. Empirical Research*

The data of this paper came from the component stocks of the CSI500, HS300, and SSE50 indices on the tushare financial website. The whole sample period ran from 24 May 2017 to 1 July 2021. After removing stocks with missing transaction data, we finally obtained 426 component stocks of the CSI500, 218 component stocks of the HS300, and 41 component stocks of the SSE50.

In this paper, we set *T*<sup>1</sup> = 500 and *T*<sup>2</sup> = 500 as the window of estimation and prediction, respectively. The maximum concentration ratio is

$$
c = \frac{N}{T_1} = \frac{426}{500} = 0.852. \tag{24}
$$

We used the log return as the object of study and divided all samples into two parts for estimation and prediction. We constructed six portfolio optimization models by using the estimators in Table 2. The procedures of estimation and prediction are as follows.


In portfolio optimization, a reduction in volatility even at the first decimal place is considered quite significant [13,33]. Tables 6 and 7 show the performance of the six portfolio optimization models for different asset dimensions. For the out-of-sample data, we used the standard deviation as the performance metric. Furthermore, we calculated the average return and the Sharpe ratio in the portfolio optimization model (22), with the risk-free interest rate set to 1.75%.

Tables 6–8 show that the average returns of each estimator are equal. Table 6 shows that the standard deviation of the portfolio optimization model corresponding to Σ*Identity* is only 27.53%, the smallest among all estimators. The standard deviation corresponding to our new estimator Σ<sup>∗</sup> is 27.97%, ranking fourth among all estimators.

Table 7 shows the performance of the portfolio optimization model corresponding to each estimator with 218 assets. It can be seen that the performance of the estimator Σ*Identity* becomes weak. In this case, Σ*Identity* is the worst estimator, which is expected as it assumes zero correlations among stocks, and Σ*SCM* is the second-worst. The Sharpe ratio of the portfolio optimization corresponding to Σ*NL* is the highest among the estimators, followed by that of Σ*D*. At the same time, our new estimator Σ<sup>∗</sup> ranks third among all estimators. Compared with the case of *N* = 41, the performance of our new estimator improved as the asset dimension increased.

Table 8 compares the performance of each model with 426 assets. The results show that the portfolio optimization model corresponding to our new estimator Σ<sup>∗</sup> obtains the smallest standard deviation and, consequently, the highest Sharpe ratio. Compared with the other estimators, especially Σ*Identity*, our new estimator achieves a significant decrease in the out-of-sample standard deviation. This also implies that the performance of our new estimator Σ<sup>∗</sup> becomes superior to that of the other five estimators as the asset dimension increases.

The homogeneity-of-variance test measures whether the variances of two investment strategies are equal, and we used the improved bootstrap inference [44] to test for a significant difference between the variances of the excess returns of Σ<sup>∗</sup> and of the other estimations. Table 9 shows that the tests between Σ<sup>∗</sup> and the alternative methods all reject the null hypothesis of equal variances. Moreover, the sample variance of the excess return corresponding to our new estimator Σ<sup>∗</sup> is significantly lower than that of the other portfolio optimization models as the number of assets increases. Therefore, our new estimation Σ<sup>∗</sup> is superior to the other estimations.


**Table 6.** The out-of-sample performance comparison among the estimators for the 41 assets in SSE50.

∗ denotes the unit is %.

**Table 7.** The out-of-sample performance comparison among the estimators for the 218 assets in HS300.


∗ denotes the unit is %.

**Table 8.** The out-of-sample performance comparison among the estimators for the 426 assets in CSI500.


∗ denotes the unit is %.

**Table 9.** The difference in the out-of-sample variance between Σ<sup>∗</sup> and the alternative estimations with all assets (the significance level is 5%).


#### **5. Conclusions**

In this paper, we proposed a new estimator for the covariance matrix, which is a convex combination of the linear shrinkage estimation and the rotation-invariant estimator under the *F*-norm. We first obtained the optimal parameters through extensive numerical computation, focusing on the accuracy of the model rather than the computational complexity. Moreover, we demonstrated the effectiveness of the model in the simulation. Finally, we applied our estimation to the minimum variance portfolio optimization and showed that the proposed estimator outperforms the other five existing estimators in portfolio optimization for high-dimensional data.

In addition, we only considered the sample noise on the covariance matrix in this article; in the financial field, however, the covariance matrix varies with time under the influence of market information. Therefore, the performance of our new estimation in the dynamic conditional correlation model [41] can be investigated as part of future work for large-dimensional covariance matrices.

**Author Contributions:** Conceptualization, Y.Z.; Fund acquisition, G.W. and Z.Y.; methodology, Y.Z. and J.T.; supervision, G.W. and Z.Y.; writing—original draft preparation, Y.Z.; writing—review and editing, Y.Z., J.T. and G.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (Nos. 11971302 and 12171307).

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data of this paper came from the component stock of CSI500, HS300, and SSE50 on the tushare financial website accessed on 1 July 2021: https://www.tushare.pro.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Ahmed Hamza Osman \* and Hani Moaiteq Aljahdali**

Department of Information System, Faculty of Computing and Information Technology in Rabighn, King Abdulaziz University, Jeddah 21911, Saudi Arabia

**\*** Correspondence: ahoahmad@kau.edu.sa

**Abstract:** Plagiarism is an act of intellectual high treason that damages the whole scholarly endeavor. Many attempts have been undertaken in recent years to identify text document plagiarism. The effectiveness of researchers' suggested strategies to identify plagiarized sections needs to be enhanced, particularly when semantic analysis is involved. The Internet's easy access to and copying of text content is one factor contributing to the growth of plagiarism. The present paper relates generally to text plagiarism detection. It relates more particularly to a method and system for semantic text plagiarism detection based on conceptual matching using semantic role labeling and a fuzzy inference system. We provide an important arguments nomination technique based on the fuzzy labeling method for identifying plagiarized semantic text. The suggested method matches text by assigning a value to each phrase within a sentence semantically. Semantic role labeling has several benefits for constructing semantic arguments for each phrase. The approach proposes nominating for each argument produced by the fuzzy logic to choose key arguments. It has been determined that not all textual arguments affect text plagiarism. The proposed fuzzy labeling method can only choose the most significant arguments, and the results were utilized to calculate similarity. According to the results, the suggested technique outperforms other current plagiarism detection algorithms in terms of recall, precision, and F-measure with the PAN-PC and CS11 human datasets.

**Keywords:** similarity; plagiarism; semantic; SRL; fuzzy labeling

**MSC:** 68P20; 68P10; 63E72; 68U15

**Citation:** Osman, A.H.; Aljahdali, H.M. Important Arguments Nomination Based on Fuzzy Labeling for Recognizing Plagiarized Semantic Text. *Mathematics* **2022**, *10*, 4613. https://doi.org/10.3390/math10234613

Academic Editors: Xiang Li, Shuo Zhang and Wei Zhang

Received: 20 October 2022 Accepted: 30 November 2022 Published: 5 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

The evolution of, and rapid access to information through, the Internet has contributed to various data protection and ethical problems. "The act of using another person's words or ideas without giving credit to that person" is known as plagiarism. It can range from basic copy-paste, in which information is simply copied, to higher levels of complexity, in which the text is disguised through sentence restructuring, translation, idea adoption, etc. [1]. Plagiarism is thus more versatile, and more nuanced, than trivial copying and pasting. The following forms of plagiarism are typically distinguished: straight-line plagiarism, basic footnote plagiarism, nuanced footnote plagiarism, quotation-free plagiarism, and paraphrasing. The plagiarism problem encompasses plagiarized media, magazines, and Internet tools. Longitudinal research has been undertaken to reveal students' hidden patterns of plagiarism and to analyze academics' experiences of plagiarism. Plagiarism detection tasks can broadly be divided into two categories, namely extrinsic detection and intrinsic detection [2–4]. In extrinsic detection, the suspected document is compared with a sample that is either offline or online, whereas in intrinsic detection, structural and stylometric information is used to evaluate a document that contains no record of a reference source. Many online plagiarism inspectors use a method that normally consists of World Wide Web surveys, and studies indicate that the most commonly available tools to detect plagiarism cannot detect structural changes and common paraphrases imposed by the users who plagiarize [3]. Our empirical research has shown that university teachers want computerized approaches to detect the plagiarism of ideas. It is crucial to assess the quality of various academic activities, including theses, dissertations, journal articles, conference papers, essays, assignments, and so on.

A method called paraphrasing may be used to change the organization of a statement or swap some of the original words with synonyms. It is also plagiarism if there are no correct citations or quotation marks. Due to the changes in fingerprints between the original and copied documents, the approaches utilized in existing detection technologies are unable to identify this kind of plagiarism. These cases are far more difficult to identify, since semantic plagiarism is frequently a blurry process that is challenging to find and even more challenging to curtail. One of the key problems in the field of plagiarism detection is how to effectively and efficiently distinguish between plagiarized and non-plagiarized papers [5–7].

Some pirated studies such as [8] at least mention the original version. However, manually checking for plagiarism in a suspicious paper is a very challenging and time-consuming task involving various source materials [9]. It is therefore thought of as a big advance in this regard to use computer systems that can perform the procedure with the least amount of user interaction. The technologies for detecting plagiarism that have been proposed so far are highly capable of catching various types of plagiarism; however, detecting whether plagiarism is present in a text still relies on human monitoring [10].

Comparing the copied document with the original document is a common practice in plagiarism detection approaches. Character-matching strategies can be used to identify either completely or partially identical strings. The currently used technique for paraphrasing acquisition uses machine learning and crowdsourcing [11]. This approach focuses on the following two issues: gathering data through crowdsourcing and gathering samples at the passage level. The crowdsourcing paradigm is ineffective without automatic quality assurance and, without crowdsourcing, the cost of building test corpora is too high for practical use. Additionally, a citation-based approach is applied. This technique is employed to detect academic texts that have been read and utilized without citation [11]. The current work offers a method for human semantic plagiarism detection based on conceptual matching and arguments nomination using a fuzzy labeling method, which detects plagiarism through copy-and-paste, rewording or synonym replacement, changing word structure in sentences, and changing sentences from the passive to the active voice and vice versa.

Research addressing the automated detection of suspected plagiarism instances falls under the category of plagiarism detection methods. Methods for examining textual similarity at the lexical, syntactic, and semantic levels, as well as the similarity of non-textual content elements such as citations, illustrations, tables, and mathematical equations, are frequently presented in studies. Research that addresses the evaluation of plagiarism detection algorithms, such as by offering test sets and reporting on performance comparisons, was also examined, as it focuses mostly on gap filling. Studies on the prevention, detection, prosecution, and punishment of plagiarism in educational institutions fall under the category of plagiarism policy. This research analyzed the occurrence of plagiarism at institutions, examined student and teacher attitudes about plagiarism, and discussed the effects of institutional rules.

This research is interrelated and necessary to conduct a thorough analysis of the phenomenon of academic plagiarism. Without a strong structure that directs the investigation and documentation of plagiarism, using plagiarism detection tools in practice will be useless. Research and development efforts for enhancing plagiarism detection methods and systems are guided by the information gained from examining the application of plagiarism detection systems in practice.

In order to keep up with the behavior shifts that plagiarists typically demonstrate when faced with a higher chance of getting caught due to improved detection technologies and harsher techniques, ongoing study is required. This study is one of the methods used to bridge the research gaps in the field of text theft and plagiarism.

The remainder of the paper is organized as follows: related work on plagiarism detection is described in Section 2. Fuzzy logic is the subject of Section 3. Section 4 provides a detailed explanation of the method's basic concept. The experimental design employed in our suggested strategy is described in Section 5. Section 6 presents the corpus and dataset, as well as similarity detection, and the results and discussion are in Section 7.

#### **2. Related Works**

There are two stages to detecting plagiarism: source document retrieval (also known as candidate retrieval) and comprehensive comparison between the source document and the document under examination. In the last five years, many researchers have focused on the retrieval of sources and presented solutions for it, because of the recent breakthroughs in plagiarism detection.

Recently, two approaches to recognizing extrinsic plagiarism were suggested by Arabi and Akbari [12]. Both approaches use two stages of filtering based on the bag-of-words (BoW) technique at the document and sentence levels, and plagiarism is only investigated in the outputs of these two stages, in order to reduce the search space. In the first approach, two structural and semantic matrices are created using a mix of the WordNet ontology and the TF–IDF weighting methodology, as well as the pre-trained FastText word-embedding network. The TF–IDF weighting method is then used in the second approach to detect similarities in suspicious documents and sentences.

Research [13] has found that accessing plagiarism sources using external knowledge base sources increased semantic similarity and contextual importance. Other than examples where the text had been duplicated verbatim, the researchers employed a closest neighbor search and support vector machine to find potential candidates for other sorts of plagiarism. Using encoded fingerprints to create queries, a researcher has presented candidate retrieval for Arabic text reuse from online pages and provided the optimal selection of source documents [14]. Cross-lingual candidate retrieval utilizing two-level proximity information was suggested, in addition to prior work on candidate retrieval from the same language. With the suspect (or query) document segmented using a topic-based segmentation algorithm, the researchers next utilized a proximity-based model to find sources related to the segmented portions of the suspicious document. There is still room for improvement in the second phase of plagiarism detection, according to a study of the current trends in plagiarism detection research [12,15,16]. More languages and machine learning approaches need to be explored in the field of cross-language plagiarism detection, as shown by recent studies [17–19].

The detection of disguised plagiarism has been the subject of many studies [20–22]. WordNet-combined semantic similarity metrics were utilized to identify highly obfuscated plagiarism instances in handwritten paraphrases and simulated plagiarism cases [5–7,23–25]. Adding an intermediary phase between candidate retrieval and comprehensive analysis allowed for visual assessment of highly obfuscated plagiarisms, and this additional step included an expanded Jaccard measure to deal with synonyms/hypernyms in text fragments [5]. Studies have examined and evaluated approaches using both content-based and citation-based plagiarism detection in academic writing [26]. Citations and references were shown to be an effective addition to existing plagiarism detection techniques. Document plagiarism detection has been studied as a binary classification task in recent works [27]. Naive Bayes (NB), support vector machine (SVM), and decision tree (DT) classifiers have been used to determine whether or not suspicious-source document pairings included plagiarism [28]. Part of speech (POS) and chunk features were used to extract monolingual features from text pairings, concentrating on modest yet effective syntax-based features. When compared to traditional baselines, the suggested classifiers were shown to be more accurate in detecting plagiarism in English texts [28]. Genetic algorithms (GA) were utilized to identify disguised plagiarism in the form of summary texts using syntactic and semantic aspects. An algorithm based on the GA method was used to extract concepts at the sentence level [29,30]. Syntactic and semantic elements from the WordNet lexical database were used to include two detection levels, document-level and passage-level [30]. 
It was shown that a combined syntactic–semantic metric that incorporates additional characteristics such as chunking and POS tagging, as well as semantic role labeling (SRL) and its POS tagging variant, may better distinguish between various kinds of plagiarism. When it comes to spotting veiled plagiarism in a monolingual situation, deeper linguistic traits take center stage.

Paraphrasing is a technique that modifies or replaces some of the original words with synonyms by changing the structure of the sentence. Without a correct citation or quotation marks, it is also considered to be plagiarism. Owing to variation in the fingerprints between the original and plagiarized files, the methods used in existing detection tools cannot detect it, as described above. Such cases are much more difficult for people to spot, as semantic plagiarism is often a subtle process that is difficult to find and more difficult to stop, since it often crosses international borders. The plagiarism issue has given rise to a number of debates, involving intellectual property (IP), ethics, legal restrictions, and copyright. Intellectual property is a legal right to the productions of the mind, creative and economic, as well as the relevant legal fields. In particular, plagiarism is deemed wrong in a moral context, because the plagiarist takes the original author's ideas and content and tries to deny the author's contribution by failing to include proper citations or quotations. More legal restrictions are therefore necessary if the original author is to be able to claim their specific rights in respect of their new invention or work. There are many kinds of plagiarism, including copying and pasting, reprocessing and paraphrasing the text, plagiarism of ideas, and plagiarism by converting one language to another. Plagiarism is one of the serious problems in education. The number of arguments picked up by using the fuzzy inference system (FIS) is greater than that detected using the argument weight method in [31].
Fuzzy logic may also tackle the issue of uncertainty in argument selection that plagiarizing users introduce. One of the most difficult challenges in the world of plagiarism detection is accurately distinguishing between plagiarized and non-plagiarized content. Plagiarism detection software currently uses character matching, n-grams, chunks, and keywords to find inconsistencies. A novel approach to detecting plagiarism, based on semantic role labeling and fuzzy logic, is proposed in this paper, and such approaches are likely to be used in the future.

SRL [34] has been used in natural language processing approaches such as text clustering [32] and text classification [33]. Osman et al. have proposed an improved plagiarism detection method based on SRL [5]. The suggested approach examined the behavior of the plagiarizing user using an argument weighting mechanism. Not all arguments affect plagiarism detection. Using fuzzy rules and fuzzy inference systems, we therefore try to identify the most essential arguments in a plagiarized text. Fuzzy logic, a kind of approximate reasoning, is a strong tool for decision support and expert systems. It is possible that most human thinking is based on fuzzy facts, fuzzy connectives, and fuzzy rules of inference [2]. The *t*-test significance procedure was used to demonstrate the validity of the findings obtained using the new method's fuzzy inference system.

The main contribution of the proposed method is a thorough plagiarism detection technique that addresses many types of plagiarism, including copy–paste plagiarism, rewording or synonym replacement, changing word structure in sentences, and switching from the passive to the active voice and vice versa. The SRL was utilized to perform semantic analysis on the sentences, and the concepts or synonyms for each word in the phrases were extracted using the WordNet thesaurus. These points are the main differences between our proposed method and other techniques. The second aspect is the comparison process. Whereas prior approaches have concentrated on conventional comparison techniques such as character-based and string matching, our suggested method uses SRL as a method of analysis and comparison to capture the plagiarized meaning of a text. The crucial aspect of this variation is an increase in our suggested method's similarity score employing the fuzzy logic algorithm, which none of the previous approaches has employed before.

#### **3. Fuzzy Logic System**

Many prediction and control systems, fuzzy knowledge management systems, and decision support systems have shown success with the fuzzy logic system [35–37]. It is often utilized for confusing and obscure information. The connection between the inputs and intended outputs of a system may be determined using this technique, and assumptions and approximations may be taken into account while making a choice. A fuzzy inference system comprises a set of predetermined rules and a defuzzification method.

Fuzzy logic, first described by Zadeh [38], was employed by Mamdani to regulate a modest laboratory steam engine. Thanks to a mathematical treatment of ambiguous reasoning, it is possible to obtain decision-making models in linguistic terms. For many applications and complex control systems, fuzzy logic has recently emerged as one of the most effective methods. More than 1000 industrial and commercial fuzzy applications have been successfully created in the last several years, according to Munakata and Jani [39]. Fuzzy logic's distinctive properties are at the heart of the current endeavour.

The theory of fuzzy sets offers a foundation for the use of fuzzy logic. It became necessary to adapt traditional logic, which deals with only two values, true and false, to cope with partial truths (statements neither completely true nor completely false). Thus, fuzzy logic is an extension of classical logic that generalizes the classical inference rules so that they are capable of handling approximate reasoning. Each member of a "fuzzy set", an extension of the standard "crisp set", has a degree of membership in that set, which is defined by a membership function. The membership function assigns each member of the target set a membership degree between zero and one [35]. Based on a set of fuzzy "IF–THEN" principles, the computer can convert linguistic statements into actions. Conditions are linked with actions in fuzzy IF–THEN rules, of which "if A then B" is the most common form. The rules are easily understood and simple to alter: new rules may be added or current rules deleted.

By applying a membership function to the fuzzy sets of linguistic terms, input values are converted into degrees of membership (in the [0, 1] range). As shown in Equation (1), each element *x<sup>i</sup>* of the universe *X* is assigned a membership degree *µ<sup>i</sup>*, giving the fuzzy set [40].

$$A = \mu\_i k(\mathbf{x}\_i), \quad \mathbf{x}\_i \in X \tag{1}$$

Here, *A* is a fuzzy set over the universe of discourse *X*, and the membership values in the fuzzy set range from 0 to 1.

In the context of fuzzification, *k*(·) is referred to as the kernel, and the fuzzified version of *A* is obtained as shown in Equation (2):

$$A = \mu\_1 k(\mathbf{x}\_1) + \mu\_2 k(\mathbf{x}\_2) + \dots + \mu\_n k(\mathbf{x}\_n) \tag{2}$$
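Equation (2) can be read as a list of (membership degree, element) pairs. The following is a small numeric sketch, assuming a triangular membership function; the triangle's parameters are an illustrative choice, not prescribed by the paper:

```python
# Sketch: fuzzify crisp inputs into membership degrees mu_i in [0, 1],
# the building blocks of the discrete fuzzy set in Equation (2).

def triangular(x, a, b, c):
    """Membership degree of x for a triangle rising on [a, b], falling on [b, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# A = mu_1 k(x_1) + mu_2 k(x_2) + ...: pair each element with its degree.
xs = [2.0, 5.0, 8.0]
A = {x: triangular(x, a=0.0, b=5.0, c=10.0) for x in xs}
# A[5.0] is 1.0 (the peak); A[2.0] and A[8.0] fall off linearly to 0.4.
```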

To execute fuzzy reasoning, the inference element of a fuzzy system combines facts collected via the fuzzification process with a set of production rules [40]. The FIS is shown in Figure 1.


**Figure 1.** The fuzzy inference system phases.

Figure 1 depicts the fuzzy inference system (FIS) phases. The initial stage in the FIS is to use membership functions contained in the fuzzy knowledge base to transform explicit input into linguistic variables. The fuzzy input is then transformed into a fuzzy output by using an IF–THEN type of fuzzy rule. The final step converts the fuzzy output of the inference engine into a clean output using a membership function similar to that used by fuzzifiers.
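The three phases can be sketched in a minimal, self-contained Python example. The linguistic terms, membership functions, output centers, and the two rules below are illustrative assumptions, not the paper's actual rule base:

```python
# Minimal Mamdani-style fuzzy inference sketch covering the three FIS phases:
# fuzzification, IF-THEN rule evaluation, and defuzzification.

def tri(x, a, b, c):
    """Triangular membership function peaking at b on [a, c]."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def infer(similarity):
    """Map a crisp similarity in [0, 1] to a crisp 'argument importance'."""
    # Phase 1 -- fuzzification: membership in the linguistic terms low/high.
    low = tri(similarity, -1.0, 0.0, 1.0)
    high = tri(similarity, 0.0, 1.0, 2.0)
    # Phase 2 -- rules: IF similarity IS low THEN importance IS low;
    #            IF similarity IS high THEN importance IS high.
    # Phase 3 -- defuzzification: weighted average of singleton outputs.
    out_low, out_high = 0.2, 0.9  # assumed output centers
    total = low + high
    return (low * out_low + high * out_high) / total if total else 0.0
```

With these assumed terms, `infer(0.0)` yields 0.2 and `infer(1.0)` yields 0.9, while intermediate similarities interpolate between the two rule outputs.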

#### **4. Proposed Method**

The following are the four key phases of the suggested procedure:

- Segmentation, stop words removal, and stemming
- Extraction of arguments and semantic role labeling
- Extraction of concepts
- Fuzzy SRL

Figure 2 depicts the suggested method's overall design. The next sections contain more information on each of these stages.

#### *4.1. Data Preparation*

Preparation of the data included text segmentation, stop word removal, and word stemming. Text was segmented into sentences using text segmentation software [41]. To eliminate pointless words, the stop words removal technique was used. Prefixes and suffixes were also removed using the stemming technique to uncover the base word of a term. Stop words were culled from the text while the remaining words were kept. As a result, there may have been a decrease in the apparent similarity of papers.

Text Segmentation: Natural language processing (NLP) relies heavily on pre-processing. Simple text segmentation is a sort of pre-processing in which text is divided into meaningful chunks. Separating text into individual phrases, words, or themes is a common practice. Steps such as information extraction, semantic role labeling, syntax parsing, machine translation, and plagiarism detection all rely heavily on this stage. Sentence segmentation is conducted using boundary detection and text segmentation. The most common way to denote the end of a phrase is a period (.), exclamation point (!), or question mark (?) [42]. The first phase in our suggested text segmentation approach was sentence-based text segmentation, in which the original and comparison texts were broken down into sentence units. We decided to utilize this method because our suggested technique's goal is to compare suspicious text with the source.
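The boundary-detection step described above can be sketched with a simple regular expression. This is a simplification of the dedicated segmentation software the paper uses [41], and it does not handle abbreviations or decimal points:

```python
# Sketch of sentence-boundary segmentation: split text after '.', '!' or '?'.
import re

def segment_sentences(text):
    # Split after ., ! or ? followed by whitespace; the delimiter is kept
    # with its sentence via the lookbehind assertion.
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sentences = segment_sentences("Chester kicked the ball. Was it kicked hard? Yes!")
# -> ['Chester kicked the ball.', 'Was it kicked hard?', 'Yes!']
```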

**Figure 2.** Fuzzy semantic approach to detecting plagiarism.

Stop Words Removal and Stemming Process: Stop words are common occurrences in written materials. Words such as "the", "and", and "a" are examples. As a result of their omission from the index, these keywords have no hint values or meanings associated with their content [43]. According to Tomasic and Garcia-Molina [44], these words account for 40% to 50% of the total text words in a document collection. Automatic indexing may be sped up and index space saved by eliminating stop words, which does not affect retrieval effectiveness [45]. There are a variety of methods for determining stop words, and each has its own advantages and disadvantages. There are a number of English stop word lists now in use in search engines. To make the system work faster, we devised a solution that eliminated all the text's stop words. The SMART information retrieval system at Cornell University, employing the Buckley stop words list [46], is the basis for our suggested technique.

Stemming is another text pre-processing step. Currently, there are several English-language stemmers to choose from that are comprehensive and in-depth. The well-known English stemmers, such as Nice Stemmer, Text Stemmer, and Porter Stemmer, are only a few examples. A term's inflectional and derivationally related forms are reduced to a generic base form using the Porter stemming method. As an example, consider the following:

am, is, are ⇒ be

article, articles, article's, articles' ⇒ article

Information retrieval challenges such as word form variations may be addressed with stemming (Lennon et al., 1981). It is not uncommon for a word to be misspelled or a phrase to be shortened or abbreviated, for a variety of reasons.

The stemming process produces a different word n-gram set, which is then used for similarity matching between texts using the proposed method.
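The pipeline above (stop-word removal, suffix stemming, and the resulting word n-gram set used for similarity matching) can be sketched as follows. The stop list and suffix rules are tiny illustrative stand-ins for the Buckley list and the Porter stemmer:

```python
# Sketch of the preprocessing pipeline: stop-word removal, crude suffix
# stemming, and Jaccard similarity over word n-gram sets.

STOP_WORDS = {"the", "and", "a", "is", "was", "by", "of"}  # toy subset

def stem(word):
    """Naive suffix stripping, a stand-in for the Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def ngram_jaccard(a, b, n=2):
    """Jaccard similarity over word n-gram sets, as used for matching."""
    grams = lambda ws: {tuple(ws[i : i + n]) for i in range(len(ws) - n + 1)}
    ga, gb = grams(preprocess(a)), grams(preprocess(b))
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0
```

For instance, `preprocess("Chester kicked the ball")` drops "the" and stems "kicked" to "kick" before the n-gram sets are compared.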

#### *4.2. Extraction of Arguments and Semantic Role Labeling*

Semantic role labeling is a technique for identifying and classifying arguments in a piece of writing. Essentially, a semantic analysis of a text identifies all of its other concepts' arguments. In addition to determining "subject", "object", "verb", or "adverb", it may also be used to characterize elements of speech. Each word in the suspected and source sentences is labeled with its matching role throughout the roles labeling procedure. As a result of this research, semantic role labeling based on the sentence level was offered as a unique plagiarism detection approach.

Using semantic role labeling (SRL), a method for comparing the semantic similarity of two papers, one may determine whether the ideas in both documents are arranged similarly. In this research, ideas were labeled with role labels and gathered into groups. Groups were employed in this manner as a fast guide to collect the suspicious portion of the text. An example of a plagiarized situation follows:

Example (1):

Chester kicked the ball. (Original)

The ball was kicked by Chester. (Suspected)

By using the Online Demo of SRL (http://cogcomp.cs.illinois.edu/page/research, accessed on 22 July 2022), the produced arguments are:

Analyses of the original phrase ("Chester kicked the ball") and the suspected phrase ("The ball was kicked by Chester") using SRL are shown in Figures 3 and 4:


**Figure 3.** An analysis of the original phrase based on SRL.


**Figure 4.** An analysis of the suspected phrase based on SRL.

The syntax of the two phrases above may alter depending on whether the active or passive voice is employed and whether synonyms and antonyms are utilized. In fact, the semantics of these two phrases are quite similar. In spite of the labels being moved about inside the sentences, the SRL still manages to capture the arguments (subject, object, verb, and indirect object) for a sentence. Our suggested approach of plagiarism detection, based on a comparison of the sentence's arguments, is supported by this capture.

According to the SRL scheme of similarity [47], original and suspected papers were checked for similar keywords. When two words are found to be the same, we go straight to the argument label and compare the phrases in which they are conveyed. After identifying potential plagiarized phrases, this phase compares the argument labels of those sentences with the argument labels of the original phrases. In order to make an accurate comparison, the words must be compared correctly. The plagiarism ratio may be incorrect if we compare the phrases in Arg0 (subject) in the suspected text with all other arguments in the original text. For example, comparing the subject with the adjective argument (Arg–Adj) to the subject with the time argument (Arg–TMP) is an unfair general-purpose argument (Arg–O).

String matching [48,49] and n-gram [15] are two examples of approaches that compare each word in a suspected sentence to the original phrase. The terms "ball", "kicked", and "Chester" will be compared. Aside from the fact that this comparison is incorrect, it also consumes comparison time. Our technique compares the reasons in the suspected sentence phrase to those in the original phrase to see whether they are comparable. Subject to subject comparisons, verb to verb comparisons, etc., are all possible with our suggested SRL technique. This will reduce the number of comparisons we have to make. No comparison will be made between arguments in questionable papers and arguments from the actual source materials. When comparing original and suspected phrases, we can see that active and passive synonyms have different structures and term positions when compared to their passive counterparts, as seen in this example. These two phrases are, in fact, semantically interchangeable. The researchers have found that, despite altering synonyms within phrases, their technique managed to capture the semantic meaning of a statement. Using the WordNet concept extraction, our suggested approach of plagiarism detection may be supported.
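The role-to-role comparison can be sketched by representing each labeled sentence as a mapping from argument labels to phrases. The dictionaries below are hand-written stand-ins for real SRL output, and the matching criterion is deliberately minimal:

```python
# Sketch of role-to-role comparison: only like-labeled arguments are
# compared (Arg0 vs Arg0, V vs V, ...), so unfair cross-role comparisons
# such as Arg0 vs Arg-TMP are never made.

def role_overlap(src, sus):
    """Fraction of shared argument labels whose phrases match."""
    shared = set(src) & set(sus)
    if not shared:
        return 0.0
    hits = sum(1 for role in shared if src[role].lower() == sus[role].lower())
    return hits / len(shared)

# Hand-written stand-ins: the passive-voice suspect is re-labeled by SRL
# so that "Chester" remains Arg0 and "the ball" remains Arg1.
original = {"Arg0": "Chester", "V": "kicked", "Arg1": "the ball"}
suspected = {"Arg0": "Chester", "V": "kicked", "Arg1": "the ball"}
score = role_overlap(original, suspected)  # every shared role matches
```

A fuller version would compare phrases via synonym sets rather than exact string equality, in line with the WordNet-based concept extraction described next.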

#### *4.3. Concept Extraction*

The extraction of concepts is an important part of our detection process. WordNet [50] is used in this research as a source of synonyms and related words. It encodes lexical semantic connections, which are relationships between words. The WordNet system quantifies semantic similarity, since the closer two words are to one another, the more similar is the structure of their connection, and the more frequent are the lexical units shared between them. Using the WordNet Thesaurus as a starting point, we begin the process of identifying key concepts. The procedure is as follows: using the WordNet synset (synonym set) of each word used in the text, the document's terms are mapped onto the WordNet Thesaurus database. WordNet is structured around the concept of synsets. A synonym set is a collection of words or phrases that have the same meaning in a certain context. Using an example of a synset from the WordNet Thesaurus database, we can better illustrate what we have said so far regarding concept extraction.
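As a rough illustration of the mapping step, the sketch below uses a tiny hand-made synset table in place of the WordNet Thesaurus. The table entries are invented stand-ins, not actual WordNet synsets; in practice one would query WordNet itself (e.g. via NLTK's `wordnet.synsets`):

```python
# Illustrative stand-in for the WordNet synset lookup described above.
TOY_SYNSETS = {
    "canine": {"dog", "carnivore", "mammal"},
    "chap": {"fellow", "lad", "male"},
}

def extract_concepts(terms):
    """Map each document term onto its synset; unknown terms map to themselves."""
    concepts = set()
    for term in terms:
        concepts |= TOY_SYNSETS.get(term.lower(), {term.lower()})
    return concepts

c = extract_concepts(["Canine", "Chap"])
# c now holds the union of both terms' synonym sets.
```

Two documents that use different surface words ("canine" vs. "dog") can then be matched at the concept level rather than the string level.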

Figure 5 demonstrates synset extraction for the terms "Canine" and "Chap", drawn from the papers "A semantic approach for text clustering using WordNet and lexical chains" and "Comparative cluster labeling involving external text sources".

In the above example of the synsets of the terms "Canine" and "Chap", the word "canine", for instance, may refer to a variety of different things depending on the context: feline, carnivore, automobile, mammal, placental, and many more. Hyponymy (between specialized and more general ideas) and meronymy (between parts and wholes) are examples of semantic relations that connect synsets together. Figure 5 provides an example of synset relations using the WordNet database.


**Figure 5.** Terms synsets extraction.

#### **5. Fuzzy SRL**

Semantic role labeling is a method for detecting plagiarism by comparing two phrases' semantic similarities. In this section, the concept of the suggested approach is detailed.


The SRL similarity metric introduced and explained in [28] is used to determine the argument similarity score. For plagiarism detection, fuzzy logic is utilized as an argument selection technique to choose the most relevant arguments.

Using a fuzzy decision-making framework, Vieira [34] suggested a fuzzy criterion for feature selection. Classical multi-objective optimization has the challenge of balancing the weights of many objectives; our technique does not have that problem. Fuzzy logic elements were built into our system to determine the degree of resemblance between the suspect and the source documents. In the FIS system, we created a feature vector for each sentence (S) as follows: S = (A_F1, A_F2, ...), where A_F1 represents the first argument feature, and so on. By comparing the documents, we may infer their values. Following the fuzzy logic approach, the argument scores are formed, and then a final set of high-scoring arguments is selected to combine with the similarity detection based on the comparison. Algorithm 1 below outlines the phases of our new technique.
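The per-sentence feature vector S = (A_F1, A_F2, ...) described above can be sketched as follows. The role inventory and the scores are illustrative assumptions, not the paper's actual feature set:

```python
# Sketch: assemble the sentence feature vector from per-argument
# similarity scores produced by the comparison step.

def sentence_feature_vector(arg_scores, roles):
    """Order the argument similarity scores into a fixed-length vector;
    roles absent from the sentence contribute 0.0."""
    return [arg_scores.get(role, 0.0) for role in roles]

ROLES = ["Arg0", "V", "Arg1", "Arg-TMP", "Arg-LOC"]   # assumed inventory
scores = {"Arg0": 0.9, "V": 1.0, "Arg1": 0.7}          # placeholder scores
S = sentence_feature_vector(scores, ROLES)
# S == [0.9, 1.0, 0.7, 0.0, 0.0]
```

This vector is what the FIS would consume as its inputs, one entry per argument feature.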

Stemming, on the other hand, is a text pre-processing technique. Stemming addresses the issue of word-form variation in information retrieval [34]. Spelling mistakes, alternative spellings, multi-word constructions, transliteration, affixes, and abbreviations are the most prevalent kinds of variation. Matching algorithms suffer a loss of efficiency because of the wide range of word forms used in the information retrieval process, and using root words in pattern matching improves retrieval significantly. This phase was completed using the Porter stemming strategy [35]. Extracting the most important words from a piece of text is an important part of our suggested technique; without this step, our method's ability to detect similarity between papers may suffer.


**Algorithm 1** An improved plagiarism detection method based on fuzzy logic.

#### *5.1. Membership Functions and Inference System*

A fuzzy system relies on the ability to make inferences. In order to perform fuzzy reasoning, the data gathered through the fuzzification process are combined with a sequence of production rules [34]. To translate numerical data into linguistic variables and execute reasoning, fuzzy expert systems and fuzzy controllers need preset membership functions and fuzzy inference rules [35]. The magnitude of each input's involvement is represented graphically by the membership function.

Fuzzy logic-based plagiarism detection was implemented with several inputs, each a similarity score between an individual argument of an original sentence and the corresponding argument of a suspected sentence, and one output value y, the similarity score between all arguments of the original and suspected sentences. The goal here is to demonstrate Mamdani's fuzzy inference given a collection of fuzzy rules by example. To represent each individual linguistic variable, there are several kinds of "membership functions" on the inputs and outputs of this system. The linguistic variables for x and y, for example, comprise significant and insignificant components that must be considered. These functions are shown in Figure 6.
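The paper builds its membership functions in MATLAB's FIS Toolbox; as a rough stand-alone illustration, the sketch below defines two triangular membership functions over a [0, 1] similarity score for the linguistic values "unimportant" and "important". The breakpoints are assumptions for illustration, not the paper's actual shapes:

```python
# Illustrative triangular membership functions for one input variable.
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def memberships(score):
    # Assumed breakpoints: "unimportant" peaks at 0, "important" peaks at 1,
    # both crossing zero at the 0.5 threshold used later in the paper.
    return {
        "unimportant": tri(score, -0.001, 0.0, 0.5),
        "important": tri(score, 0.5, 1.0, 1.001),
    }

low = memberships(0.25)   # mostly "unimportant"
high = memberships(0.75)  # mostly "important"
```

Each crisp similarity score is thus fuzzified into degrees of membership in the two linguistic values before the rule base is applied.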




**Figure 6.** Input MF for fuzzy model.



There were two linguistic values assigned to each input in the suggested method, "important" and "unimportant". Similarity scores were calculated for input and output using the fuzzy membership function, which yielded important and unimportant scores, depending on whether the score was larger or smaller than 0.5. FIS Toolbox in MATLAB was used to calculate the membership function. Using this toolkit, non-linear processes with fuzzy rules created automatically in the FIS environment may be perfectly modelled. All of this information was entered into a computer program that determined the answer. Each rule in the system was seen as crucial to the generation of numerical forecasts. Although each argument's similarity score was used to reflect its input value, an overall score was used to show how similar each argument was to other likely suspect phrases. Section 5 explains how the arguments' similarity was determined.

#### *5.2. Fuzzy IF–THEN Rules Construction*

When dealing with an inference engine, a good understanding of the fuzzification rules is critical. The fuzzification rule base comprising the IF–THEN rules generates the linguistic parameters for the middle and yield variables outlined above. This set of IF–THEN rules extracts the most significant arguments based on our criterion. Based on the input characteristics, a popular approach for constructing rules was used to extract and create all available rules. The following equation was used to obtain the total number of rules:

$$
\mathcal{R} = f^n \tag{3}
$$

where *R* denotes the total number of rules, *f* denotes the number of linguistic (logic) values per input, and *n* denotes the number of input features.

For example, in a five-input system with two logic values for each input (true and false), the total number of rules created is 2^5 = 32. All potential rules that help the inference system distinguish between significant and insignificant arguments were generated by Equation (3) using our suggested technique.
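The rule enumeration in the example above can be sketched as follows; the role names used as input labels are illustrative assumptions:

```python
# Sketch: enumerate every IF-THEN rule antecedent for five inputs with two
# linguistic values each, matching the 2^5 = 32 count quoted above.
from itertools import product

VALUES = ("important", "unimportant")
INPUTS = ["Arg0", "V", "Arg1", "Arg-TMP", "Arg-LOC"]  # assumed input roles

rules = [dict(zip(INPUTS, combo))
         for combo in product(VALUES, repeat=len(INPUTS))]
# Each rule antecedent assigns one linguistic value to every input.
```

Rule-reduction techniques, as discussed next, then prune this exhaustive set down to a manageable rule base.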

Our suggested technique was put to the test with over 1000 papers, yielding a massive number of rules. Although it was difficult to capture all of the created rules in the fuzzy system, it was a crucial concern, and it was imperative that the number of rules be reduced. This issue was addressed using a mix of rule-reduction techniques [34]. These rules reflect the membership function's inputs and outputs. Around 1000 papers supplied the relevant arguments in the training data set for the proposed technique. Figure 7 depicts the three-dimensional fuzzy rule graphs of our suggested technique.





**Figure 7.** Important argument similarity using fuzzy model.



In order to improve the detection of similarities, the suggested method aims to choose only the strongest arguments, those that may have a significant impact on the plagiarism process. The fuzzy IF–THEN rule base is an important part of the FIS. Prior to reducing the number of rules, all available rules were retrieved. The most essential arguments were chosen for the second round of testing comparisons; arguments that were deemed insignificant by the FIS were not taken into consideration. After deciding on the arguments, a test was carried out. The degree of similarity depends on the number of arguments retrieved from the sentences; therefore, discarding the insignificant arguments leads to an increase in the similarity score, as was discovered when comparing the findings from the first test. The CS11 human plagiarism corpus was used to obtain the matching score. In the next sections, the calculation of similarity is detailed.

#### *5.3. Defuzzification*

Defuzzification is the final phase in the fuzzy logic procedure. A final score is assigned to each argument during defuzzification, based on the inference system findings. A fuzzy set's aggregate output is utilized as the input and the outcome is a single value. Defuzzification must be finished before a single value output may be generated. According to Mogharreban [34], there are several defuzzification methods. Fuzzy reasoning systems benefit from our usage of the maximum mean defuzzification approach.

The maximum mean: The mean of maxima is computed using the distribution of output to get a single value. The equation below shows how this is done:

$$\frac{\sum\_{j=1}^{q} Z\_j \mu\_c(Z\_j)}{\sum\_{j=1}^{q} \mu\_c(Z\_j)}\tag{4}$$

$$z^{*} = \sum\_{j=1}^{l} \frac{Z\_j}{l} \tag{5}$$

where *l* is the number of points at which the output distribution reaches its maximum, *z*\* is the mean of maxima, and *Z<sub>j</sub>* is a point at which the membership function attains its maximum.
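The mean-of-maxima step of Equation (5) can be sketched directly; the output points and membership values below are illustrative numbers, not from the paper:

```python
# Sketch of mean-of-maxima defuzzification: average the output-domain
# points at which the aggregated membership is maximal.

def mean_of_maxima(zs, mu):
    """zs: candidate output points; mu: aggregated membership at each point."""
    peak = max(mu)
    maxima = [z for z, m in zip(zs, mu) if m == peak]
    return sum(maxima) / len(maxima)

z = mean_of_maxima([0.0, 0.25, 0.5, 0.75, 1.0],
                   [0.1, 0.4, 0.9, 0.9, 0.2])
# z == 0.625, the mean of the two peak locations 0.5 and 0.75
```

The single crisp value z is then used as the argument's final score.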

#### **6. Experimental Design and Dataset**

Experiments were conducted to determine how many sentences from the original papers were found to be plagiarized. The tests were carried out on the PAN-PC dataset [2], in which each suspected text is based on one or more original parts. The new method was applied by searching the allegedly suspicious texts against the originals. The texts were arranged into groups of increasing size: the first group consisted of five texts, and subsequent groups contained 10, 20, 40, 100, and 1000 texts. Grouping is a useful technique [5] for identifying how a plagiarized argument performs under various conditions. A total of 1000 documents were used in the studies after analyzing the arguments' behavioral patterns. Each group and each argument was selected as an input variable in the FIS, and the input variable's values are the similarity scores of each pair of comparable arguments. As part of the data training, the trials were carried out on the PAN-PC dataset and then tested on a large sample of 1000 documents. It was discovered that important arguments may be picked using the FIS. After the arguments were picked, a second round of testing was conducted. The degree of similarity was found to depend on the number of arguments retrieved from the sentences, and by discarding the unimportant arguments, the similarity score was found to be higher, as in the initial testing. The PAN-PC dataset was then used to cross-check the results.

The CS11 human corpus was used in an additional experiment. The problem with the PAN-PC corpus is that most of the plagiarism instances were intentionally manufactured. There are 100 instances of plagiarism in the CS11 short-answer-questions corpus, according to Clough and Stevenson [36]. Examples of plagiarized texts of varying degrees of plagiarism may be found in this resource. Since the Clough and Stevenson corpus was created and built by real people rather than computer programs, it provides a more realistic picture of the actions of people who have copied work. Each document in the corpus has at least one suspiciously copied section, as well as five original sentences taken from Wikipedia. Native and non-native speaking students were asked to respond to five questions based on the original materials. Answers followed the instructions provided by the corpus designers and, with the exception of non-plagiarized examples, were based on actual texts with varying degrees of text overlap. The short answers averaged 200–300 words. Near-copy (19), heavy-revision (19), and light-revision (19) instances were found in 57 samples, while the remaining 38 samples were found to be plagiarism-free. The following is an example of the questionable document categories:

- Non-plagiarism: participant information was incorporated into the writings without altering the originals.


The matching arguments and the arguments included in the sentences are both taken into account when determining similarity. When comparing the two documents, the first variable identifies arguments that are similar in both, while the second identifies arguments that appear in only one of the two texts. The Jaccard coefficient was used to determine the matching among the arguments in the original and the suspected texts.

$$\text{similarity}(\mathbf{C}\_i(\text{Arg}T\_j, \text{Arg}T\_k)) = \frac{|\mathbf{C}(\text{Arg}T\_j) \cap \mathbf{C}(\text{Arg}T\_k)|}{|\mathbf{C}(\text{Arg}T\_j) \cup \mathbf{C}(\text{Arg}T\_k)|} \tag{6}$$

where *C(ArgTk)* = concepts of the original document's argument text and *C(ArgTj)* = concepts of the suspected document's argument text.
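Equation (6) is the Jaccard coefficient over concept sets and can be sketched directly; the concept sets below are invented for illustration:

```python
# Sketch of Equation (6): Jaccard coefficient between the concept set of an
# original argument and that of a suspected argument.

def jaccard(concepts_a, concepts_b):
    """|A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    if not concepts_a and not concepts_b:
        return 0.0
    return len(concepts_a & concepts_b) / len(concepts_a | concepts_b)

sim = jaccard({"dog", "carnivore", "mammal"}, {"dog", "mammal", "pet"})
# sim == 2 / 4 == 0.5
```

A pair of arguments scoring above the 0.5 threshold used later in the paper would be treated as matching.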

Using the following equation, we estimated similarity between the original and the suspicious texts:

$$\text{TS}(\text{txt1}, \text{txt2}) = \sum\_{i=1}^{l} \sum\_{j=1}^{m} \sum\_{k=1}^{n} \text{SimC}\_{i}(\text{ArgT}\_{j}, \text{ArgT}\_{k}) \tag{7}$$

where TS is the total similarity score, *m* is the number of argument texts in the original document, *n* is the number of argument texts in the suspected document, *l* is the number of concepts, and *i* indexes the matching between an argument text in the original document and one in the suspected document sharing concept *i*.
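A simplified reading of Equation (7) can be sketched by summing the pairwise concept similarities over all argument pairs; the concept sets are illustrative, and the flat double loop is an assumption standing in for the paper's per-concept matching:

```python
# Sketch of Equation (7): accumulate concept-level Jaccard similarities
# over the argument texts of the original and suspected documents.

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 0.0

def total_similarity(orig_args, susp_args):
    """orig_args / susp_args: lists of concept sets, one per argument text."""
    return sum(jaccard(a, b) for a in orig_args for b in susp_args)

ts = total_similarity([{"dog"}, {"kick"}],
                      [{"dog", "pet"}, {"kick"}])
# 0.5 ({"dog"} vs {"dog","pet"}) + 1.0 ({"kick"} vs {"kick"}) == 1.5
```

In the full method, only the important arguments selected by the FIS would contribute terms to this sum.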

#### **7. Results and Discussion**

Plagiarized materials were copied and pasted, synonyms were changed, and sentences were restructured in a variety of ways (paraphrasing). Three typical testing measures for plagiarism detection were utilized, as described in Equations (8)–(10).

$$\text{Recall} = \frac{\text{No. of detected Args}}{\text{Total no. of Args}} \tag{8}$$

$$\text{Precision} = \frac{\text{No. of plagiarized Args}}{\text{No. of detected Args}} \tag{9}$$

$$\text{F-measure} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}} \tag{10}$$
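Equations (8)–(10) can be computed as follows, treating arguments as the unit of detection; the counts used are invented for illustration:

```python
# Sketch of Equations (8)-(10) with arguments as the detection unit.

def recall(detected, total):
    return detected / total

def precision(plagiarized_detected, detected):
    return plagiarized_detected / detected

def f_measure(r, p):
    return 2 * r * p / (r + p)

# Hypothetical counts: 10 plagiarized arguments in total, 8 detected,
# of which 6 are truly plagiarized.
r = recall(8, 10)       # 0.8
p = precision(6, 8)     # 0.75
f = f_measure(r, p)     # harmonic mean of the two
```

The F-measure rewards methods that balance recall against precision rather than maximizing either alone.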

Using the collection of documents we chose, we ran the tests shown in Figure 8, which displays the findings. For the similarity computation, a set of documents is represented by a row.


**Figure 8.** Role similarity scores across the set of documents.

The SRL was employed to break down the text into different arguments, examples of which are shown in Table 1 below.


**Table 1.** Argument types and their descriptions.

Experiments employed a variety of argument types and descriptions, as shown in Table 1.

Each pair of documents is represented in Table 2 by the percentage of similarity between the suspected and original documents. Recall, precision, and F-measures all have scores over 0.58, whereas all recall measures have scores above 0.80. Table 2 shows that the scores are all larger than 0.5, which indicates that the findings are excellent, but it was still possible to enhance these scores to obtain better similarity values.



The FIS has shown that a writer who plagiarizes does not concentrate on all of the arguments in a statement; therefore, certain arguments are left out. Such arguments are said to be insignificant. Table 2 shows the outcomes of the FIS across the SRL sentences.

Table 2 shows the ranking of the SRL arguments using the FIS. In order to test our method, we used a variety of groupings of documents (5, 10, 20, 40, 100, and 1000). These allegedly plagiarized texts used a variety of plagiarism strategies, including copying and pasting, swapping certain phrases for their counterparts, and altering sentence structure (paraphrasing). There are two kinds of arguments: both have a similarity score larger than 0.5, but the first kind is considered significant while the second is considered irrelevant. Similarity scores were calculated for input and output using the fuzzy membership function, which yielded important and unimportant scores depending on whether the score was larger or smaller than 0.5. The comparison step of the proposed technique uses a Jaccard similarity measure [37] with a threshold value of 0.5 [38–40]; for this reason, we chose 0.5 as our cutoff value. In order to enhance the similarity score, the FIS method chose the most essential arguments. On the other hand, the similarity score was reduced by minor arguments, so unimportant arguments were discarded to avoid diluting the overall resemblance of the original and suspected texts. To determine the degree to which two arguments are similar, the SRL similarity measure developed by Osman et al. [28] is used. A table titled the "similarity scores table" shows all of the similarity ratings between the various arguments, and it serves as an input to the FIS. Its features include the arguments and the overall similarity between them, as well as the number of original and suspected texts in the dataset used in its construction. This system's goal is to increase the similarity scores in plagiarism detection by generating many key arguments.

A common approach used by those who plagiarize is to concentrate on key phrases and then adapt their work to include them. Only the most crucial points that have a significant impact on the reader would be reworked. There are a number of target selection approaches available, all of which aim to anticipate as accurately as possible the essential objectives of the data. FIS is one of these approaches. Statistical significance test (*t*-test) results were used to demonstrate the benefits of the new strategy. These findings are shown in Table 3 and demonstrate the statistical significance of the suggested approach.


**Table 3.** Statistical significance testing using the *t*-test.

There are several metrics in Table 3 that may be compared using the pairs of variables before and after optimization with the FIS-SRL approach, along with their significance, using the paired-samples *t*-test procedure. The paired-samples *t*-test compares the means of two variables representing the same group at different points in time. In the paired-variables statistics table, the mean values of the two variables ((Recall-1, Recall-2); (Precision-1, Precision-2); and (F-measure-1, F-measure-2)) are shown. As a paired-samples *t*-test examines two variables' mean values, it is important to know their averages. The *t*-test may be used to determine whether there is a significant difference between two variables if the significance value is less than 0.05. For example, it was found that the suggested technique had significant recall (0.000928), precision (0.000159), and F-measure results in the significance field of Table 3 (Sig. (2-tailed)). This suggests that the proposed method had significant results in all three areas. The fact that the confidence interval for the mean difference does not include 0 likewise shows that the difference is significant. Comparison of the outcomes before and after optimization indicates a considerable difference.

The presented solution is evaluated on the PAN-PC dataset and compared with other plagiarism detection systems. Figure 9 shows the comparison findings.

**Figure 9.** Contrast between the current text-based similarity detection methods.

A comparison of SRL fuzzy logic with string-similarity, LCS, graph-based methods, semantic similarity methods, and SRL argument weight methods is shown in Figure 9 [29,30,34–36]. The similarity results were shown to be improved with our new technique.

Our suggested technique was compared to Chong's naive Bayes classifier [51], with a set of all features, the best features, and the ferret baseline method, in Tables 4–6 for the similarity classes (heavy revision, light revision, and near copy). Table 4 shows the heavy-revision plagiarism class.

**Table 4.** Heavy plagiarism class.


**Table 5.** Light plagiarism class.


**Table 6.** Cut-and-paste plagiarism class.


Table 5 compares the proposed technique to previous methods on the light-revision plagiarism class. Recall, precision, and F-measure were all found to be best for the suggested technique.

Table 6 shows an assessment of the suggested approach and other methods on the copy-and-paste class. We observed that the suggested technique had the highest scores for F-measure, recall, and precision.

In addition, the amount of time it takes to complete a task is taken into consideration; this metric is often used to evaluate the effectiveness of algorithms. The temporal complexity of the suggested approach was used to assess its suitability, and the suggested method was found to be in the same class as the other methods. There are several plagiarism detection methods in this class, according to research on Maxim Mozgovoy's work and JPlag [34,35]. Even so, they observed that certain plagiarism detection algorithms have a time complexity of O(f(n)·N²), where f(n) is the time it takes to compare a pair of files of length n and N is the collection size (number of files). Time-consuming techniques, such as fuzzy semantic comparison and semantic-based string similarity, were compared. It was demonstrated that semantic-based string similarity, LCS, and semantic-based similarity all have the same level of temporal complexity as the proposed method. Table 7 displays the results in terms of how long each type of method takes.


**Table 7.** Time complexity comparison.

On the other hand, the fuzzy semantic method based on string similarity, semantic similarity, the SRL-based similarity method, the similarity method based on graph-based representation, and the sentence-NLP-based similarity method all have higher time complexity than ours, as shown in Table 7. These findings reveal that the suggested technique falls within a generally recognized category of detection algorithms. Our suggested approach differs from previous methods in the following major respects:

The method we provide can catch instances of plagiarism such as copying and pasting, rewording or replacing words, changing a phrase from active to passive voice or vice versa, and changing the word structure within phrases.

In contrast to earlier methods, which relied on more traditional comparison techniques such as character-based and string matching, SRL is used as the comparison mechanism to analyze and compare text and identify instances of plagiarism. On the PAN-PC-09 dataset, our method outperformed other plagiarism detection methods, such as the longest common subsequence [52], the graph-based method [31], fuzzy semantic-based string similarity [49], and semantic-based similarity [53]. Additionally, our technique outperformed the methods described by Chong [51], including the naive Bayes classifier and the Ferret baseline, on the CS11 corpora.

#### **8. Conclusions and Future Work**

The current study offers a plagiarism detection system that includes the following steps: the two documents are uploaded into a database, where the text is preprocessed by segmenting it into sentences, eliminating stop words, and stemming words to their root forms. Next, the processed text of each document is parsed to find the arguments within it, and each argument found is represented as a member of a group, in order to determine how similar the groups of text are to one another. Semantic role labeling is thus utilized for plagiarism detection by extracting the arguments of sentences and comparing them. A fuzzy inference system (FIS) is applied to select the arguments that have the most impact, so that only the most essential arguments are taken into consideration in the similarity calculation. The approach was tested on the standard dataset for human plagiarism detection (CS-11), and, in comparison to fuzzy semantic-based string similarity, LCS, and semantic-based approaches, the suggested approach was proven to perform better.

A common approach used by those who plagiarize is to concentrate on key phrases and then adapt their work to include them. The proposed method showed that the crucial points that have a significant impact on the reader are the parts most likely to be reworked. The study aimed to identify, as accurately as possible, the essential arguments of the text. The results of statistical significance tests demonstrated the impact and benefits of the new strategy compared with plagiarism detection methods based on other strategies.

The limitations of this research must also be emphasized. This research did not cover some types of plagiarism, such as similarity of non-textual content elements, citations, illustrations, tables, and mathematical equations, although these types are frequently discussed in the literature.

To conclude, the methods of paraphrase-type identification suggested in this research can be used and improved in a wide range of academic contexts. This involves not only support in identifying plagiarism, but also a focus on upholding ethical academic conduct. In future work, genetic algorithms may be employed alongside the FIS to improve the results that can be produced. In addition, the above-mentioned limitation of this study remains a research gap, which will be filled in the future.

**Author Contributions:** Conceptualization, A.H.O. and H.M.A.; methodology, A.H.O.; validation, A.H.O. and H.M.A.; formal analysis, A.H.O.; investigation, A.H.O.; resources, A.H.O.; data curation, A.H.O. and H.M.A.; writing—original draft preparation, A.H.O. and H.M.A.; writing—review and editing, A.H.O. and H.M.A.; visualization, A.H.O.; supervision, A.H.O.; project administration, A.H.O.; funding acquisition, A.H.O. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research work was funded by the Institutional Fund Projects under grant no. (IFPIP: 481-830-1443). The authors gratefully acknowledge the technical and financial support provided by the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This research work was funded by the Institutional Fund Projects under grant no. (IFPIP: 481-830-1443). The authors gratefully acknowledge the technical and financial support provided by the Ministry of Education and King Abdulaziz University, DSR, Jeddah, Saudi Arabia.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
