*Article* **Determination of a Good Indicator for Estimated Prime Factor and Its Modification in Fermat's Factoring Algorithm**

**Rasyid Redha Mohd Tahir <sup>1</sup>, Muhammad Asyraf Asbullah <sup>1,2,</sup>\*, Muhammad Rezal Kamel Ariffin <sup>1,3</sup> and Zahari Mahad <sup>1</sup>**


<sup>3</sup> Mathematics Department, Faculty of Science, Universiti Putra Malaysia, Serdang 43400, Malaysia

**\*** Correspondence: ma\_asyraf@upm.edu.my

**Abstract:** Fermat's Factoring Algorithm (FFA) is an integer factorisation method that factors the modulus *N* using exhaustive search. The Estimated Prime Factor (EPF) method reduces the cost of FFA's loop count; however, EPF does not work for balanced primes. This paper proposes the modified Fermat's Factoring Algorithm 1-Estimated Prime Factor (mFFA1-EPF), which improves the EPF method and factors a modulus with either unbalanced or balanced primes. The main results of mFFA1-EPF focus on three criteria: (i) an approach to select good candidates from a list of continued fraction convergents, (ii) the establishment of new potential initial values based on EPF, and (iii) the application of these modifications to FFA. The resulting study shows a significant improvement that reduces the loop count of FFA1 via (improved) EPF compared to existing methods. The proposed algorithm can be executed without failure and caters for a modulus *N* with both unbalanced and balanced prime factors.

**Keywords:** estimated prime factor; integer factorisation problem; continued fraction; Fermat's Factoring Algorithm

#### **1. Introduction**

Cryptography plays a crucial part in Industrial Revolution 4.0 (IR4), where technology is embedded with artificial intelligence to maintain the security of information. There are two types of cryptography: symmetric and asymmetric. Symmetric cryptography uses the same key for the encryption and decryption processes, while asymmetric cryptography uses a different key for each. The strength of many asymmetric cryptosystems relies on the Integer Factorisation Problem (IFP), one of the oldest hard mathematical problems in history. The IFP is defined as finding the two distinct primes *p* and *q* for a given integer (a modulus) *N* = *pq*, the product of those two primes. Wu et al. [1] mentioned that, in the existing classical sense of computation, a modulus with a minimum 1024-bit length is still very hard to factorise. There are several general-purpose algorithms to solve the IFP, such as Pollard's *p* − 1, the General Number Field Sieve, the Quadratic Sieve, Elliptic Curve Factoring, and Fermat's Factoring Algorithm [2].

Pierre de Fermat explored Fermat's Factoring Algorithm (FFA) as one of the IFP methods used to factor the modulus *N* with balanced primes (Ambedkar et al. [3]). According to Somsuk and Tientanopajai [4], the modulus *N* in FFA is written as $N = \left(\frac{p+q}{2}\right)^2 - \left(\frac{p-q}{2}\right)^2$. In this work, the FFA is categorised into Fermat's Factoring Algorithm 1 (FFA1) and Fermat's Factoring Algorithm 2 (FFA2). FFA1 uses a square root, while FFA2
uses multiplication as its main process leading to the factorisation of *N*. Both methods have their advantages and disadvantages in terms of loop count (iterations) and computational time to complete the factoring process. Many studies have been introduced to improve the FFA and make the algorithm efficient for factoring the modulus *N* [5–8]. The main purpose is to speed up the algorithms: to reduce the loop count, to improve the algorithm's computational time for exhaustive search, or both [9].

The EPF was introduced by Wu et al. [1]. Wu et al.'s study enhances the efficiency of FFA2 by shortening the search for the target values $p + q$ and $p - q$. The EPF method was adopted as a mechanism to reset the initial values of the FFA (in this case, FFA2), which reduces the loop count for the FFA to complete the search and successfully factor the modulus *N*. The authors of [1,9] use the continued fraction of $\frac{1}{\sqrt{N}}$ to produce a list of convergents and create an additional extension for the initial values. Potentially, EPF is considered a good "device" to increase the efficiency of the FFA.

However, as reported in [1], the absence of a deterministic approach to select the required parameter in EPF is the limitation of such an approach, and in most cases it cannot work on balanced primes. Somsuk [5] agreed that EPF works perfectly only on unbalanced primes. By observation and empirical evidence, certain selected values on the convergent list cannot be used as the initial value because they may cause the FFA2 algorithm to fail. Moreover, the authors of [1] overlook a convergent-selecting case that affects the effectiveness of EPF. In finding a solution regarding EPF, FFA1 is chosen as the main integer factorisation method in this study. This is because FFA1 potentially avoids the failure of running the algorithm via EPF and reduces the exhaustive search, as it uses a single loop to run the algorithm. The resulting study shows a significant improvement that reduces the loop count of FFA1 via (improved) EPF compared to previous methods (FFA1, FFA2, FFA2-EPF, and FFA-Euler).

The rest of this paper is organised as follows. Section 2 introduces the background of FFA1, FFA2, and EPF. The definition of a modulus *N* is provided considering unbalanced and balanced primes. Section 3 discusses the methodology that supports the findings of this work. Section 4 presents the mFFA1-EPF, which factors a modulus *N* for both unbalanced and balanced primes. The results of numerical examples are shown and compared with other existing FFA-based methods. The conclusion is drawn in Section 5.

#### **2. Preliminaries**

This section provides some fundamental information for the study about the FFAs and EPF, together with some definitions.

#### *2.1. Balanced and Unbalanced Primes*

This section provides the definitions of balanced and unbalanced primes, which are restructured from [10,11], respectively. The definition of balanced primes is as follows.

**Definition 1.** *Let $N = pq$ be the product of two primes $p$ and $q$. The number $N$ is defined to have balanced primes, where $p$ and $q$ have the same bit-size and satisfy the relation $q < p < 2q$.*

The term unbalanced primes is defined as follows.

**Definition 2.** *Let $N = pq$ be the product of two primes $p$ and $q$. The number $N$ is defined to have unbalanced primes, where $p$ and $q$ have different bit-sizes and satisfy the relation $q < p < \alpha q$, where $\alpha > 2$.*

#### *2.2. Fermat's Factoring Algorithm (FFA)*

De Weger [12] studied FFA1 as an approach to the IFP that factors the modulus $N$ by searching for the values of $p + q$ and $p - q$. The modulus $N$ is written as the difference of squares, $N = \left(\frac{p+q}{2}\right)^2 - \left(\frac{p-q}{2}\right)^2$. As the values of $p + q$ and $p - q$ are unknown, we need

to find the closest approximations of those values. According to Asbullah and Ariffin [10], the smallest value of $p + q$, based on balanced primes, is $2\sqrt{N}$. We start with the initial value $x = \lceil\sqrt{N}\rceil$ and then compute $y = \sqrt{x^2 - N}$. If $y$ is an integer, then we accept the pair $(x, y)$. Otherwise, the algorithm is run by increasing the value of $x$ by 1. Once there is a pair of integers $(x, y)$, the values $p = x + y$ and $q = x - y$ are computed. Note that FFA1 runs only a single loop, searching for the integer value $\frac{p+q}{2}$ via the initial value $x$.
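To make the loop concrete, here is a minimal Python sketch of FFA1 as described above (the function name `ffa1` and the use of `math.isqrt` are our own illustrative choices, not from the paper):

```python
import math

def ffa1(N):
    # Initial value x = ceil(sqrt(N)); x plays the role of (p + q) / 2.
    x = math.isqrt(N)
    if x * x < N:
        x += 1
    while True:
        y2 = x * x - N
        y = math.isqrt(y2)
        if y * y == y2:          # y = sqrt(x^2 - N) is an integer: accept (x, y)
            return x + y, x - y  # p = x + y, q = x - y
        x += 1                   # otherwise increase x by 1 and retry

# The first modulus from Table 1: N = 1,783,647,329 = 84,449 x 21,121
print(ffa1(1783647329))          # (84449, 21121)
```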

Bressoud [13] introduced FFA2, which uses two loops searching for $p + q$ and $p - q$ to reduce the running time. Wu et al. [1] reformulated Bressoud's method. Suppose there is a modulus $N = x^2 - y^2$, where $x = \frac{p+q}{2}$ and $y = \frac{p-q}{2}$. The modulus is rewritten as $4N = u^2 - v^2$, where $u = 2x$ and $v = 2y$. As the actual values of $u$ and $v$ are unknown, set $u = 2\lceil\sqrt{N}\rceil$ and $v = 0$. Now compute $r = u^2 - v^2 - 4N$. If $r = 0$, the solution $(u, v)$ is found; otherwise, there are two cases for the value of $r$ when $r \neq 0$:


When the initial values $u$ and $v$ are created, we need to check the value of $r$. If $r \neq 0$, there are two cases: $r > 0$ and $r < 0$. If $r > 0$, a new $v$ is produced by increasing the old $v$ by 2, and the new $r$ is computed as $r - (4v + 4)$. Otherwise, if $r < 0$, a new $u$ is produced by increasing the old $u$ by 2, and the new $r$ is computed as $r + (4u + 4)$. The iteration of both cases runs until $r = 0$. Once $(u, v)$ is found, the factorisation of $N$ follows as $p = \frac{u+v}{2}$ and $q = \frac{u-v}{2}$.

**Remark 1.** *The sign ← represents an assignment that updates the same variable. For example, $v \leftarrow v + 2$ means the new value of $v$ is computed from the old value of $v$ incremented by 2.*
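A corresponding Python sketch of FFA2 as reformulated above (again, the function name `ffa2` is our own choice):

```python
import math

def ffa2(N):
    # Search for u = p + q and v = p - q such that u^2 - v^2 = 4N.
    u = 2 * math.isqrt(N)
    if u * u < 4 * N:
        u += 2                   # u = 2 * ceil(sqrt(N)), v = 0
    v = 0
    r = u * u - v * v - 4 * N
    while r != 0:
        if r > 0:                # v <- v + 2 lowers r by 4v + 4
            r -= 4 * v + 4
            v += 2
        else:                    # u <- u + 2 raises r by 4u + 4
            r += 4 * u + 4
            u += 2
    return (u + v) // 2, (u - v) // 2   # p = (u+v)/2, q = (u-v)/2

print(ffa2(1783647329))          # (84449, 21121)
```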

#### *2.3. Continued Fraction*

The continued fraction is an expansion method that represents a number as a list of integers. This list of integers establishes the partial quotients, which lead to a list of convergents in the form of rational numbers. The continued fraction is suitable for representing rational and irrational numbers such as $\pi$, Euler's number $e$, and $\sqrt{2}$. Chung et al. [14] mentioned that the continued fraction of a rational number produces a finite list of integers, while that of an irrational number produces an infinite list.

Let $r$ be a real number, which has a unique continued fraction expansion,

$$r = [m_0, m_1, m_2, \dots, m_i, \dots] = m_0 + \cfrac{1}{m_1 + \cfrac{1}{m_2 + \cfrac{1}{\ddots}}}$$

where $m_i \in \mathbb{Z}$ and $i \in \mathbb{N}$. The list of $m_i$ is the integer-list form of $r$; thus, let $r_i$ denote the partial quotients, as follows:

$$\begin{aligned} r\_0 &= m\_0\\ r\_1 &= m\_0 + \frac{1}{m\_1} \\ r\_2 &= m\_0 + \frac{1}{m\_1 + \frac{1}{m\_2}} \\ &\vdots\\ r\_i &= m\_0 + \frac{1}{m\_1 + \frac{1}{m\_2 + \frac{1}{\dots + \frac{1}{m\_i}}}} \end{aligned}$$

The list of partial quotients, $[r_0, r_1, r_2, \dots, r_i]$, is also known as the convergent list. The convergents are used for several purposes, such as shortening the distance of the initial value in Fermat's Factoring Algorithm and approximating a rational number. Wu et al. [1] proposed the Estimated Prime Factor, which is used in applications such as shortening the search distance for FFA2: it gives a "hint" for a new position of the initial values of FFA2. The EPF is discussed in the next section.
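As a sketch of how such a convergent list can be produced for $\frac{1}{\sqrt{N}}$ in particular (the paper does not prescribe an implementation; the quadratic-surd recurrence below is one standard exact way to do it for non-square $N$, and the function names are ours):

```python
import math

def cf_inv_sqrt(N, terms):
    """Partial quotients of 1/sqrt(N) = (0 + sqrt(N))/N for non-square N,
    computed exactly with the standard quadratic-surd recurrence."""
    s = math.isqrt(N)
    m, d = 0, N                  # current surd is (m + sqrt(N)) / d
    quotients = []
    for _ in range(terms):
        a = (s + m) // d
        quotients.append(a)
        m = a * d - m
        d = (N - m * m) // d
    return quotients

def convergents(quotients):
    """Turn partial quotients [m0, m1, ...] into convergents h_i / k_i."""
    h, h_prev = quotients[0], 1
    k, k_prev = 1, 0
    result = [(h, k)]
    for a in quotients[1:]:
        h, h_prev = a * h + h_prev, h
        k, k_prev = a * k + k_prev, k
        result.append((h, k))
    return result

N = 707896463                    # the modulus used later in Example 1
print(convergents(cf_inv_sqrt(N, 12))[-3:])
```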

#### *2.4. Estimated Prime Factor (EPF)*

Wu et al. [1,9] proposed an approach to estimate $p + q$ and $p - q$ using EPF. The authors of [1,9] mentioned that the continued fraction of $\frac{1}{\sqrt{N}}$ gives out partial knowledge of $\frac{D_p - D_q}{D_p D_q}$, which helps to find $p + q$ and $p - q$, as $D_p - D_q$ and $D_p D_q$ are unknown. From the convergent list of $\frac{1}{\sqrt{N}}$, $\frac{h_t}{k_t}$ is selected as the additional extension, where $k_t \lesssim N$ and $h_t < D_p - D_q < h_{t+1}$. An in-depth discussion of the convergent $\frac{h_t}{k_t}$ in EPF is provided in Appendix A.

**Remark 2.** *Let $k_t$ be the denominator of $\frac{h_t}{k_t}$ from the convergent list of $\frac{1}{\sqrt{N}}$. The value of $k_t$ is approximately less than $N$, written $k_t \lesssim N$, meaning $k_t$ needs to be less than and close to the value of $N$. As the value of $N$ is known, it is easy to select $\frac{h_t}{k_t}$ by comparing $k_t$ with the value of $N$; hence $k_t$ is a good indicator for selecting a good convergent of $\frac{1}{\sqrt{N}}$ among the $i$ convergents on the list.*
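Following Remark 2, the indicator $k_t \lesssim N$ can be applied mechanically: scan the convergent list and keep the last $\frac{h_t}{k_t}$ whose denominator is still below $N$ (a sketch; the helper name is ours):

```python
def select_ht_kt(convs, N):
    # Denominators k_i grow monotonically, so the last convergent with
    # k_i < N is the one whose k_t is "approximately less than" N.
    below = [(h, k) for (h, k) in convs if k < N]
    return below[-1]             # returns (h_t, k_t)
```

For instance, combined with the `convergents` sketch above, `select_ht_kt(convergents(cf_inv_sqrt(N, 12)), N)` would return the candidate $(h_t, k_t)$.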

We illustrate the EPF process in Figure 1.

**Figure 1.** The process of Estimated Prime Factor (EPF).

#### **3. Methodology**

As discussed earlier in Section 1, Bressoud [13] claimed that FFA2 has a cheaper loop component, involving no multiplication or division, and exhibits faster computational time. However, FFA2 requires a huge number of cycles because it needs to search for the values of $p + q$ and $p - q$ separately. Compared to FFA2, FFA1 needs to search for the value of $p + q$ only. This eliminates the search for the value of $p - q$ and thus reduces the loop count.

Table 1 shows the comparison of loop counts between FFA1 and FFA2 based on three distinct moduli $N = pq$. The first modulus, $N = 1{,}783{,}647{,}329$, is taken from an example in [1]. Observe that FFA1 attains the smallest loop count (10,552) compared to FFA2 (42,215) and FFA2-EPF (11,455). For $N = 195{,}656{,}557$, again, FFA1 attains the smallest loop count of the three methods. $N = 1{,}952{,}194{,}393$ is selected to illustrate the balanced-prime situation. Similarly, FFA1 recorded the smallest loop count, while the loop count for FFA2-EPF is not available, as the initial values of $u$ and $v$ are larger than $p + q$ and $p - q$. Overall, FFA1 requires a smaller loop count; therefore, it can factor the modulus $N$ relatively faster than FFA2 and FFA2-EPF.


**Table 1.** The comparison of loop count between FFA1, FFA2, and FFA2-EPF based on three distinct moduli *N* = *pq*.

Recall that the initial value of FFA1 starts from $\sqrt{N}$ and is increased by 1 until it reaches the value $\frac{p+q}{2}$. Therefore, there exists a distance, denoted by $d_0$, from the initial value $\sqrt{N}$ to $\frac{p+q}{2}$: $d_0 = \left|\frac{p+q}{2} - \sqrt{N}\right|$ (as shown in Figure 2). The methodology is to introduce a new parameter $\lambda \in \mathbb{N}$ such that $\sqrt{N} + \lambda$ serves as a new initial value for FFA1. The reason is to establish a new distance, $d_{new} = \left|\frac{p+q}{2} - (\sqrt{N} + \lambda)\right|$, as shown in Figure 2, where $d_{new} < d_0$. Therefore, a good $\lambda$ toward the initial value of FFA1 is needed to obtain $d_{new}$.
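As a quick numerical illustration of these distances (using the modulus and convergent numerator that appear later in Example 4; the variable names are ours):

```python
import math

N, p, q = 1783647329, 84449, 21121
h_t = 12963                      # convergent numerator from Example 4
target = (p + q) // 2            # 52,785
d0 = target - math.isqrt(N)      # distance from sqrt(N): 10,552
d_new = target - (math.isqrt(N) + h_t // 2)
print(d0, d_new)                 # 10552 4071, so d_new < d0
```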

In this work, the EPF method is used to extend the initial value of FFA1. We show that the EPF approach is suitable for use in FFA1 under the following conditions. Suppose $x = \frac{p+q}{2}$ and $y = \frac{p-q}{2}$. Note that, from Section 2.4, we have $p = \sqrt{N} + D_p$ and $q = \sqrt{N} - D_q$. Recall that the modulus $N = x^2 - y^2 = \left(\frac{p+q}{2}\right)^2 - \left(\frac{p-q}{2}\right)^2$; therefore,

$$\begin{aligned} N &= \left(\frac{(D_p + \sqrt{N}) + (\sqrt{N} - D_q)}{2}\right)^2 - \left(\frac{(D_p + \sqrt{N}) - (\sqrt{N} - D_q)}{2}\right)^2 \\ &= \left(\frac{2\sqrt{N} + D_p - D_q}{2}\right)^2 - \left(\frac{D_p + D_q}{2}\right)^2 \\ &= \frac{4N + 4\sqrt{N}(D_p - D_q) + (D_p - D_q)^2}{4} - \frac{(D_p + D_q)^2}{4} \\ 4N &= 4N + 4\sqrt{N}(D_p - D_q) + (D_p - D_q)^2 - (D_p + D_q)^2 \\ 4\sqrt{N}(D_p - D_q) &= 4 D_p D_q \\ \sqrt{N}(D_p - D_q) &= D_p D_q \\ \frac{D_p - D_q}{D_p D_q} &= \frac{1}{\sqrt{N}} \end{aligned} \tag{1}$$

From Equation (1), the value $D_p - D_q$ is needed to improve FFA1, and it is obtained from the convergents of the continued fraction of $\frac{1}{\sqrt{N}}$. Recall from Section 2.4 that the continued fraction of $\frac{1}{\sqrt{N}}$ establishes a list of $\frac{h_i}{k_i}$. As a candidate for $\frac{D_p - D_q}{D_p D_q}$, the convergent $\frac{h_t}{k_t}$ is selected with $k_t \lesssim N$, and $h_t$ serves as the additional extension $\lambda$ to improve the initial value of FFA1.

**Figure 2.** The positions of $x = \sqrt{N}$ and $\frac{p+q}{2}$.

Now, the bound of $D_p - D_q$ will be considered. Recall from Section 2.4 that $p = D_p + \sqrt{N}$ and $q = \sqrt{N} - D_q$, so $p + q = 2\sqrt{N} + D_p - D_q$, and we have $D_p - D_q = p + q - 2\sqrt{N}$. As $h_t < D_p - D_q$, thus

$$h\_t < D\_p - D\_q = p + q - 2\sqrt{N}.\tag{2}$$

We proceed with a lemma showing that $x = \lceil\sqrt{N} + \frac{h_t}{2}\rceil$ is always smaller than $\frac{p+q}{2}$.

**Lemma 1.** *Let $p = D_p + \sqrt{N}$ and $q = \sqrt{N} - D_q$. If $\frac{h_t}{k_t}$ is a fraction from the convergent list of the continued fraction of $\frac{1}{\sqrt{N}}$ with $h_t < D_p - D_q$, then $\sqrt{N} + \frac{h_t}{2} < \frac{p+q}{2}$.*

**Proof.** Suppose there exists $\frac{h_t}{k_t}$ from the convergent list of the continued fraction of $\frac{1}{\sqrt{N}}$. If $h_t < D_p - D_q$, then substituting Equation (2) into $\sqrt{N} + \frac{h_t}{2}$, it can be rewritten as follows.

$$\begin{aligned} \sqrt{N} + \frac{h\_t}{2} &= \frac{2\sqrt{N} + h\_t}{2} \\ &< \frac{2\sqrt{N} + D\_p - D\_q}{2} \\ &= \frac{p + q}{2} \end{aligned}$$

**Observation 1.** *Consider Lemma 1. As $\sqrt{N} + \frac{h_t}{2}$ is always smaller than $\frac{p+q}{2}$, $\frac{h_t}{2}$ via EPF can be used as the additional extension $\lambda$. Furthermore, we can set $x = \lceil\sqrt{N} + \frac{h_t}{2}\rceil$ to serve as the (improved) initial value of $x$ in the FFA1 algorithm.*

In this study, we discover two types of possible convergent selections on the list. The first is Type 1: $\frac{h_t}{k_t}$ from the convergent list of $\frac{1}{\sqrt{N}}$, where $k_t \lesssim N$. For Type 1, Wu et al. [1] mentioned that $\frac{h_t}{k_t}$ with index $t$ can be an indicator for selecting a good candidate for the initial value.

$$\left[ \frac{h\_1}{k\_1}, \dots, \frac{h\_t}{k\_t}, \frac{h\_{t+1}}{k\_{t+1}}, \dots, \frac{h\_i}{k\_i} \right] \tag{3}$$

Next is Type 2: $\frac{h'_i}{k'_i}$, the last convergent on the list, which will be elaborated further. In this case, $h'_i$ is selected as the additional extension for the potential initial value.

$$\left[\frac{h\_1}{k\_1}, \frac{h\_2}{k\_2}, \dots, \frac{h\_i'}{k\_i'}\right] \tag{4}$$

The behaviour of Type 2 convergent selection is analysed via an experiment on 50 distinct balanced-prime moduli $N$. Figure 3 shows that there are three possible positions of the potential initial values with additional extension that are close to $\frac{p+q}{2}$, as follows.

**Figure 3.** Three possible situations of Type 2 convergent selection.

**Observation 2.** *Figure 3 shows that Situation 1 happens when the value of $\frac{p+q}{2}$ is larger than the two initial values, Situation 2 is when $\frac{p+q}{2}$ lies in the middle of the potential initial values, and Situation 3 is when the value of $\frac{p+q}{2}$ is smaller than the potential initial values.*

The experimental analysis of the 50 distinct moduli $N$ indicates that 33 moduli fall under Type 2 convergent selection, and 42.42% of them have potential initial values with additional extensions $h'_{i-3}$ and $h'_{i-4}$ larger than $\frac{p+q}{2}$ (i.e., Situation 3), while the rest fall under Situation 1 or Situation 2. This experimental analysis suggests that the position of those potential initial values close to $\frac{p+q}{2}$ is undecidable. This study aims to provide solutions that cover unbalanced and balanced primes for all three situations. The in-depth analysis of Type 2 is discussed in Section 4.3.

#### **4. Results and Discussion**

Based on the observations in Section 3, an improvement for FFA1 that increases the effectiveness of the algorithm in factoring a modulus $N$ is sought. The aim focuses on three important parts: (i) using EPF's technique, (ii) establishing potential initial values for FFA1, and (iii) altering the original FFA1 algorithm to suit the said potential initial values. Thus, we introduce mFFA1-EPF to improve the effectiveness of FFA1.

#### *4.1. mFFA1-EPF: Unbalanced Prime*

This section is dedicated to the modulus $N$ with unbalanced prime factors. From Observation 1, we set $x = \lceil\sqrt{N} + \frac{h_t}{2}\rceil$ as the improved initial value in the FFA1 algorithm, which is less than $\frac{p+q}{2}$. Let us start with the following question.

**Question 1.** *Are there any other potential initial values, aside from $x = \lceil\sqrt{N} + \frac{h_t}{2}\rceil$, that can be selected to shorten the distance toward $\frac{p+q}{2}$?*

**Answer.** Interestingly, if we can find other candidates for initial values that potentially reduce the distance $d_{new}$, they can be useful for reducing the loop count of the search for the value of $\frac{p+q}{2}$. Suppose the value $\sqrt{N} + h_t$ is considered as another candidate of potential initial value (i.e., another value of $\lambda$). Then, such a value is supposed to be close to $\frac{p+q}{2}$. However, in general, its position is undecided: $\sqrt{N} + h_t$ may be larger than $\frac{p+q}{2}$ (as shown in Figure 4) or smaller (i.e., similar to Lemma 1).


**Figure 4.** The position of $\sqrt{N} + h_t > \frac{p+q}{2}$.

Next, suppose there exists $h_{t-1}$ such that $\frac{h_{t-1}}{k_{t-1}}$ is from the convergent list of $\frac{1}{\sqrt{N}}$ with $k_{t-1} < k_t \lesssim N$. By empirical evidence, the value $\sqrt{N} + h_{t-1} < \frac{p+q}{2}$ can be considered another candidate of additional extension for FFA1. The empirical evidence also shows that the relative positions of $\sqrt{N} + \frac{h_t}{2}$ and $\sqrt{N} + h_{t-1}$ are unpredictable, as illustrated in Figure 5a,b.

(**a**) The position with $\sqrt{N} + \frac{h_t}{2} < \sqrt{N} + h_{t-1}$.

(**b**) The position with $\sqrt{N} + \frac{h_t}{2} > \sqrt{N} + h_{t-1}$.

**Figure 5.** The possible positions of $\sqrt{N} + \frac{h_t}{2}$, $\sqrt{N} + h_{t-1}$, $\sqrt{N} + h_t$, and $\frac{p+q}{2}$.

There are three candidates for the potential value of $\lambda$: $\frac{h_t}{2}$, $h_{t-1}$, and $h_t$. In answering Question 1, the algorithm starts by computing the first value, $a_1 = \max\left(\lceil\sqrt{N} + h_{t-1}\rceil, \lceil\sqrt{N} + \frac{h_t}{2}\rceil\right)$, while the second value is $a_2 = \lceil\sqrt{N} + h_t\rceil$. Once $a_1$ and $a_2$ are established, two values, $y_1 = \sqrt{a_1^2 - N}$ and $y_2 = \sqrt{a_2^2 - N}$, are computed.

After establishing the values of the $a$'s and $y$'s, three procedures run simultaneously: Procedure 1 increases $a_1$ by 1, Procedure 2 increases $a_2$ by 1, and Procedure 3 decreases $a_2$ by 1, each recomputing its $y$ until that $y$ becomes an integer (see Algorithm 1).
**Remark 3.** *Note that the same value $a_2$ is applied in both Procedure 2 and Procedure 3. As the value $a_2$ might be larger than $\frac{p+q}{2}$, the main role of Procedure 3 is to prevent the mFFA1-EPF algorithm from running forever. All procedures are run by parallel computing, which means the algorithm stops completely whenever one of the procedures outputs $y_1$ or $y_2$ as an integer. Eventually, $p$ and $q$ are obtained.*

The unbalanced-prime case is demonstrated in Algorithm 1 below and as a flowchart in Figure A1 in Appendix B.1.

**Algorithm 1:** mFFA1-EPF: Unbalanced Prime

**Input:** Modulus $N$. **Output:** The primes $p$ and $q$.

1. Compute the continued fraction of $\frac{1}{\sqrt{N}}$.
2. Select $\frac{h_{t-1}}{k_{t-1}}$ and $\frac{h_t}{k_t}$, which are convergents of $\frac{1}{\sqrt{N}}$, where $k_{t-1} < k_t < N$.
3. Compute $a_1 = \max\left(\lceil\sqrt{N} + h_{t-1}\rceil, \lceil\sqrt{N} + \frac{h_t}{2}\rceil\right)$ and $a_2 = \lceil\sqrt{N} + h_t\rceil$.
4. Compute $y_1 = \sqrt{a_1^2 - N}$ and $y_2 = \sqrt{a_2^2 - N}$.
5. Do in parallel:
6. *Procedure 1*: **while** $y_1 \neq$ integer **do**: $a_1 \leftarrow a_1 + 1$; $y_1 = \sqrt{a_1^2 - N}$ **end while**; $p = a_1 + y_1$ and $q = a_1 - y_1$.
7. *Procedure 2*: **while** $y_2 \neq$ integer **do**: $a_2 \leftarrow a_2 + 1$; $y_2 = \sqrt{a_2^2 - N}$ **end while**; $p = a_2 + y_2$ and $q = a_2 - y_2$.
8. *Procedure 3*: **while** $y_2 \neq$ integer **do**: $a_2 \leftarrow a_2 - 1$; $y_2 = \sqrt{a_2^2 - N}$ **end while**; $p = a_2 + y_2$ and $q = a_2 - y_2$.
9. **return** $(p, q)$
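The following Python sketch simulates Algorithm 1 sequentially: the paper runs the three procedures in parallel, whereas here one step of each procedure is interleaved per iteration, which preserves the stopping behaviour for illustration (the helper names are ours):

```python
import math

def is_square(n):
    if n < 0:
        return False
    r = math.isqrt(n)
    return r * r == n

def mffa1_epf_unbalanced(N, h_prev, h_t):
    s = math.isqrt(N)                            # floor(sqrt(N)), N non-square
    # a1 = max(ceil(sqrt(N) + h_{t-1}), ceil(sqrt(N) + h_t/2)),
    # where ceil(sqrt(N) + h_t/2) = (floor(2*sqrt(N)) + h_t) // 2 + 1
    a1 = max(s + h_prev + 1, (math.isqrt(4 * N) + h_t) // 2 + 1)
    a2_up = a2_down = s + h_t + 1                # a2 = ceil(sqrt(N) + h_t)
    while True:
        for a in (a1, a2_up, a2_down):           # Procedures 1, 2, 3
            if is_square(a * a - N):
                y = math.isqrt(a * a - N)
                return a + y, a - y              # p = a + y, q = a - y
        a1 += 1                                  # Procedure 1: increase a1
        a2_up += 1                               # Procedure 2: increase a2
        a2_down -= 1                             # Procedure 3: decrease a2

# Example 1 below: N = 707,896,463 with h_{t-1} = 34, h_t = 139
print(mffa1_epf_unbalanced(707896463, 34, 139))  # (50359, 14057)
```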

**Remark 4.** *For the Type 2 case, Step 2 of Algorithm 1 is changed by selecting $\frac{h'_i}{k'_i}$ and $\frac{h'_{i-1}}{k'_{i-1}}$. Besides that, Step 3 is modified to $a_1 = \max\left(\lceil\sqrt{N} + h'_{i-1}\rceil, \lceil\sqrt{N} + \frac{h'_i}{2}\rceil\right)$ and $a_2 = \lceil\sqrt{N} + h'_i\rceil$.*

Examples 1–4 illustrate mFFA1-EPF for unbalanced primes. Example 1 demonstrates Type 1 convergent selection, while Example 2 demonstrates Type 2. Example 3 shows the importance of $a_2$ for this algorithm, and Example 4 revisits an earlier example from Wu et al. [1].

**Example 1.** *Let N = 707,896,463. By the continued fraction method, the following convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots, \frac{34}{904{,}615}, \frac{139}{3{,}698{,}279}, \frac{33{,}811}{899{,}586{,}412}, \dots\right]$$

*We select $\frac{139}{3{,}698{,}279}$ as the candidate $\frac{h_t}{k_t}$ and $\frac{34}{904{,}615}$ as the candidate $\frac{h_{t-1}}{k_{t-1}}$, since $k_{t-1} < k_t \lesssim N$. Now, the potential initial values are computed as follows:*


*1. $a_1 = \max\left(\lceil\sqrt{N} + h_{t-1}\rceil, \lceil\sqrt{N} + \frac{h_t}{2}\rceil\right) = 26{,}676$*
*2. $a_2 = \lceil\sqrt{N} + h_t\rceil = 26{,}746$*

*With $a_1 = 26{,}676$ and $a_2 = 26{,}746$, Algorithm 1 is performed. The algorithm stops when Procedure 2 completes the search in Algorithm 1, where $y_2$ becomes an integer (final $a_2 = 32{,}208$, $y_2 = 18{,}151$). Finally, $p = a_2 + y_2 = 50{,}359$ and $q = a_2 - y_2 = 14{,}057$ are computed.*

**Example 2.** *Let N = 7,665,365,527,725,431. By the continued fraction method, the following convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots, \frac{1273}{111{,}453{,}789{,}228}, \frac{1554}{136{,}055{,}921{,}807}, \frac{2827}{247{,}509{,}711{,}035}\right]$$

*$\frac{2827}{247{,}509{,}711{,}035}$ is selected as $\frac{h'_i}{k'_i}$ (the last convergent on the list) and $\frac{1554}{136{,}055{,}921{,}807}$ as $\frac{h'_{i-1}}{k'_{i-1}}$, as $k'_i < N$. The potential initial values are computed as follows:*

*1. $a_1 = \max\left(\lceil\sqrt{N} + h'_{i-1}\rceil, \lceil\sqrt{N} + \frac{h'_i}{2}\rceil\right) = 87{,}553{,}628$*
*2. $a_2 = \lceil\sqrt{N} + h'_i\rceil = 87{,}554{,}901$*

*With $a_1 = 87{,}553{,}628$ and $a_2 = 87{,}554{,}901$, Algorithm 1 is performed. The algorithm stops when Procedure 2 completes the search in Algorithm 1, where $y_2$ becomes an integer (final $a_2 = 96{,}393{,}384$, $y_2 = 40{,}327{,}645$). Last, $p = a_2 + y_2 = 136{,}721{,}029$ and $q = a_2 - y_2 = 56{,}065{,}739$ are computed.*

**Example 3.** *Suppose N = 2,927,489,533. By the continued fraction method, the following convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots, \frac{5329}{288{,}332{,}366}, \frac{7097}{383{,}992{,}269}, \frac{12{,}426}{672{,}324{,}635}\right]$$

*$\frac{12{,}426}{672{,}324{,}635}$ is selected as $\frac{h'_i}{k'_i}$ and $\frac{7097}{383{,}992{,}269}$ as $\frac{h'_{i-1}}{k'_{i-1}}$, as $k'_i < N$. Now, the potential initial values are computed as follows:*

*1. $a_1 = \max\left(\lceil\sqrt{N} + h'_{i-1}\rceil, \lceil\sqrt{N} + \frac{h'_i}{2}\rceil\right) = 61{,}204$*
*2. $a_2 = \lceil\sqrt{N} + h'_i\rceil = 66{,}533$*

*With $a_1 = 61{,}204$ and $a_2 = 66{,}533$, Algorithm 1 is performed. The algorithm stops when Procedure 3 completes the search in Algorithm 1, where $y_2$ becomes an integer (final $a_2 = 65{,}903$, $y_2 = 37{,}626$). Last, $p = a_2 + y_2 = 103{,}529$ and $q = a_2 - y_2 = 28{,}277$ are computed.*

**Example 4.** *Suppose N = 1,783,647,329, adapted from the numerical example of Wu et al. [1]. By the continued fraction method, the following convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots, \frac{2758}{116{,}479{,}301}, \frac{10{,}205}{430{,}990{,}307}, \frac{12{,}963}{547{,}469{,}608}, \dots\right]$$

*$\frac{12{,}963}{547{,}469{,}608}$ is selected as the candidate $\frac{h_t}{k_t}$ and $\frac{10{,}205}{430{,}990{,}307}$ as the candidate $\frac{h_{t-1}}{k_{t-1}}$, since $k_{t-1} < k_t < N$. Two potential initial values are computed as follows:*

*1. $a_1 = \max\left(\lceil\sqrt{N} + h_{t-1}\rceil, \lceil\sqrt{N} + \frac{h_t}{2}\rceil\right) = 52{,}439$*
*2. $a_2 = \lceil\sqrt{N} + h_t\rceil = 55{,}197$*

*Algorithm 1 is performed and stops when Procedure 1 finds that $y_1$ is an integer (final $a_1 = 52{,}785$, $y_1 = 31{,}664$). Last, $p = a_1 + y_1 = 84{,}449$ and $q = a_1 - y_1 = 21{,}121$ are computed.*

#### *4.2. Discussion of Algorithm 1 (mFFA1-EPF: Unbalanced Primes)*

This section presents a comparative analysis between mFFA1-EPF and the previous techniques based on loop count and computational time. Note that all experimental results were obtained on a computer with a 2.1 GHz Intel® Core i3 processor and 4 GB of RAM.

According to Table 2, the loop count of Procedure 2 in Examples 1 and 2 is the shortest. This shows that mFFA1-EPF has the smallest loop count compared to the other methods. For Example 3, Procedure 3 has the shortest path (630); at the same time, this indicates that $\sqrt{N} + h_t$ is larger than $\frac{p+q}{2}$. Thus far, the results give a good experimental visualization of the factorisation of the modulus $N$ by our method. Example 4 is the same example as given in Wu et al. [1]; Procedure 1 obtained the smallest loop count (346) compared to the other methods. An entry of N/A (Not Applicable) means the corresponding procedure was stopped because the $y$ of another procedure became an integer first.

**Table 2.** The comparison of loop count between FFA1, FFA2, and FFA2-EPF and our proposed method for Examples 1–4.


The mFFA1-EPF is performed by parallel computing, which means Procedures 1–3 run simultaneously. We recorded the computational time for the three different procedures. In Table 3, mFFA1-EPF is faster than FFA1 across the four numerical examples. This is a slight improvement over FFA1 owing to the involvement of the additional extension. mFFA1-EPF is good in terms of loop count, running without failure, and computational time (compared to FFA1).

**Table 3.** The comparison of computational time in seconds (s) between FFA1, FFA2, and FFA2-EPF and mFFA1-EPF for Examples 1–4.


To make the case for mFFA1-EPF more convincing, numerical examples from Somsuk [6] are provided in Table 4.

**Table 4.** The comparison of loop count between FFA1, FFA2, FFA2-EPF, and FFA-Euler and our proposed method on Somsuk's examples (Tables 2 and 3 of [6]).


This comparison highlights the improvement made by mFFA1-EPF over the FFA-Euler method provided by Somsuk [6]. For the two examples from [6], the exhaustive search is improved with the shortest loop counts, and the loop counts from the potential initial values are shorter than FFA-Euler's. This shows that our method is compatible with all unbalanced primes.

#### *4.3. mFFA1-EPF: Balanced Prime*

Previously, in the case of a modulus with unbalanced primes, three candidates were determined for $\lambda$, and only the two potential initial values that could shorten $d_{new}$ were selected. In this section, we explore the case of a modulus with balanced primes. The aim is to determine the proper candidates from the convergent list (i.e., for mFFA1-EPF) for the potential initial values via a similar approach.

Recall that $\frac{h_t}{k_t}$ with index $t$, where $k_t \lesssim N$, is deemed the indicator for selecting a good convergent as the additional extension of the initial values for FFA2-EPF [1]. When the EPF technique is applied to the balanced-prime case of FFA1, empirical evidence shows that this indicator leads to an initial value ($\sqrt{N} + h_t$) relatively far away from the target value (it exceeds $\frac{p+q}{2}$). Therefore, we conjecture that the EPF method is not an effective method to factor the modulus $N$ with balanced primes. Furthermore, the result in Somsuk [5] agreed that EPF is only suitable for unbalanced primes: the convergent with index $t$ fails as a suitable index to improve the initial value.

Therefore, in this section, we provide strategies to address this drawback of factoring the modulus $N$ with balanced primes by imposing modifications on the mFFA1-EPF algorithm. The strategies involve convergent selection and modification of the potential initial values. To enhance the effectiveness of mFFA1-EPF, the additional extensions $h_t$ through $h_{t-5}$ are observed empirically to determine the smallest value of $d_{new}$. The result of the observation is presented in Figure 6, and the discussion follows.

Suppose $\frac{h_t}{k_t}$ with index $t$, where $k_t \lesssim N$, is chosen via EPF. Note that, for the modulus $N$ with balanced primes, the value $\sqrt{N} + h_t$ exceeds $\frac{p+q}{2}$. Therefore, the additional extensions from $h_t$ to $h_{t-5}$ are analysed. Interestingly, the additional extension $\sqrt{N} + h_{t-j}$ for $j = 0, 1, 2, 3, 4, 5$ can provide potential initial values, as it moves closer to the value of $\frac{p+q}{2}$. Figure 6a–f shows the comparison between the potential initial values with additional extension $\sqrt{N} + h_{t-j}$ for $j = 0, 1, 2, 3, 4, 5$ and $\frac{p+q}{2}$, respectively. The potential initial values decrease because the additional extension becomes smaller from $h_t$ to $h_{t-5}$ (i.e., $h_t > h_{t-1} > h_{t-2} > h_{t-3} > h_{t-4} > h_{t-5}$).

**Question 2.** *What are the suitable initial values that need to be implemented in mFFA1-EPF with balanced primes?*

**Answer.** Based on Figure 6, the line of initial values (represented by the blue dots) starts closer to the target value $\frac{p+q}{2}$ (represented by the red dots) as the additional extension changes. A hindrance to the development of this approach is that we cannot determine the smallest value of $d_{new}$ among the additional extensions $h_t$ to $h_{t-5}$. In other words, the "closeness" of the potential initial values with additional extension cannot be decided simply from the results of Figure 6, because the additional extension is effectively a random value arising from the generated convergent list of $\frac{1}{\sqrt{N}}$. It requires further analysis. A statistical analysis of 50 distinct moduli $N$ with balanced primes is conducted to determine the closeness of the potential initial values through indices $t$ to $t-5$, as follows.

In this work, a measurement called the Mahalanobis Distance (MD) is implemented. The MD is the distance between two points in multivariate space. According to Çakmakçı et al. [15], the MD measures the distance between a multidimensional point and a probability distribution. The smaller the value of the MD, the closer the mean of a candidate's potential initial values is to the mean of the target values.

(**a**) The comparison of $\sqrt{N} + h_t$ and $\frac{p+q}{2}$. (**b**) The comparison of $\sqrt{N} + h_{t-1}$ and $\frac{p+q}{2}$. (**c**) The comparison of $\sqrt{N} + h_{t-2}$ and $\frac{p+q}{2}$. (**d**) The comparison of $\sqrt{N} + h_{t-3}$ and $\frac{p+q}{2}$. (**e**) The comparison of $\sqrt{N} + h_{t-4}$ and $\frac{p+q}{2}$. (**f**) The comparison of $\sqrt{N} + h_{t-5}$ and $\frac{p+q}{2}$.

**Figure 6.** The comparison of values over 50 data points (distinct moduli $N$) between $\sqrt{N} + h_{t-j}$, $j = 0, \dots, 5$, and $\frac{p+q}{2}$.

In the one-dimensional case of mFFA1-EPF for balanced primes, the MD is used to calculate the normalized distance between the mean of each of $h_t$ to $h_{t-5}$ and the mean of the target value $\frac{p+q}{2}$. The measurement formula is

$$\text{MD} = \frac{|\mu\_{IV} - \mu\_{AV}|}{\sigma\_{IV+AV}} \tag{5}$$


where $\mu_{IV}$ is the mean of the data for each potential initial value $\sqrt{N} + h_t$ to $\sqrt{N} + h_{t-5}$, while $\mu_{AV}$ is the mean of the actual values $\frac{p+q}{2}$. The value $\sigma_{IV+AV}$ is calculated from the combined data of the potential initial values and the actual values $\frac{p+q}{2}$. For example, the MD between $\sqrt{N} + h_t$ and $\frac{p+q}{2}$ is

$$\mathrm{MD}_{(\sqrt{N} + h_t)} = \frac{\left|\mu_{IV(\sqrt{N} + h_t)} - \mu_{AV}\right|}{\sigma_{IV+AV}}$$

We calculate the MD for $\sqrt{N} + h_t$ to $\sqrt{N} + h_{t-5}$ using the same data of 50 moduli $N$ and report the MD values in Table 5. Table 5 shows the comparison of the MD index of "closeness" between the several potential initial values from $h_t$ to $h_{t-5}$ and $\frac{p+q}{2}$. Table 5 reports that the MD values of $\sqrt{N} + h_{t-3}$ and $\sqrt{N} + h_{t-4}$ are the smallest among the potential initial values, namely 0.0114 and 0.0116, respectively. This means that $\sqrt{N} + h_{t-3}$ and $\sqrt{N} + h_{t-4}$ are the most suitable candidates for the potential initial values, because they have the smallest $d_{new}$ on average with respect to the MD measurement.
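A small Python sketch of this one-dimensional MD computation per Equation (5) (the function name and the toy numbers are ours, purely for illustration; the paper's actual data are the 50 moduli behind Table 5):

```python
import statistics

def mahalanobis_1d(initial_values, actual_values):
    """|mean(IV) - mean(AV)| / stdev(IV + AV), as in Equation (5)."""
    mu_iv = statistics.mean(initial_values)
    mu_av = statistics.mean(actual_values)
    sigma = statistics.stdev(initial_values + actual_values)   # combined data
    return abs(mu_iv - mu_av) / sigma

# Toy stand-ins: sqrt(N) + h_{t-3} per modulus vs the targets (p+q)/2
iv = [785792, 52785, 61204]
av = [801204, 52785, 65903]
print(round(mahalanobis_1d(iv, av), 4))
```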

**Table 5.** The comparison of the closeness between the candidate potential initial values and $\frac{p+q}{2}$ by MD.


**Observation 3.** *The candidates $\sqrt{N} + h_{t-3}$ and $\sqrt{N} + h_{t-4}$ have the smallest values of MD. Therefore, it is highly suggested to select the convergents with indices $t-3$ and $t-4$ to improve the initial values.*

Based on Observation 3, two potential initial values are set as follows:

1. $b_1 = \sqrt{N} + h_{t-3}$
2. $b_2 = \sqrt{N} + h_{t-4}$

Recall that the FFA1 algorithm requires an initial value less than the target value, which keeps increasing by one (i.e., +1) until it reaches $\frac{p+q}{2}$. Therefore, in mFFA1-EPF, we use a variation technique, meaning the values of $b_1$ and $b_2$ are increased and decreased by 1 simultaneously. Next, the following values are established:

$$\begin{array}{ll} 1. & y\_1 = \sqrt{b\_1^2 - N} \\ 2. & y\_2 = \sqrt{b\_2^2 - N} \end{array}$$

Four procedures are introduced using the above values with the variation technique: Procedures 1 and 2 increase $b_1$ and $b_2$ by 1, respectively, while Procedures 3 and 4 decrease $b_1$ and $b_2$ by 1, respectively, each recomputing its $y$.

Note that these four procedures run simultaneously by parallel computing, stopping when one of the $y$'s first becomes an integer. Algorithm 2 shows the workflow.

**Algorithm 2:** mFFA1-EPF: Balanced Prime

**Input:** Modulus $N$. **Output:** The primes $p$ and $q$.

1. Compute the continued fraction of $\frac{1}{\sqrt{N}}$.
2. Select $\frac{h_{t-3}}{k_{t-3}}$ and $\frac{h_{t-4}}{k_{t-4}}$, which are convergents of $\frac{1}{\sqrt{N}}$, where $k_{t-4} < k_{t-3} < N$.
3. Compute $b_1 = \lceil\sqrt{N} + h_{t-3}\rceil$ and $b_2 = \lceil\sqrt{N} + h_{t-4}\rceil$.
4. Compute $y_1 = \sqrt{b_1^2 - N}$ and $y_2 = \sqrt{b_2^2 - N}$.
5. Do in parallel:
6. *Procedure 1*: **while** $y_1 \neq$ integer **do**: $b_1 \leftarrow b_1 + 1$; $y_1 = \sqrt{b_1^2 - N}$ **end while**; $p = b_1 + y_1$ and $q = b_1 - y_1$.
7. *Procedure 2*: **while** $y_2 \neq$ integer **do**: $b_2 \leftarrow b_2 + 1$; $y_2 = \sqrt{b_2^2 - N}$ **end while**; $p = b_2 + y_2$ and $q = b_2 - y_2$.
8. *Procedure 3*: **while** $y_1 \neq$ integer **do**: $b_1 \leftarrow b_1 - 1$; $y_1 = \sqrt{b_1^2 - N}$ **end while**; $p = b_1 + y_1$ and $q = b_1 - y_1$.
9. *Procedure 4*: **while** $y_2 \neq$ integer **do**: $b_2 \leftarrow b_2 - 1$; $y_2 = \sqrt{b_2^2 - N}$ **end while**; $p = b_2 + y_2$ and $q = b_2 - y_2$.
10. **return** $(p, q)$
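A sequential Python sketch of Algorithm 2, in the same style as the Algorithm 1 sketch above (parallelism is again simulated by interleaving one step of each procedure; the names are ours):

```python
import math

def is_square(n):
    if n < 0:
        return False
    r = math.isqrt(n)
    return r * r == n

def mffa1_epf_balanced(N, h_t3, h_t4):
    s = math.isqrt(N)
    b1_up = b1_down = s + h_t3 + 1      # b1 = ceil(sqrt(N) + h_{t-3})
    b2_up = b2_down = s + h_t4 + 1      # b2 = ceil(sqrt(N) + h_{t-4})
    while True:
        for b in (b1_up, b2_up, b1_down, b2_down):   # Procedures 1-4
            if is_square(b * b - N):
                y = math.isqrt(b * b - N)
                return b + y, b - y
        b1_up += 1; b2_up += 1          # Procedures 1 and 2 increase
        b1_down -= 1; b2_down -= 1      # Procedures 3 and 4 decrease

# Example 6 (Somsuk [6]): N = 340,213 with h_{t-3} = 3, h_{t-4} = 1
print(mffa1_epf_balanced(340213, 3, 1))  # (653, 521)
```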

#### *4.4. Discussion on Algorithm 2 (mFFA1-EPF: Balanced Primes)*

Algorithm 2 is also illustrated as a flowchart in Figure A2 in Appendix B.2. The experimental results of mFFA1-EPF for balanced primes are presented in Example 5, while the application of mFFA1-EPF to Somsuk's numerical example [6] is presented in Example 6.

**Example 5.** *(Procedure 1 succeeds in Example 5). Let N = 616,696,115,591. By the continued fraction method, the following convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots, \frac{61}{47{,}903{,}301}, \frac{123}{96{,}591{,}902}, \frac{184}{144{,}495{,}203}, \frac{491}{385{,}582{,}308}, \dots\right]$$

*$\frac{491}{385{,}582{,}308}$ is selected as the candidate $\frac{h_{t-3}}{k_{t-3}}$ and $\frac{184}{144{,}495{,}203}$ as the candidate $\frac{h_{t-4}}{k_{t-4}}$, since $k_{t-4} < k_{t-3} < k_t \lesssim N$. Now, the two candidate potential initial values and the two $y$'s are computed as follows:*

*1. $b_1 = \lceil\sqrt{N} + h_{t-3}\rceil = 785{,}792$*
*2. $b_2 = \lceil\sqrt{N} + h_{t-4}\rceil = 785{,}485$*


*Algorithm 2 is performed because $y_1$ and $y_2$ are not integers. The values $b_1$ and $b_2$ in Procedures 1 and 2 are increased by 1, while the initial values in Procedures 3 and 4 are decreased by 1. The algorithm stops when $y_1$ from Procedure 1 is an integer (final $b_1 = 801{,}204$, $y_1 = 158{,}845$). Finally, compute $p = b_1 + y_1 = 960{,}049$ and $q = b_1 - y_1 = 642{,}359$.*

**Example 6.** *(Procedure 1 succeeds in Example 6; adapted from [6]). Say N = 340,213. By the continued fraction method, a convergent list of $\frac{1}{\sqrt{N}}$ is created*

$$\left[\dots , \dots , \frac{1}{583} , \frac{3}{1750} , \frac{4}{2333} , \frac{7}{4083} , \dots \right]$$

*$\frac{3}{1750}$ is selected as the candidate $\frac{h_{t-3}}{k_{t-3}}$ and $\frac{1}{583}$ as the candidate $\frac{h_{t-4}}{k_{t-4}}$, since $k_{t-4} < k_{t-3} < k_t < N$. Now, the two candidate potential initial values and the two $y$'s are computed as follows:*

*1. $b_1 = \lceil\sqrt{N} + h_{t-3}\rceil = 587$*
*2. $b_2 = \lceil\sqrt{N} + h_{t-4}\rceil = 585$*

*Algorithm 2 is performed because $y_1$ and $y_2$ are not integers. The values $b_1$ and $b_2$ in Procedures 1 and 2 are increased by 1, while the initial values in Procedures 3 and 4 are decreased by 1. The algorithm stops when $y_1$ from Procedure 1 is an integer ($b_1 = 587$, $y_1 = 66$). We compute $p = b_1 + y_1 = 653$ and $q = b_1 - y_1 = 521$.*

Table 6 shows the comparison of loop count and computational time in seconds (s) between several FFAs and our proposed method for Example 5. The loop count of Procedure 1 (15,412) is the smallest compared to the previous methods. Besides, FFA2-EPF cannot undergo the process, and its loop count is not available, since the initial values exceed the values of $p + q$ and $p - q$. In terms of computational time, the algorithm is not the fastest, but it still improves on FFA1.

**Table 6.** The comparison of loop count and computational time in seconds (s) between several FFAs and our proposed method for Example 5.


In Table 7, mFFA1-EPF is applied to [6] to compare the loop count and computational time. It shows the shortest loop count, as the initial value is exactly the value of $\frac{p+q}{2}$ ($\frac{p+q}{2} = \lceil\sqrt{N} + h_{t-3}\rceil = 587$). Using Algorithm 2, the loop count is reduced significantly, which allows the exhaustive search to run without failure.

**Table 7.** The comparison of loop count and computational time in seconds (s) between several FFAs and our proposed method for Example 6 [6].


**Remark 5.** *Consider the Type 2 continued fraction convergent selection for the modulus $N$ with balanced primes. For Type 2, we use Algorithm 2 with a change in Step 2, selecting $h'_{i-3}$ and $h'_{i-4}$ with $k'_i < N$.*

Now, we replicate a numerical example of Algorithm 2 with respect to Remark 5 in Example 7.

**Example 7.** *(Procedure 4 succeeds in Example 7). Suppose N = 9,355,908,869. By the continued fraction method, a convergent list of $\frac{1}{\sqrt{N}}$ is created, ending in the last convergent $\frac{h'_i}{k'_i}$.*

*As $\frac{h'_i}{k'_i}$ is the last convergent on the list, $\frac{2009}{194{,}322{,}428}$ is selected as the candidate $\frac{h'_{i-3}}{k'_{i-3}}$ and $\frac{1611}{155{,}825{,}501}$ as the candidate $\frac{h'_{i-4}}{k'_{i-4}}$, as $k'_{i-4} < k'_{i-3} < k'_i < N$. We compute the two candidate initial values and $y_1$, $y_2$ as follows:*

*1. $b_1 = \lceil\sqrt{N} + h'_{i-3}\rceil = 98{,}735$*
*2. $b_2 = \lceil\sqrt{N} + h'_{i-4}\rceil = 98{,}337$*

*Since $y_1$ and $y_2$ are not integers, the values $b_1$ and $b_2$ in Procedures 1 and 2 are increased by 1, while the initial values in Procedures 3 and 4 are decreased by 1. The algorithm stops when $y_2$ in Procedure 4 is an integer (final $b_2 = 97{,}245$, $y_2 = 10{,}034$). We compute $p = b_2 + y_2 = 107{,}279$ and $q = b_2 - y_2 = 87{,}211$.*

Table 8 shows the comparison of loop count and computational time in seconds between several FFAs and our proposed method for Example 7. Procedure 4 shows the smallest loop count (1092) compared to FFA1 (1590) and FFA2 (10,553). The loop count of FFA2-EPF is unavailable, as its initial values exceed the values of $p + q$ and $p - q$. In terms of computational time, our algorithm is slightly better than FFA1. Procedure 4 plays the crucial part in reaching the value $\frac{p+q}{2}$ and, at the same time, indicates whether the initial value is larger or smaller than $\frac{p+q}{2}$. Therefore, Algorithm 2 with Remark 5 helps to search for the value of $\frac{p+q}{2}$ without failure.

**Table 8.** The comparison of loop count and computational time in seconds (s) between several FFAs and our proposed method for Example 7.


Recall that $d_{new} = \left|\frac{p+q}{2} - (\sqrt{N} + \lambda)\right|$. For mFFA1-EPF, $\lambda$ varies according to the type of modulus $N$: $\lambda = h_t$ and $\lambda = h_{t-1}$ for a modulus with unbalanced primes, while $\lambda = h_{t-3}$ and $\lambda = h_{t-4}$ for a modulus with balanced primes. Multiple $\lambda$'s can lead to the shortest path toward $\frac{p+q}{2}$. Comparing mFFA1-EPF with FFA1, both methods use the same process of calculating square roots to reach the target value $\frac{p+q}{2}$. However, mFFA1-EPF uses the additional extension on its potential initial values, so that $d_{new} < d_0$, where $d_0$ corresponds to the loop count of FFA1. Based on the empirical results in this work, the loop count and computational time of mFFA1-EPF are improved compared to FFA1. Consequently, it reduces the cost of running the exhaustive search.

The uniqueness of FFA2 is that it uses the multiplication operation as its main process, so it has a lower computational-time cost per iteration than mFFA1-EPF, which uses the square root operation. However, FFA2 requires a greater number of iterations to reach $p + q$ and $p - q$. In this regard, mFFA1-EPF incurs less cost in terms of computational memory and less space to run the iterations compared to FFA2.

The objective of establishing an additional extension in FFA2-EPF is the same as in mFFA1-EPF: to shorten the path toward $p + q$ and $p - q$. mFFA1-EPF has a shorter loop count than FFA2-EPF because the latter's main operation comes from FFA2, which uses a huge number of iterations to reach its target values $p + q$ and $p - q$. Thus, mFFA1-EPF requires less cost in terms of space compared to FFA2-EPF.

#### **5. Conclusions and Future Works**

In this study, we discovered two types of convergent list selections. The first is Type 1: $\frac{h_t}{k_t}$ is selected from the convergent list of $\frac{1}{\sqrt{N}}$, where $k_t \lesssim N$. For Type 1, Wu et al. [1] mentioned that $\frac{h_t}{k_t}$ can be an indicator of the convergent with index $t$ for selecting a good candidate for the initial value. Next is Type 2: $\frac{h'_i}{k'_i}$, where $\frac{h'_i}{k'_i}$ is the last convergent on the list (as in Equation (4)) and $h'_i$ is selected as the additional extension for the potential initial value. This paper proposed two improved factoring algorithms, called mFFA1-EPF, for unbalanced primes and for balanced primes, respectively. The general idea in designing the algorithms is a modification made to EPF that is then implemented in (improved) FFA1. The resulting study shows a significant improvement that reduces the loop count of FFA1 via (improved) EPF compared to previous methods (FFA1, FFA2, FFA2-EPF, and FFA-Euler).

An interesting limitation of our work is that the computational time of mFFA1-EPF is still far from efficient for factoring a modulus with a 1024-bit size of balanced primes with current technology. We foresee that mFFA1-EPF might be useful once a large quantum computer with stable qubits is available. mFFA1-EPF is a type of searching algorithm; thus, it might take advantage of fine adjustments or manipulation of the mathematical structure within Grover's quantum search algorithm [16]. mFFA1-EPF can also be used in low-power machine architectures, such as Internet of Things-based devices, which require factoring small composite integers [1,9]. Furthermore, we expect mFFA1-EPF to be an assistive tool that increases the effort of machine learning and artificial intelligence approaches, such as [17], which have been introduced in the literature to deal with problems similar to ours.

**Author Contributions:** Conceptualization, M.A.A.; Formal Analysis, R.R.M.T.; Funding Acquisition, M.A.A.; Investigation, R.R.M.T. and Z.M.; Methodology, R.R.M.T.; Project Administration, M.A.A. and M.R.K.A.; Software, Z.M.; Supervision, M.A.A. and M.R.K.A.; Validation, M.A.A., M.R.K.A. and Z.M.; Writing—Original Draft, R.R.M.T.; Writing—Review & Editing, R.R.M.T. and M.A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** The present research was supported by Universiti Putra Malaysia under Putra Grant—IPM with project number GP-IPM/2017/9519200.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. The Proof of the Estimated Prime Factor (EPF)**

We discuss the original study by Wu et al. [1,9]. The distances of $p$ and $q$ from $\sqrt{N}$ are written as

$$D\_p = p - \sqrt{N} \tag{A1}$$

$$D\_q = \sqrt{N} - q \tag{A2}$$

Rearranging (A1) and (A2) gives

$$p = D\_p + \sqrt{N} \tag{A3}$$

$$q = \sqrt{N} - D\_q \tag{A4}$$

Note that $N = pq$. We substitute (A3) for $p$ and (A4) for $q$, which yields

$$\begin{split} N = pq &= (\sqrt{N} + D\_p)(\sqrt{N} - D\_q) \\ &= N + \sqrt{N}(D\_p - D\_q) - D\_p D\_q \end{split} \tag{A5}$$

*N* is eliminated on both sides, which generates Equation (A6) and leads to Equation (A7).

$$N = N + \sqrt{N}(D\_p - D\_q) - D\_p D\_q$$

$$N - N = \sqrt{N}(D\_p - D\_q) - D\_p D\_q$$

$$D\_p D\_q = \sqrt{N}(D\_p - D\_q) \tag{A6}$$

$$\frac{1}{\sqrt{N}} = \frac{D\_p - D\_q}{D\_p D\_q} \tag{A7}$$

We do not have any information about the values of $D_p - D_q$ and $D_p D_q$. However, from Equation (A7), $\frac{1}{\sqrt{N}}$ can be used to obtain $D_p - D_q$ and $D_p D_q$, since *N* is publicly known. Now, a convergent list of $\frac{1}{\sqrt{N}}$ is produced by its continued fraction expansion.

Suppose there is a convergent list of $\frac{1}{\sqrt{N}}$; denote its convergents by $\frac{h_i}{k_i}$, with $h_i, k_i \in \mathbb{Z}$ and *i* the index of the convergent produced. From the theory of continued fractions, we know that $\frac{h_i}{k_i} \to \frac{1}{\sqrt{N}}$ as $i \to \infty$. As the sizes of $h_i$ and $k_i$ increase with *i*, there exists *t* such that

$$h\_t < D\_p - D\_q < h\_{t+1} \tag{A8}$$

We use $h_t$ and $k_t$ as the corresponding estimates of $D_p - D_q$ and $D_p D_q$, that is,

$$\begin{aligned} h\_t &\approx D\_p - D\_q \\ k\_t &\approx D\_p D\_q \end{aligned}$$

The convergent with index *t* is the selected fraction for improving the FFA, as it gives the advantage of shortening the exhaustive search on *p* + *q* and *p* − *q*.
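For concreteness, the convergent list of $\frac{1}{\sqrt{N}}$ can be generated with exact integer arithmetic. Since $\sqrt{N} = [a_0; a_1, a_2, \ldots]$ implies $\frac{1}{\sqrt{N}} = [0; a_0, a_1, a_2, \ldots]$, the standard expansion of $\sqrt{N}$ suffices. The following is a minimal Python sketch of this procedure; the toy modulus and all function names are ours, for illustration only:

```python
from math import isqrt

def sqrt_cf_terms(N, count):
    """First `count` partial quotients of the continued fraction of sqrt(N),
    for non-square N, via the classic (m, d, a) recurrence."""
    a0 = isqrt(N)
    m, d, a = 0, 1, a0
    terms = [a0]
    while len(terms) < count:
        m = d * a - m
        d = (N - m * m) // d
        a = (a0 + m) // d
        terms.append(a)
    return terms

def convergents(terms):
    """Yield the convergents h_i / k_i built from the partial quotients."""
    h_prev, k_prev = 1, 0
    h, k = terms[0], 1
    yield h, k
    for a in terms[1:]:
        h, h_prev = a * h + h_prev, h
        k, k_prev = a * k + k_prev, k
        yield h, k

# Toy modulus N = p * q with small primes p = 2003, q = 1009.
N = 2003 * 1009
terms = [0] + sqrt_cf_terms(N, 15)   # CF of 1/sqrt(N) prepends a zero term
for t, (h, k) in enumerate(convergents(terms)):
    print(t, h, k)                   # candidate indicators h_t / k_t
```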

#### **Appendix B. Flowchart mFFA1-EPF**

*Appendix B.1. mFFA1-EPF on Unbalanced Prime*

**Figure A1.** The flowchart of mFFA1-EPF on unbalanced prime.

*Appendix B.2. mFFA1-EPF on Balanced Prime*

**Figure A2.** The flowchart of mFFA1-EPF on balanced prime.

#### **References**


## *Article* **Improving Multivariate Microaggregation through Hamiltonian Paths and Optimal Univariate Microaggregation**

**Armando Maya-López <sup>1</sup> , Fran Casino <sup>2</sup> and Agusti Solanas 1,\***


**Abstract:** The collection of personal data is exponentially growing and, as a result, individual privacy is endangered accordingly. With the aim to lessen privacy risks whilst maintaining high degrees of data utility, a variety of techniques have been proposed, microaggregation being a very popular one. Microaggregation is a family of perturbation methods whose principle is to aggregate personal data records (i.e., microdata) in groups so as to preserve privacy through *k*-anonymity. The multivariate microaggregation problem is known to be NP-Hard; however, its univariate version can be solved optimally in polynomial time using the Hansen-Mukherjee (HM) algorithm. In this article, we propose a heuristic solution to the multivariate microaggregation problem inspired by the Traveling Salesman Problem (TSP) and the optimal univariate microaggregation solution. Given a multivariate dataset, first, we apply a TSP-tour construction heuristic to generate a Hamiltonian path through all dataset records. Next, we use the order provided by this Hamiltonian path (i.e., a given permutation of the records) as input to the Hansen-Mukherjee algorithm, virtually transforming it into a multivariate microaggregation solver we call Multivariate Hansen-Mukherjee (MHM). Our intuition is that good solutions to the TSP would yield Hamiltonian paths allowing the Hansen-Mukherjee algorithm to find good solutions to the multivariate microaggregation problem. We have tested our method with well-known benchmark datasets. Moreover, with the aim to show the usefulness of our approach to protecting location privacy, we have tested our solution with real-life trajectories datasets, too. We have compared the results of our algorithm with those of the best performing solutions, and we show that our proposal reduces the information loss resulting from the microaggregation. Overall, results suggest that transforming the multivariate microaggregation problem into its univariate counterpart by ordering microdata records with a proper Hamiltonian path and applying an optimal univariate solution leads to a reduction of the perturbation error whilst keeping the same privacy guarantees.

**Keywords:** microaggregation; statistical disclosure control; graph theory; traveling salesman problem; data privacy; location privacy

#### **1. Introduction**

Knowledge retrieval and data processing are catalysts for innovation. The continuous advances in information and communication technologies (ICT) and the efficient processing of data allow the extraction of new knowledge by discovering non-obvious patterns and correlations in the data. Nevertheless, such knowledge extraction procedures may threaten individuals' privacy if the proper measures are not implemented to protect it [1–3]. For instance, an attacker may use publicly available datasets to obtain insights about individuals and extract knowledge by exploiting correlations that were not obvious from examining a single dataset [4]. Therefore, before disclosing any data, privacy protection procedures (e.g., anonymization, pseudonymization, aggregation, generalization) must be

**Citation:** Maya-López, A.; Casino, F.; Solanas, A. Improving Multivariate Microaggregation through Hamiltonian Paths and Optimal Univariate Microaggregation. *Symmetry* **2021**, *13*, 916. https:// doi.org/10.3390/sym13060916

Academic Editor: Egon Schulte

Received: 1 April 2021 Accepted: 14 May 2021 Published: 21 May 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

applied. A wide variety of privacy models and protection mechanisms have been proposed in the literature so as to guarantee anonymity (at different levels depending on the utilized model) when disclosing data [5]. Since most privacy protection methods are based on modifying/perturbing/deleting original data, their main drawback is that they negatively affect the utility of the data. Hence, there is a need for finding a proper trade-off between data utility and privacy.

One of the most well-known disciplines studying methods to protect individuals' private information is Statistical Disclosure Control (SDC [6]), which seeks to anonymize microdata sets (i.e., datasets consisting of multiple records corresponding to individual respondents) in a way that it is not possible to re-identify the respondent corresponding to any particular record in the published microdata set. Microaggregation [7], which perturbs microdata sets by aggregating the attributes' values of groups of *k* records so as to reduce re-identification risk by achieving *k*-anonymity, stands out among the most widely used families of SDC methods. It is usually applied by statistical agencies to limit the disclosure of sensitive microdata, and it has been used to protect data in a variety of fields, namely healthcare [8], smart cities [9], or collaborative filtering applications [10], to name a few.

Although the univariate microaggregation problem can be optimally solved in polynomial time, optimal multivariate microaggregation is an NP-hard problem [11]. Thus, finding a solution for the multivariate problem requires heuristic approaches that aim to minimize the amount of data distortion (often measured in terms of information loss), whilst guaranteeing a desired privacy level (typically determined by a parameter *k* that defines the cardinality of the aggregated groups).

#### *1.1. Contribution and Research Questions*

In this article, we propose a novel solution for the multivariate microaggregation problem, inspired by the heuristic solutions of the Traveling Salesman Problem (TSP) and the use of the optimal univariate microaggregation algorithm of Hansen and Mukherjee (HM) [12]. Given an ordered numerical vector, the HM algorithm creates the optimal *k*-partition (i.e., the optimal univariate microaggregation solution). Hence, our intuition is that, if we feed the HM algorithm with a good ordering of the records in a multivariate dataset, it would output a good *k*-partition of the multivariate dataset (although not necessarily optimal).

Ordering the records of a univariate dataset is trivial. However, ordering those records in a multivariate dataset, in which every record has *p* attributes, is not obvious since it is not apparent how to determine the precedence of an element over another. Thus, the primary question is:

*Q1: How to create this ordering when the records are in* R*<sup>p</sup>?*

We suggest that a possible order for the records in R*<sup>p</sup>* is determined by the Hamiltonian path resulting from solving the Traveling Salesman Problem, in which the goal is to find the path that travels through all elements of a set only once, whilst minimizing the total length of the path. Optimally solving the TSP is known to be NP-Hard, but very good heuristic solutions are available. Hence, our intuition is that good heuristic solutions of the TSP (i.e., those with shorter path lengths) would provide a Hamiltonian path that could be used as an ordered vector for the HM optimal univariate microaggregation algorithm, resulting in a good multivariate microaggregation solution.

The quality of a TSP solution is measured in terms of "path length", the shorter the length the better the solution. However, the quality of the microaggregation is measured in terms of information loss. Given a cardinality parameter *k* (which sets the minimum size of the aggregation clusters), the lower the information loss, the better the microaggregation. Hence, the next questions that we aim to answer are:

*Q2: Are the length of the Hamiltonian path and the information loss of the microaggregation related?, or Do shorter Hamiltonian paths lead to microaggregation solutions with lower information loss?*

and

*Q3: Is the length of the Hamiltonian path the only factor affecting information loss or does the particular construction of the path (regardless of the length) affect the information loss?*

Overall, the key question is:

*Q4: Does this approach provide better solutions (in terms of information loss) than the best performing microaggregation methods in the literature?*

In order to answer these questions, we have tested seven TSP solvers, combined with the HM algorithm (virtually applied in a multivariate manner, or Multivariate HM (MHM)). In particular, we have tested the "Concorde" heuristic which, to the best of our knowledge, has not been used for microaggregation before. In addition, we have tested well-known classic microaggregation methods (i.e., MDAV and V-MDAV) and an advanced refinement of the former (i.e., MDAV-LK-MHM).

With the aim to test all the aforementioned approaches on a variety of datasets, we have used three classical benchmarks (i.e., Census, Tarragona, and EIA) and three novel datasets containing trajectory data retrieved from public sources, which leads to our last research question:

*Q5: Do TSP-based microaggregation methods perform better than current solutions on trajectories datasets?*

#### *1.2. Plan of the Article*

The rest of the article aims to answer the research questions above, and it is organized as follows: Section 2 provides the reader with some fundamental knowledge on Statistical Disclosure Control and microaggregation. In addition, it introduces the basics of the Traveling Salesman Problem and an overview of the existing heuristics to solve it. Next, Section 3 analyzes related work and highlights the novelty of our proposal compared with the state of the art. Section 4 describes our proposal, which is later thoroughly tested and compared with well-known classical and state-of-the-art microaggregation methods in Section 5. Section 6 discusses the research questions and the main benefits of our proposal. The article concludes in Section 7 with some final remarks and comments on future research lines.

#### **2. Background**

#### *2.1. Statistical Disclosure Control and Microaggregation*

Statistical disclosure control (SDC) has the goal of preserving the statistical properties of datasets, whilst minimizing the privacy risks related to the disclosure of confidential information from individual respondents. Microaggregation is a family of SDC methods for microdata, which use data perturbation as a protection strategy.

Given an original data file *D* and a privacy parameter *k*, microaggregation can be defined as follows: Let us assume a microdata set *D* with *p* continuous numerical attributes and *n* records. Clusters (also referred to as groups or subsets in this context) of *D* are formed with $n_i$ records in the *i*-th cluster ($n_i \geq k$ and $n = \sum_{i=1}^{g} n_i$), where *g* is the number of resulting clusters and *k* is a cardinality constraint. Optimal microaggregation is defined as the one yielding a *k*-partition maximizing the within-clusters homogeneity. Optimal microaggregation requires heuristic approaches since it is an NP-hard problem [11] for multivariate data. Microaggregation heuristics can be classified into two main families:


- **Fixed-size heuristics**, which build clusters of exactly *k* records (except, at most, one cluster that absorbs the remaining records); and
- **Variable-size heuristics**, which allow clusters of size between *k* and 2*k* − 1 whenever their overall within-cluster homogeneity is better than that of the single larger cluster.

Therefore, a microaggregation process consists in constructing a *k*-partition of the dataset, that is, a set of disjoint clusters (in which the cardinality is between *k* and 2*k* − 1), and replacing each original data record by the centroid (i.e., the average vector) of the cluster to which it belongs, hence creating a *k*-anonymous dataset *D*′. With the aim to reduce the information loss caused by the aggregation, the clusters are created so that the records in each cluster are similar.

#### *2.2. Data Utility and Information Loss*

The sum of square error (SSE) is commonly used for measuring the homogeneity in each group. In terms of sums of squares, maximizing within-groups homogeneity is equivalent to finding a *k*-partition minimizing the within-groups sum of square error (SSE) [13] defined as:

$$SSE = \sum\_{i=1}^{g} \sum\_{j=1}^{n\_i} (\mathbf{x}\_{i,j} - \bar{\mathbf{x}}\_i)(\mathbf{x}\_{i,j} - \bar{\mathbf{x}}\_i)',\tag{1}$$

where $\mathbf{x}_{i,j}$ is the *j*-th record in group *i*, and $\bar{\mathbf{x}}_i$ is the average record of group *i*. The total sum of squares (SST), an upper bound on the partitioning information loss, can be computed as follows:

$$SST = \sum\_{i=1}^{n} (\mathbf{x}\_i - \bar{\mathbf{x}})(\mathbf{x}\_i - \bar{\mathbf{x}})',\tag{2}$$

where $\mathbf{x}_i$ is the *i*-th record in *D*, and $\bar{\mathbf{x}}$ is the average record of *D*. Note that all the above equations use vector notation, so $\mathbf{x}_i \in \mathbb{R}^p$.

The microaggregation problem consists in finding a *k*-partition with minimum SSE, that is, the set of disjoint subsets of *D* such that $D = \bigcup_{m=1}^{g} s_m$, where $s_m$ is the *m*-th subset and *g* is the number of subsets, with minimum SSE. However, a normalized measure of information loss (expressed as a percentage) is also used:

$$I\_{loss} = \frac{SSE}{SST} \times 100.\tag{3}$$

In terms of information loss, the worst-case scenario for microaggregation happens when all records in *D* are replaced in *D*′ by the average of the dataset (i.e., $SSE = SST \Rightarrow I_{loss} = 100$), and the best-case scenario implies that $D = D'$ (i.e., *k* = 1, no aggregation), which leads to $SSE = I_{loss} = 0$. Obviously, the latter case is optimal in terms of information loss, but it offers no privacy protection at all. Hence, values for the protection parameter *k* are greater than one, typically *k* = 3, 4, 5, or 6, and are chosen by privacy experts in statistical agencies so as to adapt to the needs of each particular dataset.
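As a quick illustration of Equations (1)–(3), the sketch below computes $I_{loss}$ for a given *k*-partition; the function name and input conventions are ours, for illustration only:

```python
import numpy as np

def information_loss(D, clusters):
    """I_loss = 100 * SSE / SST for a k-partition of the records in D.

    D: (n, p) array of records; clusters: list of index arrays, one per group."""
    D = np.asarray(D, dtype=float)
    sst = ((D - D.mean(axis=0)) ** 2).sum()                  # Equation (2)
    sse = sum(((D[idx] - D[idx].mean(axis=0)) ** 2).sum()    # Equation (1)
              for idx in clusters)
    return 100.0 * sse / sst                                 # Equation (3)
```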

#### *2.3. Basics on the Traveling Salesman Problem*

The Traveling Salesman Problem (TSP) [14] consists of finding a particular *Hamiltonian cycle*. The problem can be stated as follows: a salesman leaves from one city and wants to visit (exactly once) each other city in a given group and, finally, return to the starting city. The salesman wonders in what order he should visit these cities so as to travel the shortest possible total distance.

In terms of graph theory, the TSP can be modeled by a graph $G = (V, E)$, where cities are the nodes in set $V = \{v_1, v_2, \ldots, v_n\}$ and each edge $e_{ij} \in E$ has an associated weight $w_{ij}$ representing the distance between nodes *i* and *j*. The goal is to find a *Hamiltonian cycle*, i.e., a cycle which visits each node in the graph exactly once, with the least total weight. An alternative approach to the *Hamiltonian cycle* to solve the TSP is finding the *Shortest Hamiltonian path* through a graph (i.e., a path which visits each node in the graph exactly once). As an example, Figure 1 shows a short Hamiltonian path for the *Eurodist* dataset, which contains the distances (in km) between 21 cities in Europe.

**Figure 1.** A Hamiltonian path for the Eurodist dataset.

Finding an optimal solution to the TSP is known to be NP-Hard. Hence, several heuristics to find good but sub-optimal solutions have been developed. TSP heuristics typically fall into two groups: those involving minimum spanning trees for tour construction and those with edge exchanges to improve existing tours. There are numerous heuristics to solve the TSP [15,16]. In this article, we have selected a representative sample of heuristics, including well-known approaches and top performers from the state of the art: Nearest Neighbor, Repetitive Nearest Neighbor, Nearest Insertion, Farthest Insertion, Cheapest Insertion, Arbitrary Insertion, and Concorde (see Section 5.2 and Table 1; a minimal example of the first of these is sketched below).
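As an example of the tour-construction family, a nearest-neighbor heuristic can be sketched as follows (our naming; it assumes the complete graph is given as a symmetric distance matrix):

```python
import numpy as np

def nearest_neighbor_tour(dist, start=0):
    """Tour construction: from `start`, repeatedly visit the nearest
    unvisited node. `dist` is a symmetric (n, n) distance matrix."""
    n = len(dist)
    unvisited = set(range(n)) - {start}
    tour = [start]
    while unvisited:
        last = tour[-1]
        nxt = min(unvisited, key=lambda j: dist[last][j])
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour  # read cyclically (last -> first) to close it into a cycle

# toy usage on a 4-node instance
D = np.array([[0, 2, 9, 10],
              [2, 0, 6, 4],
              [9, 6, 0, 8],
              [10, 4, 8, 0]], dtype=float)
print(nearest_neighbor_tour(D))  # -> [0, 1, 3, 2]
```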


#### **3. Related Work on Microaggregation**

There is a wide variety of heuristics to solve the multivariate microaggregation problem in the literature. One of the most well-known methods is the Maximum Distance to Average Vector (MDAV), proposed by Domingo-Ferrer et al. [17]. This method iteratively creates clusters of *k* members considering the furthest records from the dataset centroid. A variant of MDAV was proposed by Laszlo et al., namely the Centroid-Based Fixed Size method (CBFS) [18], which also has optimized versions based on kd-tree neighborhood search, such as KD-CBFS and KD-CBFSapp [19]. The Two Fixed Reference Points (TFRP) method was proposed by Chang et al. [20]. It uses the two most extreme points of the dataset at each iteration as references to create clusters. Differential Privacy-based microaggregation was explored by Yang et al. [21], who created a variant of the MDAV algorithm that uses the correlations between attributes to select the minimum required noise to achieve the desired privacy level. In addition, V-MDAV, a variable group-size heuristic based on the MDAV method, was introduced by Solanas et al. in Reference [13] with the aim to relax the cardinality constraints of fixed-size microaggregation and allow clusters to better adapt to the data and reduce the SSE.

Laszlo and Mukherjee [18] approached the microaggregation problem through minimum spanning trees, aimed at creating graph structures that can be pruned according to each node's associated weights to create the groups. Lin et al. proposed a Density-Based Algorithm (DBA) [22], which first forms groups of records in density descending order, and then fine-tunes these groups in reverse order. The successive Group Selection based on sequential Minimization of SSE (GSMS) method [23], proposed by Panagiotakis et al., optimizes the information loss by discarding the candidate cluster that minimizes the current SSE of the remaining records. Some methods are built upon the HM algorithm. For example, Mortazavi et al. proposed the IMHM method [24]. Domingo-Ferrer et al. [17] proposed a grouping heuristic that combines several methods, such as Nearest Point Next (NPN-MHM), MDAV-MHM, and CBFS-MHM.

Other approaches have focused on the efficiency of the microaggregation procedure; for example, the Fast Data-oriented Microaggregation (FDM) method proposed by Mortazavi et al. [25] efficiently anonymizes large multivariate numerical datasets for multiple successive values of *k*. Interested readers can find more detailed information about microaggregation in References [5,26].

The work most similar to ours is the one presented in Reference [27] by Heaton and Mukherjee. The authors use TSP tour optimization heuristics (e.g., 2-opt, 3-opt) to refine a path created with the information of a multivariate microaggregation method (e.g., MDAV, MD, CBFS). Notice that, in our proposed method (described in the next section), we use tour construction TSP heuristics instead of optimization heuristics; thus, we eliminate the need for using a multivariate microaggregation method as a pre-processing step, and we decrease the computational time without hindering data utility.

#### **4. Our Method**

Our proposal is built upon two main building blocks: a TSP tour construction heuristic (*H*) and the optimal univariate microaggregation algorithm of Hansen and Mukherjee (HM). As we have already explained in Section 2, the HM algorithm is applied to univariate numerical data, because it requires the input elements to be in order. However, we virtually use it with multivariate data; when we do that, we refer to it as Multivariate Hansen-Mukherjee (MHM), although, in practice, the algorithm is univariate. Since our proposal is based on a heuristic (H) to obtain a Hamiltonian path and the MHM algorithm, we have come to call it HMHM-microaggregation or, for short, (*HM*)<sup>2</sup>-Micro.

Given a multivariate microdata set (*D*) with *p* columns and *r* rows, we model it as a complete graph *G*(*N*, *E*), where we assume that each row is represented by a node $n_i \in N$ (or a city, if we think in terms of the TSP), and each edge $e_{ij} \in E$ represents the Euclidean distance between $n_i$ and $n_j$ (or the distance between cities in TSP terms). Hence, we have a set of nodes $N = \{n_1, n_2, \ldots, n_r\}$, each representing a row of the microdata set in a multivariate space $\mathbb{R}^p$.

In a nutshell, we use *H* over *G* to create a Hamiltonian path ($H_{path}$) that travels across all nodes. $H_{path}$ is a permutation $\Pi_N = \{\pi_1^N, \pi_2^N, \ldots, \pi_r^N\}$ of the nodes in *N*, and *de facto* it determines an order for the nodes (i.e., it provides a sense of precedence between nodes). Hence, although *D* is multivariate, its rows represented as nodes in *N* can be sorted in a univariate permutation $H_{path}$ that we use as input to the MHM algorithm. As a result, the MHM algorithm returns the optimal univariate *k*-partition of $H_{path}$, that is, the set of disjoint subsets $S = \{s_1, s_2, \ldots, s_t\}$ defining the clusters of *N*. Hence, since each node $n_i$ represents a row in *D*, which is indeed multivariate, we have obtained a multivariate microaggregation of the rows in *D* and provided a solution for the multivariate microaggregation problem. Notice that, although MHM returns the optimal *k*-partition of $H_{path}$, it does not imply that the resulting microaggregation of *D* is optimal. A schematic of our solution is depicted in Figure 2.

**Figure 2.** Given a microdata dataset, we use a tour construction heuristic to generate a Hamiltonian path, which will be used as the input of the MHM method to generate the groups.
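To make the univariate step concrete, the Hansen-Mukherjee *k*-partition can be written in its well-known shortest-path/dynamic-programming form over rows that are already ordered by the Hamiltonian path. The sketch below is our illustration, not the authors' implementation; it assumes groups are contiguous runs of the given ordering with sizes between *k* and 2*k* − 1:

```python
import numpy as np

def mhm_partition(ordered_rows, k):
    """Optimal k-partition of rows taken in the given order (groups are
    contiguous runs of size k..2k-1), via the shortest-path/DP form of HM."""
    X = np.asarray(ordered_rows, dtype=float)
    n, p = X.shape
    # prefix sums give O(1) within-group SSE for any contiguous run
    s1 = np.vstack([np.zeros(p), np.cumsum(X, axis=0)])
    s2 = np.concatenate([[0.0], np.cumsum((X * X).sum(axis=1))])

    def run_sse(i, j):  # SSE of rows i..j-1 (half-open, 0-based)
        m = j - i
        mean = (s1[j] - s1[i]) / m
        return (s2[j] - s2[i]) - m * (mean @ mean)

    INF = float("inf")
    cost = [0.0] + [INF] * n   # cost[j]: best SSE for a valid partition of rows 0..j-1
    back = [0] * (n + 1)
    for j in range(k, n + 1):
        for i in range(max(0, j - 2 * k + 1), j - k + 1):
            c = cost[i] + run_sse(i, j)
            if c < cost[j]:
                cost[j], back[j] = c, i
    groups, j = [], n          # backtrack (assumes n >= k, so a partition exists)
    while j > 0:
        groups.append((back[j], j))
        j = back[j]
    return list(reversed(groups))
```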

Although the foundation of our proposal described above is pretty straightforward, it has the beauty of putting together complex mathematical building blocks from the multivariate and univariate worlds in a simple yet practical manner. In addition, our solution is very flexible, since it allows the use of any heuristic *H* to create the Hamiltonian path *Hpath*, and it allows for comprehensive studies, such as the one we report in the next section.

Note that most TSP heuristics output a Hamiltonian cycle. However, since we need a Hamiltonian path, we use the well-known solution of adding a dummy node to the graph (i.e., a theoretical node whose distance to all other nodes is zero), and we cut the cycle by eliminating this node, so as to obtain a Hamiltonian path; a sketch of this trick follows.
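Assuming any heuristic that returns a node order read cyclically (such as the nearest-neighbor sketch above), the dummy-node cut might look like this (function name is ours):

```python
import numpy as np

def cycle_to_path(dist, tsp_heuristic):
    """Pad the distance matrix with a dummy node at zero distance from all
    other nodes, run the heuristic, then rotate the cycle and drop the dummy."""
    n = len(dist)
    padded = np.zeros((n + 1, n + 1))
    padded[:n, :n] = dist                  # dummy row/column stay all-zero
    cycle = tsp_heuristic(padded)          # node order, read as a cycle
    i = cycle.index(n)                     # locate the dummy node
    return cycle[i + 1:] + cycle[:i]       # Hamiltonian path over the real nodes
```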

For the sake of completeness, we summarize our proposal step-by-step in Algorithm 1, and we next comment on it. Our solution can be seen as a meta-heuristic to solve the multivariate microaggregation problem, since it can accommodate any heuristic (*H*) able to create a Hamiltonian cycle from a complete graph (*G*), and it can deal with any privacy parameter (*k*). Thus, our algorithm receives as input a numerical multivariate microdata set *D* with *p* columns (attributes) and *r* rows that have to be microaggregated, a heuristic *H*, and a privacy parameter *k* (see Algorithm 1: line 1). In order to avoid bias towards higher-magnitude variables, the original dataset *D* (understood as a matrix) is standardized by subtracting from each element the average of its column and dividing it by the standard deviation of the column. The result is a standardized dataset $D_{std}$ in which each column has zero mean and unit standard deviation (see Algorithm 1: line 2). Next, the distance matrix $M_{dist}$ is computed. Each element $m_{ij} \in M_{dist}$ contains the Euclidean distance between row *i* and row *j* in $D_{std}$; hence, $M_{dist}$ is a square matrix ($r \times r$) (see Algorithm 1: line 3). In order to be able to cut the Hamiltonian cycle and obtain a Hamiltonian path, we add a dummy node to the dataset by adding a zero column and a zero row to $M_{dist}$ and generate $M_{dist}^{dum}$, which is a square matrix ($(r+1) \times (r+1)$) (see Algorithm 1: line 4). $M_{dist}^{dum}$ is, in fact, a weighted adjacency matrix that defines a graph $G(N, E)$ with nodes $N = \{n_1, \ldots, n_{r+1}\}$ and edges $E = \{e_{1,1}, \ldots, e_{i,j}, \ldots, e_{r+1,r+1}\} = \{M^{dum}_{dist\,1,1}, \ldots, M^{dum}_{dist\,r+1,r+1}\}$. With this matrix as an input, we can compute a Hamiltonian cycle $H_{cycle}$ on *G* by applying a TSP heuristic *H* (see Algorithm 1: line 5). Notice that this heuristic *H* can be any one that takes a weighted graph as input and returns a Hamiltonian cycle; some examples are Concorde, Nearest Neighbor, Repetitive Nearest Neighbor, and the insertion algorithms. After obtaining $H_{cycle}$, we cut it by removing the dummy node (see Algorithm 1: line 6), and we obtain a Hamiltonian path $H_{path}$ that defines a permutation $\Pi_N = \{\pi_1^N, \pi_2^N, \ldots, \pi_r^N\}$ of the nodes in *N*, as well as determines an order for the nodes that can be input to the MHM algorithm to obtain its optimal *k*-partition *S* (see Algorithm 1: line 7). *S* is a set of disjoint subsets $S = \{s_1, s_2, \ldots, s_t\}$ defining the clusters of nodes in *N*. Hence, with *S* and *D*, we can create a microaggregated dataset *D*′ by replacing each row in *D* by the average vector of the *k*-partition subset to which it belongs (see Algorithm 1: line 8).

After applying the algorithm, we have transformed the original dataset *D* into a dataset *D*′ that has been microaggregated so as to guarantee the privacy criteria established by *k*.

#### **Algorithm 1** (*HM*)<sup>2</sup>-Micro

1: **function** (*HM*)<sup>2</sup>-MICRO(Microdata set *D*, TSP-Heuristic *H*, Privacy Parameter *k*)
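Putting the pieces together, lines 2–8 of Algorithm 1 can be sketched as follows (an illustration with our helper names, reusing `cycle_to_path`, `mhm_partition`, and a tour heuristic such as `nearest_neighbor_tour` from the sketches above; not the authors' implementation):

```python
import numpy as np

def hm2_micro(D, tour_heuristic, k):
    """Sketch of (HM)^2-Micro: standardize, build distances, tour, cut, MHM."""
    D = np.asarray(D, dtype=float)
    D_std = (D - D.mean(axis=0)) / D.std(axis=0)        # line 2: standardize
    diff = D_std[:, None, :] - D_std[None, :, :]
    M_dist = np.sqrt((diff ** 2).sum(axis=2))           # line 3: distance matrix
    path = cycle_to_path(M_dist, tour_heuristic)        # lines 4-6: dummy node + cut
    groups = mhm_partition(D_std[path], k)              # line 7: optimal k-partition
    D_prime = D.copy()
    for a, b in groups:                                 # line 8: centroid replacement
        idx = path[a:b]
        D_prime[idx] = D[idx].mean(axis=0)
    return D_prime

# e.g., D_protected = hm2_micro(dataset, nearest_neighbor_tour, k=3)
```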


#### **5. Experiments**

With the aim to practically validate the usefulness of our multivariate microaggregation proposal, we have thoroughly tested it on six datasets (described in Section 5.1) that serve as benchmarks. In addition, we are interested in knowing whether, and to what extent, our method outperforms the best performing microaggregation methods in the literature. Hence, we have compared our proposal with these methods (described in Section 5.2), and the results of all these tests are summarized in Section 5.3. Overall, considering four different values for the privacy parameter *k* ∈ {3, 4, 5, 6}, ten microaggregation algorithms, 50 repetitions per case, and six datasets, we have run over 12,000 microaggregation tests, which allows us to provide a statistically solid set of results.

#### *5.1. Datasets*

We used six datasets as benchmarks for our experiments. We can classify those datasets into two main groups: The first group comprises three well-known SDC microdata sets that have been used for years as benchmarks in the literature, namely "Census", "EIA", and "Tarragona". The second group comprises three mobility datasets containing real GPS traces from three Spanish cities, namely "Barcelona", "Madrid", and "Tarraco". Notice that we use the term "Tarraco", the old Roman name for the city of Tarragona, in order to avoid confusion with the classic benchmark dataset "Tarragona". The features of each dataset are next summarized:

The Census dataset was obtained using the public *Data Extraction System of the U.S. Census Bureau*. It contains 1080 records with 13 numerical attributes. The Tarragona dataset was obtained from the Tarragona Chamber of Commerce. It contains information on 834 companies in the Tarragona area with 13 variables per record. The EIA dataset was obtained from the U.S. Energy Information Authority, and it consists of 4092 records with 15 attributes. More details on the aforementioned datasets can be obtained in Reference [28].

The Barcelona, Madrid, and Tarraco datasets consist of OpenStreetMap [29] GPS traces collected from those cities: Barcelona contains the GPS traces of the city of Barcelona within the area determined by the parallelogram formed by latitude (41.3726866, 41.4078446) and longitude (2.1268845, 2.1903992). The dataset has 969 records with 30 GPS locations each. Madrid contains the GPS traces of the city of Madrid within the area determined by the parallelogram formed by latitude (40.387613, 40.483515) and longitude (−3.7398145, −3.653985). The dataset has 959 records with 30 GPS locations each. Tarraco contains the GPS traces of the city of Tarragona within the area determined by the parallelogram formed by latitude (41.0967083, 41.141174) and longitude (1.226008, 1.2946691). The dataset has 932 records with 30 GPS locations each.

In all trajectories datasets, each record consists of 30 locations represented as (latitude, longitude) pairs. Hence, each record has 60 numerical values. These locations were extracted from each corresponding parallelogram according to the number of recorded tracks and their length.

All datasets are available on our website: https://www.smarttechresearch.com/publications/symmetry2021-Maya-Casino-Solanas/ (accessed on 1 May 2021).


**Table 1.** Comparing methods and features. For Concorde, *M* is a bound on the time to explore subproblems, *b* is a branching factor, and *d* is a search depth.

#### *5.2. Compared Methods*

We have selected a representative set of well-known and state-of-the-art methods to assess the value of our approach. We have selected two classic microaggregation methods (i.e., **MDAV** and **V-MDAV**) as baselines. In the case of V-MDAV, the method was run for several values of *γ* ∈ [0, 2], and the best result is reported. Although some other newer methods might have achieved better results, they are still landmarks that deserve to be included in any microaggregation comparison.

For newer and more sophisticated methods, we have considered the work of Heaton and Mukherjee [27], in which they study a variety of microaggregation heuristics, including methods such as CBFS and MD. Thus, instead of comparing our proposal with all those methods, we have taken the method that Heaton and Mukherjee reported as the best performer, namely the **MDAV-LK-MHM** method. This method, which is based on MDAV, first creates a microaggregation using MDAV, next improves the result of MDAV by applying the LK heuristic, and finally applies MHM to obtain the resulting microaggregation (cf. Reference [27] for further details on the algorithm).

Regarding our proposal (i.e., (*HM*)<sup>2</sup>**-Micro**), as we already discussed, it can be understood as a meta-heuristic able to embody any heuristic *H* that returns a Hamiltonian cycle. Hence, with the aim to determine the best heuristic, we have analyzed seven alternatives, namely **Nearest Neighbor**, **Repetitive Nearest Neighbor**, **Nearest Insertion**, **Farthest Insertion**, **Cheapest Insertion**, **Arbitrary Insertion**, and (our suggestion) **Concorde**. Table 1 summarizes some features of all selected methods, including the reference to the original article where each method was described. For our method, each reference points to the article describing the TSP heuristic.

The implementations of all these methods used the *R* package *sdcMicro* [28], the TSP heuristics implemented in Reference [32], and the LK heuristic implemented in Reference [33]. LK was configured so that the algorithm runs once at each iteration (parameter RUN = 1) until a local optimum is reached. The same criterion was followed for the other TSP heuristics. In this regard, the heuristics we used consider a random starting node at each run. Hence, each experiment was repeated 50 times to guarantee statistically sound outcomes regardless of this random starting point.

#### *5.3. Results Overview*

By using the datasets and methods described above, we have analyzed the Information Loss (expressed as a percentage) as a measure of data utility (cf. Section 2 for details). It is assumed that, given a privacy parameter *k* that guarantees that the microaggregated dataset is *k*-anonymous, the lower the Information Loss, the better the result and performance of the microaggregation method. The results are reported in Tables 2–7 with the best (lowest) information loss highlighted in green.

Overall, it can be observed that our method, (*HM*)<sup>2</sup>-Micro with the Concorde heuristic, is the best performer in 79% of the experiments, and it is the second best in the remaining 21% (for which MDAV-LK-MHM outperforms it by a narrow margin of less than 2%). Interestingly enough, although (*HM*)<sup>2</sup>-Micro with either Nearest Insertion or Farthest Insertion is not the best performer in any experiment, it outperforms MDAV-LK-MHM 50% of the time. The rest of the methods obtain less consistent results and depend highly on the dataset.

When we analyze the results more closely for each particular dataset, we observe that, in the case of the "Census" dataset (cf. Table 2), our method with Concorde outperforms all methods for all values of *k*. In addition, despite the random nature of TSP heuristics, the values of *σ* are very stable, denoting the robustness of all methods, yet slightly higher on average in the case of the methods with higher Information Loss. It is worth emphasizing, though, that in all runs our method with Concorde and the MDAV-LK-MHM method obtained better results than MDAV and V-MDAV (i.e., the max values obtained in all runs are lower than the outcomes obtained by MDAV and V-MDAV).


**Table 2.** Information Loss obtained on the Census dataset.

For the "EIA" dataset (cf. Table 3), MDAV-LK-MHM is the best performer for all values of *k* except *k* = 5, for which our proposal with Concorde performs better. In this case, the results obtained by these two methods are very close. Similarly to the results in "Census", the max values obtained by these two methods outperform MDAV and V-MDAV. In the case of "Tarragona" (cf. Table 4), our method with Concorde outperforms all other methods. Surprisingly, both MDAV and V-MDAV obtain better results than MDAV-LK-MHM, which performs poorly in this dataset.


**Table 3.** Information Loss obtained on the EIA dataset.

**Table 4.** Information Loss obtained on the Tarragona dataset.


So, it can be concluded that the overall winner for the classical benchmarks (i.e., Census, EIA, and Tarragona) is our method, (*HM*)<sup>2</sup>-Micro with the Concorde heuristic, which is only marginally outperformed by MDAV-LK-MHM on the EIA dataset.

Regarding the other three datasets containing GPS traces (i.e., Barcelona, Madrid, and Tarraco), our method, (*HM*)<sup>2</sup>-Micro with the Concorde heuristic, is the best performer in 83% of the cases and comes second best in the remaining 17%. For the Barcelona dataset (cf. Table 5), MDAV-LK-MHM and (*HM*)<sup>2</sup>-Micro with the Concorde heuristic perform very well and similarly. The methods with the worst Information Loss are MDAV and V-MDAV. Our method, (*HM*)<sup>2</sup>-Micro with the insertion heuristics, has a remarkable performance, obtaining values similar to those of MDAV-LK-MHM and Concorde. Nevertheless, it is worth noting that the max (worst) values obtained by MDAV-LK-MHM and Concorde are still better than the averages obtained by the other methods. In the case of the Madrid dataset (cf. Table 6), our method, (*HM*)<sup>2</sup>-Micro with the Concorde heuristic, achieves the minimum (best) value of Information Loss for all values of *k*. We can also observe that our method with insertion heuristics offers higher performance than MDAV-LK-MHM. Finally, the results for the Tarraco dataset (cf. Table 7) show that the minimum (best) Information Loss value is obtained by our method with the Concorde heuristic in all cases. In this case, MDAV-LK-MHM performs poorly, and, for *k* = 3 and *k* = 4, MDAV and V-MDAV are better.



**Table 6.** Information Loss obtained on the Madrid dataset.


**Table 7.** Information Loss obtained on the Tarraco dataset.


We have already discussed that all studied methods (with the exception of MDAV and V-MDAV) have a non-deterministic component emerging from the random selection of the initial node. This random selection affects the performance of the final microaggregation obtained. With the aim to analyze the effect of this non-deterministic behavior, we have studied the standard deviation of all methods for all values of *k* and for all datasets. In addition, we have visually inspected the variability of the results by using box plot diagrams.

Since the results are quite similar and consistent across all datasets, for the sake of clarity, we only reproduce here the box plots for the "Census" dataset (see Figure 3), and we leave the others in Appendix A for the interested reader.

**Figure 3.** Information Loss variability for each value of *k* over the Census dataset.

In Figure 3, we can observe that the Information Loss values increase with *k*, but all methods have the same behavior regardless of the value of *k*. In addition, it is clear that the most stable methods are (*HM*)<sup>2</sup>-Micro with Concorde and MDAV-LK-MHM.

Overall, we observe some expected differences depending on the datasets. However, the behavior of the best performing methods is stable. Particularly, the datasets with GPS traces (i.e., Barcelona, Madrid, and Tarraco) show more stable results. In summary, the best method was our (*HM*)<sup>2</sup>-Micro with Concorde, exhibiting the most stable results across all datasets.

#### **6. Discussion**

Over the previous sections, we have presented our microaggregation method, (*HM*)<sup>2</sup>-Micro, its rationale, and its performance against other classic and state-of-the-art methods on a variety of datasets. In the previous section, we have reported the main results, and we will discuss them next by progressively answering the research questions that we posed in the Introduction of the article.

*Q1: How to create a suitable ordering for a univariate microaggregation algorithm when the records are in* R*<sup>p</sup>?*

A main takeaway of this article is that, by using a combination of TSP tour construction heuristics (e.g., Concorde) and an optimal univariate microaggregation algorithm, we are properly ordering multivariate datasets in a univariate fashion that leads to excellent multivariate microaggregation solutions. Other approaches to order R*<sup>p</sup>* points might consider projecting them over the principal component. However, the information loss associated with this approach makes it unsuitable. In addition, other more promising approaches, like the one used in MDAV-LK-MHM, first create a *k*-partition and set an order based on maximum distance criteria. Although this approach might work well in some cases, we have clearly seen that Hamiltonian paths created by TSP-heuristics, like Concorde, outperform this approach. Hence, based on the experiments of Section 5, we can conclude that TSP-heuristics, like Concorde, provide an order for elements in R*<sup>p</sup>* that is suitable for an optimal univariate microaggregation algorithm to output a consistent multivariate microaggregation solution with low Information Loss (i.e., high data utility). Moreover, from all analyzed heuristics, it is clear that the best performer is Concorde, followed by insertion heuristics.

*Q2: Are the length of the Hamiltonian path and the information loss of the microaggregation related?, or Do shorter Hamiltonian paths lead to microaggregation solutions with lower information loss?*

When we started this research, our intuition was that good heuristic solutions of the TSP (i.e., those with shorter path lengths) would provide a Hamiltonian path, that could be used as an ordered vector for the HM optimal univariate microaggregation algorithm, resulting in a good multivariate microaggregation solution. From this intuition, we assumed that shorter Hamiltonian paths would lead to lower Information Loss in microaggregated datasets.

In order to validate (or disprove) this intuition, we have analyzed the Pearson correlation between the Hamiltonian path length obtained by all studied heuristics (i.e., Nearest Neighbor, Repetitive Nearest Neighbor, Nearest Insertion, Farthest Insertion, Cheapest Insertion, Arbitrary Insertion, and Concorde) and the SSE of the resulting microaggregation. We have done so for all studied datasets and *k* values. The results are summarized in Table 8, and all plots along with a trend line are available in Appendix B.


**Table 8.** Summary of the Pearson correlation between Path Length and SSE.

From the correlation analysis, it can be concluded that there is a positive correlation between the Hamiltonian path length and the SSE. That is, the shorter the path length, the lower the SSE. This statement holds for all *k* and for all datasets (although Census exhibits a lower correlation). Hence, although this result is not a proof of causality, it can be safely said that good solutions to the TSP lead to good solutions to the multivariate microaggregation problem. In fact, the best heuristic (i.e., Concorde) always results in the lowest (best) SSE.
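The correlation itself is straightforward to compute. Below is a minimal sketch with synthetic stand-in data (in the article, the inputs would be the per-run path lengths and the resulting SSE values):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical stand-ins for measurements collected over 50 runs of a heuristic
path_lengths = rng.normal(100.0, 5.0, size=50)
sse_values = 2.0 * path_lengths + rng.normal(0.0, 3.0, size=50)

r = np.corrcoef(path_lengths, sse_values)[0, 1]   # Pearson correlation coefficient
print(f"Pearson r = {r:.3f}")
```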

Interested readers can find all plots in Appendix B. However, for the sake of clarity, let us illustrate this result by discussing the case of the Madrid dataset with *k* = 6, depicted in Figure 4. In the figure, the positive correlation is apparent. In addition, it is clear that heuristics tend to form clusters. In a nutshell, the best heuristic is Concorde, followed by the insertion family of methods (i.e., Nearest Insertion, Farthest Insertion, Cheapest Insertion, and Arbitrary Insertion), followed by Repetitive Nearest Neighbor and Nearest Neighbor.

Although Figure 4 clearly illustrates the positive correlation between the path length and the SSE, it also shows that heuristics tend to cluster and might indicate that not only the path but the heuristic (per se) plays a role in the reduction of the SSE. This indication leads us to our next research question.

**Figure 4.** Relation between SSE and Path Length for Madrid and *k* = 6.

*Q3: Is the length of the Hamiltonian path the only factor affecting information loss or does the particular construction of the path (regardless of the length) affect the information loss?*

In the previous question, we have found clear positive correlation between the path length and the SSE. However, we have also observed apparent clusters suggesting that the very heuristics could be responsible for the minimization of the SSE. In other words, although the path length and SSE are positively correlated when all methods are analyzed together, would this correlation hold when heuristics are analyzed one at a time? In order to answer this question, we have analyzed the results of each heuristic individually, and we have observed that there is still positive correlation between path length and SSE, but it is very weak or almost non-existent (i.e., very close to 0), as Figure 5 illustrates.

**Figure 5.** Correlation between path length and SSE for each individual method (from top to bottom: Cheapest Insertion, Concorde, and Nearest Neighbor) for *k* = 3 over the Madrid dataset.

The results shown in Figure 5 are only illustrative, and a deeper analysis that is out of the scope of this paper would be necessary. However, our initial results indicate that, although there is positive correlation between path length and SSE globally, this correlation weakens significantly when analyzed for each heuristic individually. This result suggests that it is not only the length of the path but also the way in which the path is constructed that affects the SSE. This would explain why similar methods (e.g., those based on insertion) behave similarly in terms of SSE although their paths' lengths vary.

*Q4: Does* (*HM*)<sup>2</sup>*-Micro provide better solutions (in terms of information loss) than the best performing microaggregation methods in the literature?*

This question has already been answered in Section 5.3. However, for the sake of completeness, we summarize it here: the results obtained after executing more than 12,000 tests suggest that our solution, (*HM*)<sup>2</sup>-Micro, obtains better results than classic microaggregation methods, such as MDAV and V-MDAV. Moreover, when (*HM*)<sup>2</sup>-Micro uses the Concorde heuristic to determine the Hamiltonian path, it consistently outperforms the best state-of-the-art methods. In our experiments, (*HM*)<sup>2</sup>-Micro with Concorde was the best performer 79% of the time and was the second best in the remaining 21%.

*Q5: Do TSP-based microaggregation methods perform better than current solutions on trajectories datasets?*

(*HM*)<sup>2</sup>-Micro with Concorde is the best overall performer. Moreover, if we focus on those datasets with trajectory data (i.e., Barcelona, Madrid, and Tarraco), the results are even better: it is the best performer in 83% of the tests and the second best in the remaining 17%. This good behavior of the method could result from the very foundations of the TSP; however, there is still plenty of research to do in this line to reach more solid conclusions. Location privacy is a very complex topic that encompasses many nuances beyond *k*-anonymity models (such as the one followed in this article). However, this result is an invigorating first step towards the analysis of novel microaggregation methods applied to trajectory analysis and protection.

#### **7. Conclusions**

Microaggregation has been studied for decades now, and, although finding the optimal microaggregation is NP-Hard and a polynomial-time microaggregation algorithm has not been found, steady improvements over microaggregation heuristics have been made. Hence, after such a long process of research and polishing, finding new solutions that improve on the best methods is increasingly difficult. In this article, we have presented (*HM*)<sup>2</sup>-Micro, a meta-heuristic that leverages the advances in TSP solvers and combines them with optimal univariate microaggregation to create a flexible and robust multivariate microaggregation solution.

We have studied our method and thoroughly compared it to classic and state-of-the-art microaggregation algorithms over a variety of classic benchmarks and trajectories datasets. Overall, we have executed more than 12,000 tests, and we have shown that our solution embodying the Concorde heuristic outperforms the others. Hence, we have shown that our TSP-inspired method could be used to guarantee *k*-anonymity of trajectories datasets whilst reducing the Information Loss, thus increasing data utility. Furthermore, our proposal is very stable, i.e., its performance does not change significantly regardless of the random behavior associated with initial node selection.

In addition to proposing (*HM*)<sup>2</sup>-Micro, we have found clear correlations between the length of Hamiltonian paths and the SSE introduced by microaggregation processes, and we have shown the importance of the Hamiltonian cycle construction algorithms for the overall performance of microaggregation.

Despite these relevant results, there is still much to do in the study of microaggregation and data protection. Future work will focus on scaling up (*HM*)<sup>2</sup>-Micro to high-dimensional and very large datasets. Considering the growing importance of Big Data and Cloud Computing, adapting our solution to distributed computation environments is paramount. Moreover, adjusting TSP heuristics to leverage lightweight microaggregation-based approaches is an interesting research path to follow. In addition, although the values of the privacy parameter *k* are typically low (i.e., 3, 4, 5, 6), we plan to study the effect of larger values of *k* on our solution. Last but not least, since microaggregation is essentially a data-oriented procedure, we will study how our solution adapts to data structures from specific domains, such as healthcare, transportation, energy, and the like.

All in all, with (*HM*)<sup>2</sup>-Micro, we have set the ground for the study of multivariate microaggregation meta-heuristics from a new perspective that might continue in the years to come.

**Author Contributions:** Conceptualization, A.M.-L. and A.S.; methodology, A.S.; software, A.M.- L.; validation, A.M.-L., F.C. and A.S.; formal analysis, A.S.; investigation, A.M.-L., F.C. and A.S.; resources, A.S.; data curation, A.M.-L.; writing—original draft preparation, A.M.-L. and A.S.; writing review and editing, F.C. and A.S.; visualization, F.C.; supervision, F.C. and A.S.; project administration, A.S.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the European Commission under the Horizon 2020 Programme (H2020), as part of the project *LOCARD* (https://locard.eu (accessed on 1 May 2021)) (Grant Agreement no. 832735), and by the Spanish Ministry of Science & Technology with project IoTrain RTI2018-095499-B-C32. The content of this article does not reflect the official opinion of the European Union. Responsibility for the information and views expressed therein lies entirely with the authors.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Dataset and high-resolution data can be found at: https://www.smarttechresearch.com/publications/symmetry2021-Maya-Casino-Solanas/ (accessed on 1 May 2021).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Information Loss Variability Box Plots**

**Figure A1.** Information Loss variability for each value of *k* over the Census dataset.

**Figure A2.** Information Loss variability for each value of *k* over the EIA dataset.

**Figure A3.** Information Loss variability for each value of *k* over the Tarragona dataset.

**Figure A4.** Information Loss variability for each value of *k* over the Barcelona dataset.

**Figure A5.** Information Loss variability for each value of *k* over the Madrid dataset.

**Figure A6.** Information Loss variability for each value of *k* over the Tarraco dataset.

#### **Appendix B. Correlation Analysis between "Path Length" and "SSE"**

**Figure A7.** Relation between SSE and Path Length for Census and *k* ∈ {3, 4, 5, 6}.

**Figure A8.** Relation between SSE and Path Length for EIA and *k* ∈ {3, 4, 5, 6}.

**Figure A9.** Relation between SSE and Path Length for Tarragona and *k* ∈ {3, 4, 5, 6}.

**Figure A10.** Relation between SSE and Path Length for Barcelona and *k* ∈ {3, 4, 5, 6}.

**Figure A11.** Relation between SSE and Path Length for Madrid and *k* ∈ {3, 4, 5, 6}.

**Figure A12.** Relation between SSE and Path Length for Tarraco and *k* ∈ {3, 4, 5, 6}.

#### **References**

