*Article* **Developing a New Robust Swarm-Based Algorithm for Robot Analysis**

#### **Abubakar Umar <sup>1</sup>, Zhanqun Shi <sup>1,\*</sup>, Alhadi Khlil <sup>1</sup> and Zulfiqar I. B. Farouk <sup>2</sup>**


Received: 1 November 2019; Accepted: 2 January 2020; Published: 22 January 2020

**Abstract:** Metaheuristics are incapable of analyzing robot problems without being enhanced, modified, or hybridized. Enhanced metaheuristics reported in other works of literature are problem-specific and often not suitable for analyzing other robot configurations. The parameters of standard particle swarm optimization (SPSO) were shown to be incapable of resolving robot optimization problems. A novel algorithm for robot kinematic analysis with enhanced parameters is hereby presented. The algorithm is capable of analyzing all the known robot configurations. This was achieved by studying the convergence behavior of PSO under various robot configurations, with a view to determining new PSO parameters for robot analysis and a suitable adaptive technique for parameter identification. Most of the parameters tested stagnated in the vicinity of strong local minimizers. A few parameters escaped stagnation but were incapable of finding the global minimum solution; this is undesirable because accuracy is an important criterion for robot analysis and control. The algorithm was trained to identify stagnating solutions. The algorithm proposed herein was found to compete favorably with other algorithms reported in the literature. There is great potential for further expanding the findings herein to dynamic parameter identification.

**Keywords:** PSO; robot; manipulator; analysis; kinematic parameters; identification

#### **1. Introduction**

The quest for improved techniques for parameter identification of industrial robots has resulted in the novel concept of a mutating particle swarm optimization (MuPSO) based algorithm for analyzing multi-degree-of-freedom robot manipulators, which was briefly introduced in [1]. That research sought to employ artificial intelligence, particularly population-based Evolutionary Algorithms (EA), and computational methods for solving the kinematic and dynamic problems of industrial manipulators. A robot manipulator is an electro-mechanical device modeled on the human upper limb. It was originally used in industrial workspaces to carry out tasks deemed boring, repetitive, highly monotonous, or dangerous, and therefore unsuitable for human labor. Recent applications of robot manipulators include aeronautics and medicine, where tolerances are very tight and human errors could be fatal. A robot manipulator comprises rigid links connected by joints which allow either rotational or translational motion between successive links.

The increasing demand for robot manipulators requires that manipulators become more autonomous, with correspondingly greater accuracy and stability. The kinematic problems of robot manipulators were traditionally computed using analytical techniques, which sometimes required finding the derivatives of computationally expensive functions. These problems were found to sometimes possess multiple solutions or even no solution at all. Recently, swarm-based techniques have been studied which promise improved computational efficiency.

#### *1.1. Swarm Intelligence*

Evolutionary computation algorithms (EA) are stochastic optimization methods which have proven suitable for solving the complex structured optimization and combinatorial problems typical of robot analysis. They are biologically inspired population-based techniques with relatively simple structures that are robust and computationally efficient. A variety of these algorithms have been developed over the years, but owing to its simple implementation and its ability to readily combine with other algorithms, the PSO algorithm stands out. PSO was initially introduced by [2]; despite being amongst the earliest EA algorithms, PSO remains relevant as it is still being improved, enhanced, and modified for solving real-world optimization problems.

In additive manufacturing (3D printing), over-hanging features can only be constructed by introducing support structures beneath the overhang, which are removed afterward to obtain the desired shape; [3] used a hybrid of PSO and a greedy algorithm to reduce the volume of the support structure, thereby saving printing time and material and minimizing cost. The recent surge in groundbreaking fifth-generation (5G) wireless communication technology presents the need to improve quality of service with massive multiple-input multiple-output (MIMO) antenna arrays; [4] used a contraction adaptive PSO to optimize the design and positions of antenna array elements. Inspired by the success of the proportional integral derivative (PID) controller in automation and its vast industrial applications, [5–7] attempted to use PID techniques to improve the performance of PSO. Reference [8] proposed the novel PID-based strategy PSO (PBSPSO) algorithm, which was found to improve convergence while reducing stagnation of the PSO algorithm. A new variant of PSO with cross-over operation (PSOCO) was introduced by [9], which improved the divergent search abilities of PSO while avoiding stagnation by implementing a new learning model for the particles' velocity formula and two cross-over operations, while [10] used PSO and multi-objective PSO to develop a two-stage auto-tuning technique for PID controllers. Automation of logistics, maintenance/support, storage, and other services has led to rapid improvements in vehicle routing problem (VRP) algorithms. The pick-up and delivery problem (PDP) is an extension of the VRP which collects goods from suppliers or pick-up points and conveys them to delivery points; [11] introduced a novel pick-up and delivery problem with transfers (PDPT) algorithm using a hybrid PSO and local search algorithm to minimize distance and maximize profit. A PSO variant based on random perturbation (RP-PSO) was used in [12] to identify the parameters of a model pressurized water reactor nuclear power plant. The estimation of distribution algorithm (EDA) framework was demonstrated in [13] to have high performance despite low memory requirements; it was combined with PSO in [14] to estimate and preserve the distribution information of the particles' historical memories (personal best positions) and help the algorithm break out of local minimum solutions. The particle swarm estimation of distribution algorithm (PSDA) was also implemented in [15] for optimal-driven-projection of automated medical diagnosis and prognosis. Medical diagnosis is a key process in clinical medicine for identifying diseases, reducing cost, and enhancing accuracy; enhanced algorithms were also exploited in [16–18] for diagnosis. Still in medical sciences, minimally invasive surgery is a cost-effective alternative to open surgery in which specialized instruments operate through several tiny punctures instead of one large incision. Reference [19] combined PSO with a back-propagation neural network (BPNN) algorithm to optimize the target position of a medical puncture robot.

#### *1.2. Particle Swarm Optimization*

The PSO consists of population members known as particles. Each particle represents a bird in a flock, a fish in a swarm, or, in this case, a possible solution to an optimization problem. The algorithm is initiated by populating it with *n* random particles, each with *m* dimensions; for robot analysis, the dimensions are regarded as the degrees of freedom (DOF). The position and velocity vectors of the *i*th particle are defined as *Xi* and *Vi* in Equations (1) and (2) below. The position and velocity of every particle in the swarm are updated according to Equations (3) and (4) through every iteration. The first part of Equation (3) is the previous velocity, which describes the particle's previous experience; the second part is the cognitive component, which describes the particle's personal experience; and the third part is the social component, which describes the entire swarm's best experience. The inertia weight *w* is a learning coefficient associated with the previous velocity, while *c*<sub>1</sub> and *c*<sub>2</sub> are learning coefficients associated with the cognitive and social components, respectively. The fitness function is a mathematical representation of the real-world problem to be analyzed; it evaluates how well the particles adapt to the actual solution of the problem. The leader of the swarm (fittest particle) is the particle that best adapts to the solution, and it is updated through every iteration. The algorithm keeps a record of the solution of the fittest particle and of the best position achieved by each particle. The personal best position ever achieved by the *i*th particle and the global best position of the swarm are defined as *PiBest* and *GiBest* in Equations (5) and (6). The particles in the swarm consistently move towards the solution of the problem through every iteration by updating the particle's position, Equation (4), towards these two best memories (*PiBest* and *GiBest*).

$$X_i = (x_{i1}, x_{i2}, \dots, x_{im}),\tag{1}$$

$$V_i = (v_{i1}, v_{i2}, \dots, v_{im}),\tag{2}$$

$$V_i(t+1) = w \cdot r_0 \cdot V_i(t) + c_1 \cdot r_1\left(P_{iBest}(t) - X_i(t)\right) + c_2 \cdot r_2\left(G_{iBest}(t) - X_i(t)\right),\tag{3}$$

$$X_i(t+1) = X_i(t) + V_i(t+1),\tag{4}$$

$$P_{iBest} = (p_{iBest,1}, p_{iBest,2}, \dots, p_{iBest,m}),\tag{5}$$

$$G_{iBest} = (g_{iBest,1}, g_{iBest,2}, \dots, g_{iBest,m}),\tag{6}$$

where *r<sub>j</sub>* is a randomly generated number in [0, 1] and *j* ∈ {0, 1, 2}. PSO has a high convergence speed, which is very desirable for robot applications, but this convergence speed sometimes results in stagnation, a major limitation of the PSO algorithm. The characteristic of the PSO algorithm that allows it to define promising regions in the search space is referred to as exploration, while exploitation allows it to refine solutions within the defined promising region. These are the major characteristics of any PSO algorithm; a good algorithm balances these properties to find the best solution to a problem while avoiding stagnation. The contributions of [20] showed that these properties can be tuned by carefully selecting the value of *w* of the PSO algorithm. The concept of the constriction coefficient was introduced, setting the inertia weight at 0.712 while the cognitive and social coefficients were both 1.494. This version of the PSO has come to be known today as the standard PSO (SPSO). The biological background of SPSO is believed to have evolved from the bird-like objects, or BOIDS, introduced by [21] to simulate flocking birds or animals in virtual reality studios. The BOIDS were governed by three basic rules: separation, alignment, and cohesion. The SPSO ignored the alignment and cohesion rules to reduce computational cost and increase convergence speed. Reference [22] proposed that re-incorporating these rules reduces the convergence speed, which is advantageous in pushing the algorithm out of stagnation. The effects of topologies on PSO were studied in [23]; the SPSO has a star topology, where every individual is connected to the other individuals such that information or the direction of search is communicated and implemented throughout the swarm. Observing that this topology allows too much communication between the swarm particles and may be responsible for the quick convergence and stagnation of the SPSO, Reference [24] investigated the circle, wheel, and random topologies, which isolate the individual particles of the swarm to different degrees so that information is communicated to the swarm by the focal individual or through short-cuts between the isolated groups, causing a buffering effect which reduces the convergence speed of the swarm and improves search results. In [25], it was shown that the success of individual particles is not the result of the particle with the best fitness alone but of the influence of the entire swarm to a certain degree. An algorithm was presented that allowed every particle to have a weighted influence on other particles based on their fitness, such that particles with higher fitness exerted more influence. References [26,27] proposed the combination of SPSO with other computational search methods, conjugate gradient and steepest descent respectively. These variations of PSO also aimed at slowing down the convergence speed by allowing the algorithm to stop and search for promising regions in the local space, while [28] successfully merged PSO with the artificial bee colony (ABC) algorithm for nonlinear statistical analysis.
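For illustration, the update rules in Equations (1)–(6) condense into a short loop. The following Python sketch is a minimal, generic SPSO under assumed bounds and swarm settings; the fitness function, bounds, and all names are illustrative, not the exact implementation used in this work:

```python
import numpy as np

def spso(fitness, dof, n_particles=200, max_iter=2000,
         w=0.7, c1=1.494, c2=1.494, lo=-np.pi, hi=np.pi):
    """Minimal SPSO loop following Equations (1)-(6)."""
    rng = np.random.default_rng()
    X = rng.uniform(lo, hi, (n_particles, dof))          # positions, Eq. (1)
    V = np.zeros((n_particles, dof))                     # velocities, Eq. (2)
    p_best = X.copy()                                    # PiBest, Eq. (5)
    p_val = np.array([fitness(x) for x in X])
    g_best = p_best[p_val.argmin()].copy()               # GiBest, Eq. (6)

    for _ in range(max_iter):
        r0, r1, r2 = rng.random((3, n_particles, dof))   # r_j in [0, 1]
        V = (w * r0 * V                                  # inertia term
             + c1 * r1 * (p_best - X)                    # cognitive term
             + c2 * r2 * (g_best - X))                   # social term, Eq. (3)
        X = X + V                                        # Eq. (4)
        vals = np.array([fitness(x) for x in X])
        better = vals < p_val
        p_best[better], p_val[better] = X[better], vals[better]
        g_best = p_best[p_val.argmin()].copy()
    return g_best, p_val.min()
```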

Swarms are best suited for analyzing static search spaces with one global solution (unimodal). In reality, most robot analysis problems contain more than one local solution (multimodal), and the search space is sometimes dynamic; therefore, maintaining diversity is crucial for the performance of PSO. Various variants of PSO involving sub-swarms have been developed for this purpose. PSO was combined with an expanding neighborhood topology in [29] to solve the permutation flow-shop scheduling problem: the algorithm is initiated with sub-swarms of small-size neighborhoods that slowly expand through every iteration, absorbing other particles and taking advantage of both the global and local neighborhood structures to increase the performance of the PSO algorithm. A competitive strategy was used in [30] to manage convergence, while entropy measurement was employed to maintain the diversity of the swarm. In [31], an adaptive multi-swarm competition PSO was proposed where the swarm is adaptively divided into sub-swarms and a competition mechanism is used to maintain diversity; the sub-swarms slowly converge, adaptively reducing the number of swarms while balancing exploration and exploitation tendencies. Other algorithms that employ sub-swarms include [32–35]. Although multi-swarm based algorithms were found to be efficient for solving multimodal problems, they have very high computational cost [33]. The newer trend of adaptive SPSO, where the inertia weight and acceleration components are altered during the search process, is capable of improving exploration and exploitation tendencies at a lower computational cost [36]. In [37–40], the parameters of the adaptive PSO were dependent on the quality of the solution and tailored to the specific problem; the parameters were updated by comparing the values of the best particles (*PiBest* and *GiBest*). This technique was found favorable for analyzing both static and dynamic search spaces without incurring too much computational cost.

#### *1.3. PSO and Robot Parameter Identification*

Over the years, the use of PSO for robot parameter identification has been researched. A comparison between the linear least squares (LLS) method and PSO was presented in [41] for the dynamic parameter identification of a 3DOF Staubli RX-60 robot manipulator, where PSO was found to produce better results. Reference [42] combined the linear simplification of the LLS and the non-linear optimization of the PSO for online and offline parameter identification of space robots. Space robots encounter changes in their kinematic parameters while running in orbit; the non-linear model is first used for parameter identification in the offline mode, while the LLS is used for online identification in the follow-up mission, knowing that the parameters would not change much. These works and other similar research exhibit the superiority of intelligent swarm-based techniques over traditional methods. A hybridized genetic algorithm and PSO (GAPSO) was implemented by [43] for parameter identification of a SCARA robot, [44] implemented a hybridized BPNN and PSO for determining the kinematic parameters of a 6DOF robot manipulator, and [45] investigated the performance of seven PSO variants in solving the inverse kinematics of 2DOF robots. In [46,47], a combination of PSO and simulated annealing (SA) was used to optimize the geometric structure of non-redundant 6DOF manipulators. In [48], the Elitist Learning Strategy PSO (ELS-PSO) was implemented for the dynamic parameter analysis of a 3DOF Staubli RX-60 robot manipulator. The Quantum-Behaved PSO (QPSO) was implemented for parameter identification of a PUMA 560 robot by [49] in two steps: first, the individual joint parameters are optimized so that the identified values are close to the theoretical values; then, all the joint parameters are further optimized simultaneously around the previously converged values. During the course of this research, it was observed that the solution of the robot kinematic parameter identification problem did not converge under the basic parameters of the PSO, especially when there were more than three degrees of freedom. Most published works implementing PSO for robot parameter identification either used lower-DOF robots or enhanced, modified, and hybridized the algorithms, usually for specific robot manipulators; these algorithms are often not applicable to other robot configurations. Therefore, the concept of a novel Mutating PSO (MuPSO) algorithm was conceived for analyzing robot kinematics. To the best of our knowledge, there has not been any research aimed at determining the best range of PSO parameters for developing an algorithm for robot analysis; this work therefore aims to develop a new PSO variant capable of analyzing all robot configurations and least likely to fall into stagnation. This was achieved by first studying the behavior of PSO under various robot configurations and determining a new range of parameters for robot kinematic analysis; then a suitable adaptive technique was investigated, and finally the mutation function was implemented. A total of 54 different PSO parameter sets were tested on 6 robot manipulators. The rest of this paper is organized as follows: Section 2 presents the kinematic models of the robot configurations to be studied and formulates the fitness function for the algorithm. Experiments studying the behavior of these robot configurations under various parameters are presented in Section 3, and the new adaptive strategy is presented in Section 4, comparing it with other variations of PSO. The results are presented in Section 5, the mutation function is introduced in Section 6, conclusions are drawn in Section 7, and Section 8 presents the future thrust.

#### **2. Kinematic Model of Robots**

To determine a new set of parameters for robot analysis, the behavior of four popular robot configurations was studied under various parameters. The robot configurations include the Articulate, Stanford, SCARA, and Dual-Arm configurations. These were implemented on six different robot manipulators; the articulate configuration was implemented on a 3DOF robot manipulator because it is regarded as the most complex spatial robot configuration. The articulate configuration was also implemented on two different 6DOF robot manipulators of different sizes to study the effect of size on the PSO parameters.

#### *2.1. Robot Configurations*

Industrial manipulators usually have 6DOF, as this gives the manipulator optimum dexterity, allowing it to complete most tasks in an industrial workspace. A robot with less than 6DOF is deficient: it is easier to analyze and control, but it cannot reach all the possible positions and orientations in its workspace. A redundant robot possesses more than 6DOF; such robots are more flexible and capable of maneuvering behind obstacles, but more expensive to analyze and control. The SCARA manipulator is an example of a deficient robot manipulator, articulate manipulators are usually 6DOF, while the dual-arm robot is a redundant manipulator (usually greater than 6DOF). It is worth noting that the presence of a prismatic joint in any robot configuration simplifies the analysis while complicating the solution: it requires less computation but, like redundant configurations, there is a possibility of numerous or even infinite solutions to every problem. The joints and end-effector of robots are always oriented along the z-axis of the coordinate frame.


The second, third, and fifth joints are parallel to each other and perpendicular to the first joint axis. The fourth and sixth joints are coincident and perpendicular to all the other joints. Three articulated manipulators were used in this analysis: one 3DOF articulated manipulator and two 6DOF articulated manipulators of significantly different sizes. The 3DOF articulate manipulator is configured exactly like the first three joints of the 6DOF articulate manipulator previously described. Figure 1b shows the 3DOF and Figure 2a the 6DOF robot configurations; their D-H parameters are tabulated in Tables 3–5.


**Figure 1.** (**a**) 4DOF SCARA arm. (**b**) 3DOF Articulate arm.


**Table 1.** D-H Parameters for 4DOF SCARA arm.

For all D-H parameters: <sup>1</sup> the joint displacement is Theta (θ); <sup>2</sup> the off-set displacement of the joint is Alpha (α); see Equations (7) and (8).




**Table 3.** D-H Parameters for 3DOF articulate arm.


**Table 5.** D-H parameters for large 6DOF articulate arm.


**Figure 2.** (**a**) 6DOF articulate arm. (**b**) 6DOF Stanford arm.

The fourth, sixth, and eighth joints are parallel to each other and perpendicular to the first joint, while the second joint is perpendicular to all the other joints. Figure 3 illustrates the dual-arm robot manipulator, while Table 6 shows its D-H parameters.

**Figure 3.** 17DOF dual-arm.


**Table 6.** D-H parameters for 17DOF dual-arm.

#### *2.2. Fitness Function*

The homogeneous matrix relating each successive pair of frames can be obtained from the D-H parameters using the formula in (7) below. The transformation matrix *T* for the robot manipulator's end-effector is a product of post-multiplications, as shown in (8).

$$T_{k-1}^{k} = \begin{bmatrix} \cos\theta_k & -\sin\theta_k & 0 & a_{k-1} \\ \sin\theta_k\cos\alpha_{k-1} & \cos\theta_k\cos\alpha_{k-1} & -\sin\alpha_{k-1} & -d_k\sin\alpha_{k-1} \\ \sin\theta_k\sin\alpha_{k-1} & \cos\theta_k\sin\alpha_{k-1} & \cos\alpha_{k-1} & d_k\cos\alpha_{k-1} \\ 0 & 0 & 0 & 1 \end{bmatrix},\tag{7}$$

$$T_0^{dof} = \begin{bmatrix} t_{11} & t_{12} & t_{13} & t_{14} \\ t_{21} & t_{22} & t_{23} & t_{24} \\ t_{31} & t_{32} & t_{33} & t_{34} \\ 0 & 0 & 0 & 1 \end{bmatrix} = T_0^1 T_1^2 \cdots T_{dof-1}^{dof},\tag{8}$$

$$A = \begin{bmatrix} -0.5161 & 0.2261 & 0.8262 & 349.5064 \\ 0.7185 & 0.6394 & 0.2739 & 1419.7 \\ -0.4663 & 0.7349 & -0.4924 & -516.5116 \\ 0 & 0 & 0 & 1 \end{bmatrix},\tag{9}$$

$$f_{ij} = \left| t_{ij} - a_{ij} \right|,\tag{10}$$

$$Fitness = \sum_{i=1}^{3}\sum_{j=1}^{4}\left(f_{ij} - E\right),\tag{11}$$

If the actual values of *T*, read from sensors attached to the robot's end-effector, are given in *A*, then the fitness function (*Fitness*) is as described in Equations (10) and (11), where *k* ∈ {1, 2, ..., dof}, the subscripts *i* ∈ {1, 2, 3} and *j* ∈ {1, 2, 3, 4}, and *E* = 1 × 10<sup>−8</sup>. Most robots are fitted with encoders, gyroscopes, and current controllers which can measure joint position, end-effector orientation, and actuator currents (torque), respectively. A variety of sensors can also be mounted manually on robots.
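As a sketch of how Equations (7), (8), (10), and (11) combine into the fitness evaluation, the following Python fragment assumes a modified D-H table ordered as (α<sub>k−1</sub>, a<sub>k−1</sub>, d<sub>k</sub>) with revolute joints whose variables are the θ<sub>k</sub> values; all names are illustrative, and prismatic joints would vary *d* instead:

```python
import numpy as np

def dh_transform(alpha, a, d, theta):
    """Homogeneous transform between successive frames, Equation (7)."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct,      -st,       0.0,  a],
                     [st * ca,  ct * ca, -sa,  -d * sa],
                     [st * sa,  ct * sa,  ca,   d * ca],
                     [0.0,      0.0,      0.0,  1.0]])

def robot_fitness(joints, dh_table, A, E=1e-8):
    """Equations (8), (10), (11): forward kinematics against measured pose A."""
    T = np.eye(4)
    for (alpha, a, d), theta in zip(dh_table, joints):
        T = T @ dh_transform(alpha, a, d, theta)   # post-multiplication, Eq. (8)
    f = np.abs(T[:3, :] - A[:3, :])                # f_ij, Eq. (10)
    return float(np.sum(f - E))                    # Fitness, Eq. (11)
```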

#### **3. Determining New PSO Parameters**

An experiment aimed at studying the behavior of all popular robot configurations under different PSO parameters was performed to identify the values of *w* and *c* that best balance exploration and exploitation tendencies while ensuring convergence of results, and to determine the solutions with the best computational efficiency. The initial value of *w* (*wi*) was set to 0.7, increasing with an interval of 0.4 to a final value (*wf*) of 1.5, such that *w* = 0.7:0.4:1.5. The initial value of *c* (*ci*) was set to 1.5, increasing with an interval of 0.3 to a final value (*cf*) of 3.9, such that *c* = 1.5:0.3:3.9, as elaborated in Figure 4. The experiment was then repeated for *c* = 1.4:0.6:2.6 and *w* = 0.6:0.3:3.0. The 6 robot configurations were tested with 54 sets of PSO parameters over 30 runs and 2000 iterations. The mutation function was not implemented in this experiment; the results were tabulated and the performance of the PSO plotted. In Table 7, the average and standard deviation of the best solutions after thirty runs and the solution that best minimizes the problem are presented, along with the average number of iterations required to find the best solution for each of the six robot manipulators, at *w* = 0.7. Tables 8 and 9 present a similar set of results for *w* at 1.1 and 1.5, respectively. To ease comparison and visualization of the results, a summary is presented in Table 10 showing the averages of the normalized values obtained in Tables 7–9, while Figure 5a–f shows a pictorial plot of the performance of PSO for each of the robot manipulators. Similarly, Tables 11–14 and Figure 6a–f present the results and plots for the second experiment. The minimum solution for each robot manipulator problem is reported in this analysis because the average solutions (after 30 runs) usually reported in other works of literature do not completely capture the results obtained from the experiment, especially in the SCARA, Stanford, and dual-arm configurations where there is a possibility of multiple solutions to every problem. It can be shown that the average best solution alone is not enough to make a good comparison between the different scenarios. If the minimum solution presented in the tables represents the probability of the given parameters finding the minimum solution, whereas the average best solution represents the probability of the solution running into stagnation, then it can be shown that some parameters with very competitive average solutions are incapable of finding the minimum (global best) solution.
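Schematically, the sweep in Figure 4 amounts to two nested parameter grids. A sketch, assuming the `spso` routine from the earlier fragment and a stand-in objective (replace `fitness_fn` with a robot fitness such as `robot_fitness`; all names are illustrative):

```python
import numpy as np

fitness_fn = lambda x: float(np.sum(x ** 2))  # stand-in objective for the sketch

runs, max_iter = 30, 2000
grids = [(np.arange(0.7, 1.51, 0.4), np.arange(1.5, 3.91, 0.3)),   # experiment 1
         (np.arange(0.6, 3.01, 0.3), np.arange(1.4, 2.61, 0.6))]   # experiment 2

for ws, cs in grids:                      # 27 + 27 = 54 parameter sets
    for w in ws:
        for c in cs:
            best = [spso(fitness_fn, dof=6, max_iter=max_iter,
                         w=w, c1=c, c2=c)[1] for _ in range(runs)]
            print(f"w={w:.1f} c={c:.1f}  mean={np.mean(best):.3e}  "
                  f"std={np.std(best):.3e}  min={np.min(best):.3e}")
```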

**Figure 4.** Flowchart of parameter selection experiment for the proposed PSO variant elaborating the end conditions and mutation criterion.



**Table 8.** Performance of PSO when *w* = 1.1.








**Table 12.** Performance of PSO when *c* = 2.0.




**Table 14.** Aggregate performance of PSO for different values of the learning coefficient (*c*).



**Figure 5.** Performance of PSO for different values of *w* (**a**) 3DOF articulate, (**b**) 4DOF SCARA, (**c**) 6DOF Stanford, (**d**) small 6DOF articulate, (**e**) big 6DOF articulate, (**f**) 17DOF dual-arm.

**Figure 6.** Performance of PSO for different values of *c* for (**a**) 3DOF articulate, (**b**) 4DOF SCARA, (**c**) 6DOF Stanford, (**d**) 6DOF articulate, (**e**) larger 6DOF articulate, (**f**) 17DOF dual-arm.

Meanwhile, some other parameters with very poor average best solutions are capable of finding the minimum solution. For example, the results for the SCARA robot configuration in Table 11 show that when *c* is 1.4 and *w* is between 2.4–2.7, the average best solution dominates the results obtained when *w* was between 1.2–1.5, yet the former cannot find the global minimum solution, and many more instances can be cited. Accuracy is very important in robot analysis; therefore, the best solution should always be able to find the global minimum solution, followed by the solution that can find the global minimum at least once. Algorithms that may produce competitive averages yet are incapable of locating the global minimum are regarded as poor solutions. Therefore, the minimum solution achieved by every pair of parameters was reported in the tables as the minimum solution. The aggregate performance of the PSO presented in Tables 10 and 14 is an average of the normalized values obtained in the experiments, such that variables with higher normalized values have better performance. A penalty was also introduced when computing the average of the normalized values: a binary probability distribution was used to replace the normalized minimum solution, such that solutions capable of finding the global minimum solution were assigned the value 1 while other, undesirable results were assigned the value 0.

#### *3.1. Observations*

For the 3DOF Articulate robot configuration in Table 10, a random-like fluctuation in the performance of the PSO can be observed. When *w* was at 0.7, the best result was achieved at *c* = 1.8; when *w* was increased to 1.1, the best results occurred at around *c* = 2.7; and when *w* was further increased to 1.5, the best result occurred at *c* = 1.5. However, Figure 5a shows that the performance of the PSO increases with increasing *w* and decreasing learning coefficient (*c*).

The performance of the PSO is somewhat stable when *c* is between 1.8–2.4, and *w* at 1.5 was found to be dominant. On the other hand, it can be seen from Table 14 that the best result obtained for the PSO was when *w* was 1.2 and *c* was 1.4; when *c* was increased to 2.6, the best result was observed at *w* = 0.9, which suggests that an improved result was achieved with an increasing *c* and a decreasing *w*. Figure 6a shows that the performance of the PSO algorithm deteriorates with increasing *c*. From Table 14 and Figure 6a, it can also be observed that the algorithm was stable when *w* was between 0.6 and 1.2, with *c* = 1.4 being dominant.

From Table 7, it can be seen that when *w* = 0.7 and *c* < 2.7, the PSO algorithm was incapable of finding the global minimum solution for the higher-DOF robots; likewise in Table 12 when *c* = 2.0 and *w* < 1.2. These observations confirm that the standard PSO (SPSO) is only capable of analyzing robot manipulator configurations with lower degrees of freedom, because the dominant solutions are within the range of the SPSO; as the DOF increases, the dominant solutions deviate from the range of the SPSO.

In the 4DOF SCARA robot configuration, it can be observed from Table 10 that when *w* was equal to 0.7, the best result was obtained at *c* = 2.1; when *w* was increased and *c* decreased, the performance of the PSO was seen to increase, as made evident in Figure 5b. The algorithm can be seen to be stable when *c* is between 1.8–2.4, with *w* at 1.5 dominating other solutions. Likewise, from Table 14 and Figure 6b, although it can be observed that the algorithm was stable when *w* was between 0.6–1.8, the best solution was recorded when *w* was between 0.9–1.2. In the initial stages, when *w* was small, *c* = 1.4 was dominant, but as *w* increased, *c* = 2.0 became the dominant solution.

From Table 10, it can be observed that the best result obtained for the 6DOF Stanford robot occurred when *w* = 0.7 and *c* = 3.9. When *w* was increased to 1.5, the best result was obtained with a decreasing *c* at 1.5. From Figure 5c, the performance of the algorithm can also be observed to deteriorate with increasing *w*. The algorithm was stable when *c* was between 1.8–3.6, with *w* = 1.5 being the dominant solution when *c* was small; then *w* = 0.7 becomes dominant when *c* increases beyond 2.4. In Table 14, when *c* = 1.4, the best result for the 6DOF Stanford manipulator was obtained at *w* = 1.5; this value decreases slightly with an increase in *c*, such that at *c* = 2.6 the best result occurred at *w* = 1.2. The algorithm was found stable between *w* = 0.6–1.5, and the solutions of *c* = 2.0 and *c* = 2.6 can be seen to compete for dominance. The dominant solution would lie between *c* = 2.0–2.6.

The trend of decreasing inertia weight (*w*) and increasing learning coefficient (*c*) continues for the 6DOF Articulate manipulator configurations (both small and big). It can be seen from Table 10 that the best results were obtained at *c* = 3.6 when *w* was 0.7 for both configurations; this value reduced to *c* = 1.8 and *c* = 2.7 for the small and big manipulators respectively when *w* increased to 1.5. From Figure 5d,e, it can be observed that the algorithm remains stable when *c* was between 2.7–3.9 for both configurations, with *w* = 0.7 being the dominant solution. It can also be observed from both plots that the performance of the algorithm deteriorates with increasing *w*. An even stronger correlation is observed from Table 14, where the best results occurred at *w* = 1.5 for all values of *c*, with a slightly decreasing *w* observed in the larger Articulated robot configuration. From Figure 6d,e, it can be seen that the algorithm is stable when *w* was between 0.9–1.8, with the dominant solution of *c* lying between 2.0–2.6. It is therefore safe to deduce that the size of the robot manipulator has little effect on the PSO solution, so any PSO algorithm that can analyze a given configuration is most likely able to analyze the different sizes within the same configuration.

From Table 8, when *w* = 1.1, fluctuating results were recorded in the dual-arm configuration at *c* = 2.3 and *c* = 3.3, which is believed to be a result of stagnation in the algorithm.

Similar results were recorded in Table 13 when *c* = 2.6 at *w* = 1.5 and *w* = 2.4, and also in Table 12 when *c* = 2.0 at *w* = 0.6. Otherwise, in Table 10, the best results were obtained at *c* = 3.0 for *w* = 0.7, and at a decreased value of *c* = 1.5 when *w* increased to 1.5. Figure 5f shows that the performance of the algorithm deteriorates with an increasing *w*. Although the algorithm remained consistent for almost all values of *c*, a sharp deterioration can be observed when *c* was 1.5–2.1; *w* = 0.7 is the dominant solution.

In Table 14, the best results were obtained when *c* = 1.4 at *w* = 1.8; the best results were further observed to occur at a decreased *w* and increased *c*, such that when *c* = 2.0 the best result was obtained at *w* = 0.9, and when *c* = 2.6 the best result was obtained at *w* = 0.6, which also supports the theory of decreasing *w* with an increasing *c*. From Figure 6f, the performance was stable when *w* was between 0.9–2.1, with *c* = 2.0 being the dominant solution.

#### *3.2. Deductions*

The performance of the PSO algorithm was seen to improve with a decreasing inertia weight (*w*) and an increasing learning coefficient (*c*) in all the robot manipulator configurations except the 3DOF Articulate manipulator. Fortunately, our analysis is more concerned with higher-DOF robot manipulators; therefore, techniques for decreasing *w* and increasing *c* shall be investigated. From the observations above, for all robot manipulator configurations the optimal *w* was 0.6–2.1 while *c* was 1.8–3.9. These best values were plotted and a fitted curve generated, which shall be used to determine a suitable adaptive technique in the next section.

#### **4. Adaptive Computation Technique**

In robot analysis, the multi-swarm variations of the PSO are best suited for trajectory analysis, especially in mobile robots where the robot is required to track or follow a moving target. Kinematic and dynamic analyses of industrial robots generally have static search spaces, so this solution is not suitable considering the increased computational cost. The variation of the adaptive PSO which is dependent on the best solution (*PBest* or *GBest*) seems most promising, as demonstrated in [45], but it requires knowledge of an established range of values for *w* and *c* that ensures exploitation and exploration. It was previously observed that the robot optimization problem does not converge under the basic parameters of most known EA; this research therefore aimed at establishing a new set of parameters that ensure convergence. As such, the time-dependent variation of the adaptive PSO shall be implemented for this analysis. Several time-varying algorithms have been reported in the literature, a few of which shall be implemented in the foregoing experiment; a total of 13 distinct PSO algorithms shall be used.

• PSO1: The linear decreasing inertia weight was reported in [40], where the inertia weight (*w*) decreases linearly from 0.9 to 0.4; the governing equation for updating *w* is

$$w_{iter} = w_{max} - \frac{w_{max} - w_{min}}{t_{max}} \times t_{iter},\tag{12}$$

$$c\_{iter} = \text{2.05},\tag{13}$$

where *wmax* and *wmin* are the initial and final values of the inertia weight, *tmax* is the maximum number of iterations, and *titer* is the current iteration.

• PSO2–3: A non-linear decreasing inertia weight was also reported in [50], with *w* decreasing non-linearly from 0.9 to 0.4. The governing equation for updating the inertia weight is

$$w\_{iter} = \frac{(t\_{\text{max}} - t\_{iter})^n}{t\_{\text{max}}^n} \times (w\_{initial} - w\_{final}) + w\_{final} \tag{14}$$

Observe that when *n* = 1, the inertia weight decreases linearly, as shown in Figure 7a. Here *n* is a constant ranging from 0.9 to 1.3; the value *n* = 1.2 was reported as the recommended value in [50], but *n* = 3 was found to be more suitable for robot analysis. Therefore, the results for the two values *n* = 1.2 and *n* = 3.0 shall be presented in this experiment as PSO2 and PSO3, respectively.

• PSO4: A novel non-linear decreasing *w* and non-linear increasing *c* is hereby proposed. The parameters recorded for the best performance of PSO in the previous experiment were plotted and a fitted curve generated, as shown in Figure 8. The non-linear technique presented in (15)–(17) exploits the experimental range of values for *w* and *c*, where *n* and *m* are problem-dependent variables.

$$w\_{iter} = w\_{initial} \times n^{iter},\tag{15}$$

$$c\_1 = \text{2.24},\tag{16}$$

$$c\_2 = \frac{c\_{initial}}{m^{iter}}.\tag{17}$$

If the maximum number of iterations is 3000, then the values of the coefficients *n* and *m* can be easily determined, as worked out below. The parameter *w* in PSO4 shall be updated according to Equation (15) while *c1* and *c2* are updated according to Equation (16); since the learning coefficients are not adaptive, a reduced computational cost is achieved. The parameters in PSO11 shall be updated according to Equations (15)–(17) as originally proposed.
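For example, under the ranges adopted later in this section (*w* decreasing from 2.1 to 0.6 and *c2* increasing from 1.8 to 3.9) and *tmax* = 3000, requiring *w*(*tmax*) = *wfinal* in Equation (15) and *c2*(*tmax*) = *cfinal* in Equation (17) gives

$$n = \left(\frac{w_{final}}{w_{initial}}\right)^{1/t_{max}} = \left(\frac{0.6}{2.1}\right)^{1/3000} \approx 0.99958, \qquad m = \left(\frac{c_{initial}}{c_{final}}\right)^{1/t_{max}} = \left(\frac{1.8}{3.9}\right)^{1/3000} \approx 0.99974.$$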

• PSO5–6: The concept of multi-stage decreasing inertia weight was introduced in [51], where *w* was decreased linearly from 0.9 to 0.4 in three distinct stages. The inertia weight first decreases from the initial value to a predetermined value *wm* where it remains constant for a while before decreasing further to the final value. As shown in Figure 7b, five different scenarios were presented, and the governing equation for updating the value of the inertia weight in each of the scenarios is given in Equations (18)–(22)

$$t_1 = \begin{bmatrix} \frac{1}{5}t_{max} & \frac{2}{5}t_{max} & \frac{1}{5}t_{max} & \frac{2}{5}t_{max} & \frac{1}{5}t_{max} & \frac{2}{5}t_{max} \end{bmatrix},\tag{18}$$

$$t_2 = \begin{bmatrix} \frac{4}{5}t_{max} & \frac{3}{5}t_{max} & \frac{4}{5}t_{max} & \frac{3}{5}t_{max} & \frac{4}{5}t_{max} & \frac{3}{5}t_{max} \end{bmatrix},\tag{19}$$

$$w_n = \begin{bmatrix} \frac{4(w_{max} - w_{min})}{5} + w_{min}, & \frac{2.5(w_{max} - w_{min})}{5} + w_{min}, & \frac{(w_{max} - w_{min})}{5} + w_{min} \end{bmatrix},\tag{20}$$

$$w_m = \begin{bmatrix} w_n(1) & w_n(1) & w_n(2) & w_n(2) & w_n(3) & w_n(3) \end{bmatrix},\tag{21}$$

$$w(i) = \begin{cases} (w_s - w_m(i))\,(t_1(i) - t)/t_1(i) + w_m(i) & 0 \le t \le t_1(i) \\ w_m(i) & t_1(i) < t \le t_2(i) \\ (w_m(i) - w_e)\,(t_{max} - t)/(t_{max} - t_2(i)) + w_e & t_2(i) < t \le t_{max} \end{cases}\tag{22}$$

where *ws* and *we* are the initial and final values of the inertia weight, respectively.

The parameters of MLDIW5 were recommended in [51] for the inertia weight, but the parameters of MLDIW3 were found to be more suitable for robot analysis. The results for the two variants MLDIW5 and MLDIW3 shall be presented as PSO5 and PSO6, respectively.

• PSO7: All the aforementioned algorithms exploited only the inertia weight, leaving the learning factor constant at 2.05. In [52] and [36], a linear decreasing and a linear increasing inertia weight were proposed, respectively, both with a decreasing cognitive component and an increasing social component, so that these techniques exploit both the inertia weight and the learning coefficients. The technique reported in [52] shall be utilized in this experiment as it incorporates a linear decreasing inertia weight, in line with our objectives; its inertia weight is updated according to Equation (12) above, while the cognitive and social components are updated as

$$c_{cognitive} = (c_{initial} - c_{final})\frac{t_{iter}}{t_{max}} + c_{final},\tag{23}$$

$$c_{social} = (c_{final} - c_{initial})\frac{t_{iter}}{t_{max}} + c_{initial},\tag{24}$$

The inertia weight of PSO8–13 shall be updated according to the equations for PSO1–6 respectively, while the learning coefficients shall be updated according to Equations (16) and (17). For the sake of fair comparison, the adaptive values of *w* for all the aforementioned techniques shall decrease from the initial value of 2.1 to a final value of 0.6, the cognitive component remains at 2.24 while the social component is nonlinearly increasing from 1.8 to 3.9.
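A sketch of the resulting per-iteration schedule for the PSO8–13 family, assuming the values just stated (the function name and defaults are illustrative, not the paper's implementation):

```python
def adaptive_params(t_iter, t_max=3000, w0=2.1, wf=0.6,
                    c2_0=1.8, c2_f=3.9, c1=2.24):
    """Time-varying coefficients per Equations (15)-(17)."""
    n = (wf / w0) ** (1.0 / t_max)        # decay base so that w(t_max) = wf
    m = (c2_0 / c2_f) ** (1.0 / t_max)    # m < 1, so c2 grows towards c2_f
    w = w0 * n ** t_iter                  # Eq. (15), non-linear decrease
    c2 = c2_0 / m ** t_iter               # Eq. (17), non-linear increase
    return w, c1, c2                      # c1 fixed at 2.24, Eq. (16)
```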

**Figure 7.** (**a**) Plot showing the rate of change of inertia weight at different values of n (**b**) plot showing various techniques for multi-staged decreasing inertia weights.

**Figure 8.** Fitted curve of PSO parameters.

#### **5. Results**

For this experiment, the swarm size was again maintained at 200; a total of 13 adaptive PSO techniques were tested on all six robot manipulators, with the maximum number of iterations for each run set at 3000 and a total of 30 runs each. As in the previous tables, the solution that best minimizes the problem (global minimum) is presented in Table 15 for every pair of robot configuration and PSO technique; the average and standard deviation of the best solution after 30 runs and the average number of iterations required to find the best solution are also presented. These values were normalized, summed, averaged, and then ranked (sorted) such that the solution with the lowest rank possesses the best performance. The ranking of all tested PSO algorithms is presented in Table 16. The first column of Table 16 lists the PSO techniques according to their overall ranks. The second to the seventh columns of Table 16 show the individual ranks of all the PSO techniques against the six robot manipulators; in the second column, under the 3DOF Articulate robot configuration, PSO5 has the best solution for that particular robot manipulator, followed by PSO6, then PSO10 with the third-best solution, etc. The last column of Table 16 is the sum of all ranks presented in columns 2–7; PSO13, having the lowest total rank, is considered the best result overall. During the course of the experiment, it was observed that:






**Table 16.** Ranking of the various adaptive PSO techniques.
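The normalize-sum-average-rank step described above can be sketched as follows, assuming the per-algorithm metrics for one robot are collected in a NumPy array where lower raw values are better (the array layout and names are illustrative):

```python
import numpy as np

def rank_algorithms(metrics):
    """metrics: (n_algorithms, n_metrics) array, lower raw value = better."""
    lo, hi = metrics.min(axis=0), metrics.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    norm = (metrics - lo) / span            # min-max normalize each column
    score = norm.mean(axis=1)               # sum and average across metrics
    return score.argsort().argsort() + 1    # rank 1 = best performance
```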

#### **6. Mutation Function**

Structural bias in population-based algorithms is a characteristic that confines the search to a constrained region of the search space. Replacing randomly selected samples with newly generated random samples enhances unbiased coverage of the search space [53]. In the mutating PSO algorithm, when the algorithm stagnates at a local optimum solution, a mutation operation is used to replace the swarm with new samples. The mutation function is an artificial perturbation of the system used to push the algorithm out of stagnation. The mutation probability was set at 100% because, if any solution from the previous iteration were to remain, that solution would be the global best solution for the next iteration; the entire swarm would then converge on that solution, causing a recurring stagnation cycle. Four variables and two end conditions were introduced to train the algorithm to identify a stagnating solution. When the two conditions are satisfied, the algorithm terminates the iteration, signifying that the actual solution has been identified; if only one condition is satisfied, it signifies that the algorithm has run into stagnation and the mutation function is initiated. The abandonment threshold (*E*) is the global minimum solution. The *Fitness error* (*e*) is the difference between the current *Fitness* and the previous *Fitness*, as elaborated in Equation (25), and the abandonment counter (*q*) monitors the second differential of the *Fitness error*. When the second differential of the *Fitness error* becomes smaller than *E*, the algorithm is assumed to have slowed down; therefore, the condition in Equation (26) states that when the difference in *e* is less than *E*, *q* counts up consecutively through every iteration, and if the condition in (26) is broken, the counter *q* is reset to zero.

$$e = Fitness_{t-1} - Fitness_t,\tag{25}$$

$$q = \begin{cases} q+1, & \text{if } (e_{t-1} - e_t) < E \\ 0, & \text{otherwise} \end{cases}\tag{26}$$

$$f(\text{MuPSO}) = \begin{cases} \text{end} & \text{if } q \ge Q \text{ and } Fitness \le E \\ \text{mutate} & \text{if } q \ge Q \text{ and } Fitness > E \end{cases}\tag{27}$$

The two end conditions in Equation (27) state that when *q* is equal to or greater than the abandonment limit (*Q*) and the *Fitness* is less than or equal to *E* (1 × 10<sup>−8</sup>), the algorithm has found the global minimum solution and should be terminated; when only the first condition is met, the algorithm has run into stagnation. *Q* should be large enough not to prematurely terminate a solution, allowing the algorithm to break out of stagnation, but must also not be so large as to allow a failed solution to continue. Table 17 shows the performance of a few variants of PSO modified with the mutating operator. The proposed mutating PSO algorithm is presented in the first column, followed by the basic PSO (*w* = 1.0 and *c* = 2.05). The MLDIW-PSO with the enhanced parameters is in the third column as Mu-MLDIW, while the basic MLDIW (*w* = 0.9–0.4) is in the fourth. Likewise, the NLDIW-PSO with enhanced parameters is presented in the fifth column as Mu-NLDIW, and the basic NLDIW is in the last column. All these algorithms were further enhanced with the mutating operation.
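A sketch of the stagnation test in Equations (25)–(27), where the abandonment limit Q = 50 is an assumed placeholder rather than the value used in the experiments:

```python
def check_stagnation(fitness_hist, q, E=1e-8, Q=50):
    """Equations (25)-(27): count slow iterations, then end or mutate.
    Q = 50 is an illustrative abandonment limit, not the paper's value."""
    e_prev = fitness_hist[-3] - fitness_hist[-2]   # e at t-1, Eq. (25)
    e_curr = fitness_hist[-2] - fitness_hist[-1]   # e at t,   Eq. (25)
    q = q + 1 if (e_prev - e_curr) < E else 0      # Eq. (26)
    if q >= Q:
        return q, "end" if fitness_hist[-1] <= E else "mutate"   # Eq. (27)
    return q, "continue"
```

When the verdict is "mutate", the entire swarm is re-randomized (the 100% mutation probability described above) and the search continues.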

Observe that all the basic PSO algorithms were able to analyze the 3DOF articulate and 4DOF SCARA manipulators, but unable to find the minimum solution for the higher-DOF manipulators. At lower DOFs, although the basic PSO algorithms gave better results, the results from the proposed MuPSO are sufficient for robot analysis, as seen in Table 18, where the joint parameters of the robots are identified with an accuracy of three decimal places.

Converging results were not obtained for the Stanford and the Dual-arm configurations. In a real experiment, the robot would be required to move from a known initial position and orientation *Ti* to a final destination *Tf*; therefore, introducing more constraints, such as minimizing the distance traveled by the joints or the energy consumption of the joints, may improve the convergence of prismatic and redundant configurations.





In the ideal parameters' row, the values in parenthesis represent prismatic joints where applicable.

#### **7. Conclusions**

Research into developing an intelligent swarm-based algorithm for robot analysis and parameter identification was proposed. Experiments were performed to study the behavior of four popular robot configurations under various PSO parameters, and two anomalies were identified from the experimental results and successfully resolved. The anomalies were capable of masking poor solutions as good solutions. In this experiment, all the strong local minimizers were identified; the 3DOF articulate configuration has only one strong local minimizer, with Fitness = 2.0 at a position vector of [80, −40, −60]<sup>T</sup>. The 4DOF SCARA configuration also has only one local minimizer, at Fitness = 4.0 with an infinite number of possible position vectors, while the other higher-DOF robot configurations have at least five minimizers each. The minimizers are very sensitive; they are shifted by the slightest change in parameters, and it is therefore almost impossible to identify the stagnation points in real time. An average solution is capable of breaking out of a weak local minimizer, but even the best solutions are helpless in the vicinity of a strong local minimizer. This is the basis for introducing the mutation function: to help break algorithms out of stagnation.

Since the algorithm can be taught to identify stagnating solutions, the best solutions either find the global minimizer or stagnate at a local minimum, but do not linger without a solution. The two anomalies observed were capable of disguising poor solutions and confusing the algorithm; therefore, two penalties were introduced to help unmask poor solutions while distinctively identifying the best solutions. Some correlations were observed between the robot configurations and the various PSO parameters; a non-linear decreasing inertia weight and a non-linear increasing correction factor were adopted based on the experimental results. A new range of adaptive parameters was identified and implemented on the PSO algorithm. The algorithm was found to be capable of solving the robot kinematic problem for all four robot configurations. Algorithms from other works of literature were also modified with the newly identified adaptive parameters and compared with the proposed algorithm for solving robot kinematic problems. The proposed algorithm was found to dominate the other algorithms reported in the literature, succumbing only to the modified MLDIW-PSO, which had the best overall performance, surpassing the runner-up by a large margin, while the modified NLDIW and the proposed MuPSO algorithm closely contested the second position. More emphasis is on higher-DOF configurations; therefore, if the lower-DOF manipulators are ignored, the MuPSO would completely dominate the NLDIW-PSO in PSO10.

#### **8. Future Thrust**

The future aspiration of this work is to implement the algorithm in dynamic parameter identification of these robot manipulators, and also to compare the performance of the proposed algorithm with other metaheuristics on standard benchmark functions. Testing the algorithm on benchmark functions would hopefully shed more light on the complex phenomena of modeling and control of non-linear dynamic systems. The algorithm described herein utilizes a time-dependent adaptive technique; a solution-dependent (*PBest*- or *GBest*-based) adaptive technique seems more promising, with better maneuverability. Therefore, since the range of parameters which ensures convergence of the robot dynamic problem has been established, it would be worthwhile to investigate a solution-dependent algorithm for robot analysis. It has been established that even the best solution runs into stagnation; studying the initial conditions of the randomly populated swarm may give more insight into early identification of stagnating solutions so that the algorithm can be trained to avoid them completely.

**Author Contributions:** Conceptualization, A.U. and Z.I.B.F.; Methodology, A.U. and Z.I.B.F.; Software, A.U.; Validation, A.U., Z.I.B.F., and A.K.; Formal analysis, A.U.; Investigation, A.U.; Resources, A.K.; Data curation, A.U. and A.K.; Writing—original draft preparation, A.U.; Writing—review and editing, A.U. and Z.S.; Visualization, A.U., Z.S., and A.K.; Supervision, Z.S.; Project administration, Z.S.; Funding acquisition, Z.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Natural Science Foundation of Hebei Province of China, grant number F2017202243.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **An Improved Bytewise Approximate Matching Algorithm Suitable for Files of Dissimilar Sizes**

#### **Víctor Gayoso Martínez, Fernando Hernández-Álvarez and Luis Hernández Encinas \***

Institute of Physical and Information Technologies (ITEFI), Spanish National Research Council (CSIC), Serrano 144, 28034 Madrid, Spain; victor.gayoso@iec.csic.es (V.G.M.); fernando.hernandez@iec.csic.es (F.H.-Á.)

**\*** Correspondence: luis@iec.csic.es; Tel.: +34-91-561-88-06

Received: 21 February 2020; Accepted: 30 March 2020; Published: 2 April 2020

**Abstract:** The goal of digital forensics is to recover and investigate pieces of data found on digital devices, analysing in the process their relationship with other fragments of data from the same device or from different ones. Approximate matching functions, also called similarity preserving or fuzzy hashing functions, try to achieve that goal by comparing files and determining their resemblance. In this regard, ssdeep, sdhash, and LZJD are nowadays some of the best-known functions dealing with this problem. However, even though those applications are useful and trustworthy, they also have important limitations (mainly, the inability to compare files of very different sizes in the case of ssdeep and LZJD, the excessive size of sdhash and LZJD signatures, and the occasional scarce relationship between the comparison score obtained and the actual content of the files when using the three applications). In this article, we propose a new signature generation procedure and an algorithm for comparing two files through their digital signatures. Although our design is based on ssdeep, it improves some of its limitations and satisfies the requirements that approximate matching applications should fulfil. Through a set of ad-hoc and standard tests based on the FRASH framework, it is possible to state that the proposed algorithm presents remarkable overall detection strengths and is suitable for comparing files of very different sizes. A full description of the multi-thread implementation of the algorithm is included, along with all the tests employed for comparing this proposal with ssdeep, sdhash, and LZJD.

**Keywords:** approximate matching; context-triggered piecewise hashing; edit distance; fuzzy hashing; LZJD; multi-thread programming; sdhash; signatures; similarity detection; ssdeep

#### **1. Introduction**

Digital forensics is the branch of Mathematics and Computer Science in charge of identifying, recovering, analysing, and providing conclusions about digital evidence found on electronic devices. When inspecting the content of a computer or a mobile phone, the first step for the digital forensics expert is to reduce all the data available, extracting the important information that can be analysed more efficiently [1].

An initial strategy to obtain that reduction consists in using cryptographic hashing functions such as MD5 or SHA-1. Even though those algorithms are not recommended for cryptographic purposes, they are still valid for determining if two files are the same, considering that the probability for two files to have the same hash value is negligible. Precisely due to this reason, NIST (National Institute of Standards and Technology) developed a database in the early 2000s, called NSRL (National Software Reference Library), which contains hash values of sets of files of several trusted operating systems [2]. Nevertheless, these cryptographic hashing functions face a common problem when comparing files. If one of the files is modified even just in one byte, the final outcome of the comparison is negative.
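The one-byte sensitivity can be illustrated with Python's standard hashlib and the classic "quick brown fox" strings; the two SHA-1 digests share no visible structure despite the inputs differing in a single byte:

```python
import hashlib

a = b"The quick brown fox jumps over the lazy dog"
b = b"The quick brown fox jumps over the lazy cog"   # one byte changed

print(hashlib.sha1(a).hexdigest())   # 2fd4e1c67a2d28fced849ee1bb76e7391b93eb12
print(hashlib.sha1(b).hexdigest())   # de9f2c7fd25e1b3afad3e85a0bd17d9b100db4b3
```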

In contrast to cryptographic hashing functions, approximate matching functions [3], also known as similarity preserving hashing (SPH) or fuzzy hashing functions, try to detect the resemblance between two files by linking similar inputs to similar outputs, which in this context are called interchangeably similarity signatures, fingerprints, or digests [3]. These functions, which analyse files at the byte level, are useful for comparing a large variety of data and detecting similar texts and even embedded objects (e.g., an image in a Word or OpenDocument text file) or binary fragments (e.g., a virus inside a file, a specific data packet in a network connection, or similar content in audio files). By using a diversity of methods (for example, computations with a sliding window, as will be described in the next sections), approximate matching functions are able to detect the resemblance between files even when only a single byte has been changed.

In computer forensics, ssdeep is the best-known bytewise approximate matching application, and it is considered by some researchers as the de facto standard in some cybersecurity areas [4]. The implementation of ssdeep allows users to generate the signature of files (where the generated signature depends on the actual content of the file) and to compare two signature files or one signature file with a data file. The results in both cases are the same; using one method or the other depends on the data available to the user. However, even though ssdeep represented an important achievement in similarity detection techniques, and it is relatively up to date (at the time of writing this text, the latest version is 2.14.1, released in November 2017 [5]), during the last years several researchers have highlighted some of its limitations, publishing different enhancements or alternative theoretical approaches (see, for example, References [6–13]).

In this article we have delved into the most important limitations of ssdeep, sdhash, and LZJD, which are mainly the impossibility of comparing files of very different sizes in the case of ssdeep and LZJD, the scarce relationship in some cases between the comparison score obtained and the actual content of the files when using the three applications, and the excessive size of sdhash and LZJD signatures. As a result of our research, we are presenting in this contribution a signature generation procedure based on the one implemented in ssdeep but that overcomes its design limitations, together with an easy-to-implement algorithm for comparing the content of two files through their digital signatures. We can confirm that, based on the list of requirements that, in our opinion, any similarity search function should fulfil, our algorithm provides results better adjusted to different situations than ssdeep and is able to compare any pair of files regardless of their respective sizes. Besides, our proposal can manage signatures that represent certain special cases, including different content swapping schemes.

In order to obtain meaningful results, in addition to ssdeep we have also considered sdhash and LZJD in our tests. sdhash is also very popular and is representative of a completely different theoretical approach in approximate matching algorithms, as it uses Bloom filters in order to find the features that have the lowest empirical probability of being encountered by chance [14]. On the other hand, LZJD is a recent alternative that uses the Lempel-Ziv Jaccard distance [15]. As a result of the comparison with sdhash and LZJD, we can state that our algorithm provides better results regarding the recognition of the proportion of a file which is present in another file while generating significantly smaller signatures in almost all the cases where those two algorithms are used.

This article represents an improved version of the information presented as a PhD thesis by one of the authors in 2015 [16], where SiSe (Similarity Search) was first described (this thesis has not been published but it is available as open access at http://oa.upm.es/39099/). Compared to that work, this contribution features a different signature comparison algorithm and new tests, some of them using FRASH (Framework to test Algorithms of Similarity Hashing), a reputed tool designed for comparing this type of applications. As another novelty, this article includes the design details of the multi-threading C++ implementation of the proposed algorithm (allowing researchers to inspect the code of our implementation, uploaded to GitHub [17]), compares the application to the latest version of ssdeep (which in the last years has evolved and fine-tuned its capabilities), and broadens the comparative spectrum by adding LZJD to the test set.

The rest of this paper is organised as follows: Section 2 discusses the related work. Section 3 reviews the most significant features of ssdeep. In Section 4, we provide a complete description of our proposed algorithm. Section 5 describes the C++ implementation of SiSe, including the multi-thread design. Section 6 contains the ad-hoc tests that we have performed with a selected group of files in order to compare the similarity detection capabilities of our proposal to those of ssdeep, sdhash, and LZJD. Section 7 contains the tests performed with the FRASH framework and a well-known dataset. Finally, Section 8 summarises our conclusions about this topic.

#### **2. Related Work**

SPH functions can be divided into four categories [18]: block-based hashing (BBH) functions, context-triggered piecewise hashing (CTPH) functions, statistically-improbable features (SIF) functions, and block-based rebuilding (BBR) functions, as described in the following paragraphs.

BBH functions generate and store cryptographic hashes for every block of a chosen fixed size (e.g., 512 bytes). In the next step, the block-level hashes from two different inputs are compared, counting the number of common blocks and giving a measure of similarity between them. An implementation of this kind of fuzzy function was developed by Nicholas Harbour, who created a program called dcfldd [19]. This function divides the input data into several blocks of a fixed length and calculates the corresponding cryptographic hash value for each of them. Even though this approach is very efficient from a computational point of view due to its simplicity, a single byte insertion or deletion at the beginning of a file could change all the block hashes, making this scheme too vulnerable.
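As an illustration of the BBH idea, consider the following minimal C++ sketch (not dcfldd's actual code: std::hash stands in here for the cryptographic hash a real tool would use, and 512-byte blocks are assumed):

```
// Illustrative BBH sketch: hash fixed 512-byte blocks of each file and score
// the similarity as the fraction of block hashes the two files have in common.
#include <algorithm>
#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>
#include <vector>

static std::vector<std::size_t> blockHashes(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<std::size_t> hashes;
    std::string block(512, '\0');
    while (in.read(&block[0], 512) || in.gcount() > 0) {
        block.resize(static_cast<std::size_t>(in.gcount()));  // last block may be short
        hashes.push_back(std::hash<std::string>{}(block));
        block.resize(512);
    }
    return hashes;
}

double bbhScore(const std::string& file1, const std::string& file2) {
    std::vector<std::size_t> h1 = blockHashes(file1), h2 = blockHashes(file2);
    if (h1.empty() || h2.empty()) return 0.0;
    std::unordered_multiset<std::size_t> pool(h1.begin(), h1.end());
    std::size_t common = 0;
    for (std::size_t h : h2) {
        auto it = pool.find(h);
        if (it != pool.end()) { pool.erase(it); ++common; }  // count each block once
    }
    return 100.0 * common / std::max(h1.size(), h2.size());
}
```

A single byte inserted at the start of a file shifts every subsequent block boundary, so every block hash changes, which is precisely the weakness mentioned above.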

The technique behind CTPH was originally proposed by Andrew Tridgell [20], who implemented spamsum, a context-triggered piecewise hashing application devoted to the identification of spam e-mails [21]. Its basic idea consists in locating content markers, called *contexts*, within a binary data object, calculating the hash of each fragment of the document delimited by the corresponding contexts, and storing the sequence of hashes. Thus, the boundaries of the fragments are based on the actual content of the object and not determined by an arbitrarily fixed block size. In 2006, based on spamsum, Jesse Kornblum developed ssdeep [22], one of the first programs for computing context-triggered piecewise signatures. This algorithm generates a matching score in the range 0–100. This score must be interpreted as a weighted measure of how similar these files are, where a higher result implies a greater similarity of the files [23].

The SIF approach is based on the idea of identifying a set of features in each of the objects under study and then comparing the features, where a feature in this context is a sequence of consecutive bits selected by some criteria from the file that stores the object. As a practical implementation of the concept, Vassil Roussev decided to use entropy in order to find statistically-improbable features [24]. With this idea, he proposed a new algorithm called sdhash, whose goal was to pick object features that are least likely to occur by chance in other data objects [25]. This algorithm produces a score between 0 and 100 that, according to its author, must be interpreted as a confidence value indicating how certain the tool is that the two data objects under comparison have non-trivial amounts of commonality [26].

A more recent example of this approach is LZJD, a proposal made by Edward Raff and Charles Nicholas [15]. LZJD uses the same command-line arguments as sdhash but implements a different distance function, the Lempel-Ziv Jaccard distance [15]. According to its authors, LZJD scores range from 0 to 100 and can be interpreted as the percentage of bytes shared by two files, which makes it comparable to the other applications considered in this contribution.
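The core idea behind that distance can be sketched as follows (a conceptual illustration only: the real LZJD additionally hashes and subsamples the phrase sets for efficiency):

```
// Conceptual sketch of the Lempel-Ziv Jaccard distance: build the set of
// distinct phrases produced by a simple LZ78-style parsing of each byte
// sequence, then score the Jaccard similarity of the two sets.
#include <string>
#include <unordered_set>

static std::unordered_set<std::string> lzSet(const std::string& bytes) {
    std::unordered_set<std::string> phrases;
    std::string current;
    for (char c : bytes) {
        current += c;
        if (phrases.insert(current).second)  // unseen phrase: keep it, restart
            current.clear();
    }
    return phrases;
}

double lzjdScore(const std::string& a, const std::string& b) {
    std::unordered_set<std::string> sa = lzSet(a), sb = lzSet(b);
    std::size_t common = 0;
    for (const std::string& s : sa)
        if (sb.count(s) > 0) ++common;
    std::size_t unionSize = sa.size() + sb.size() - common;
    return unionSize == 0 ? 100.0 : 100.0 * common / unionSize;  // 0-100 scale
}
```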

In turn, BBR functions use external data, which are blocks chosen randomly, uniformly or in a fixed way, in order to rebuild a file. The process compares the bytes of the original file to the chosen blocks and calculates the differences between them, using the Hamming distance or any other metric. Two of the best known implementations of this approach are bbHash [10] and SimHash [27].

It is important to mention that, while there are other high-level information retrieval techniques based on the semantic analysis of the content (e.g., see References [28–30]), SPH tools work at a lower level by directly analysing the value of the content as a byte sequence.

When analysing the similarity between two files, it is important to distinguish the concepts of resemblance and containment. While resemblance indicates how much an object looks like another one, containment indicates the presence of an object inside another one [31]. SPH applications adjust their results to either one of these concepts or both. As it will become clear later, our proposal uses a combination of both concepts.

Another interesting approach developed in recent years is TLSH (Trend Micro Locality Sensitive Hash) [32], a Locality Sensitive Hashing (LSH) application which is related to work done in the data mining area. LSH algorithms classify similar input items into the same groups with high probability, maximizing hash collisions [33]. However, while ssdeep, sdhash, and LZJD provide a similarity score between two digests in the range [0–100], in TLSH a distance score of 0 indicates that the files are identical, and higher scores represent a greater distance between the analysed documents. As TLSH does not impose a limit on the score (according to the authors, it can potentially go up to over 1000), it is not possible to directly compare its results to those of the other applications, so we decided not to include it in our tests.

#### **3. Review of Ssdeep**

The kernel of the signature generation algorithm in ssdeep is a rolling hash very similar to the one employed in spamsum [21] and rsync [34]. The rolling hash is used to identify a set of trigger points in the file that depend on the content of a sliding window of 7 bytes (i.e., the number and location of the trigger points totally depend on the content that is processed by the application, as every byte is taken into account in the calculations). The algorithm hits a trigger point every time the rolling hash (based on the Adler-32 function [35]) produces a value that matches a predefined condition. A second hashing function, based on the FNV (Fowler-Noll-Vo) algorithm [36], is then used to calculate the hash values of the content located between two consecutive trigger points. Then, the last 6 bits of each hash value are translated into a Base64 character. The final signature is formed by concatenating the single characters generated at all the trigger points (with a maximum of 64 characters per signature).

The number of characters of the signature is strongly determined by the frequency of appearance of the trigger points. Therefore, the first step that must be completed by the algorithm is to estimate the value of the block size that would produce a final signature of 64 characters. More precisely, the value of the initial block size is the smallest number of the form 3 · 2<sup>*n*</sup> (where *n* is a non-negative integer) which is not lower than the value obtained after dividing the input size in bytes by 64. Another condition is that, if the length of the signature generated does not reach 32 characters, ssdeep modifies the block size and executes the algorithm another time to produce a new signature. This process is repeated until the signature obtained is at least 32 characters long.
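The steps just described can be summarised in a simplified C++ sketch (the rolling hash below is a plain sum over the window rather than ssdeep's Adler-32-based function, and the retry with a smaller block size and the final character computed from the trailing bytes are omitted):

```
// Simplified sketch of the ssdeep-style generation loop described above.
#include <cstdint>
#include <string>
#include <vector>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Smallest value of the form 3 * 2^n that is not lower than inputSize / 64.
std::uint64_t initialBlockSize(std::uint64_t inputSize) {
    std::uint64_t bs = 3;
    while (bs < inputSize / 64) bs *= 2;
    return bs;
}

std::string ctphSignature(const std::vector<std::uint8_t>& data,
                          std::uint64_t blockSize) {
    std::uint8_t window[7] = {0};     // 7-byte sliding window
    std::uint32_t rolling = 0;        // toy rolling hash: sum of the window bytes
    std::uint32_t fnv = 2166136261u;  // FNV-1a offset basis
    std::string sig;
    for (std::size_t i = 0; i < data.size() && sig.size() < 64; ++i) {
        rolling += data[i];
        rolling -= window[i % 7];           // remove the byte leaving the window
        window[i % 7] = data[i];
        fnv = (fnv ^ data[i]) * 16777619u;  // FNV-1a over the current block so far
        if (rolling % blockSize == blockSize - 1) {  // trigger point hit
            sig += B64[fnv & 0x3F];  // last 6 bits -> one Base64 character
            fnv = 2166136261u;       // start hashing the next block
        }
    }
    return sig;
}
```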

The signature generation algorithm produces two signatures in order to be able to compare a higher number of files. The reason for that is that sometimes similar files have slightly different block sizes. The two signatures are known as the leading signature, which uses a certain value for the block size (the leading block size), and the secondary signature, whose secondary block size value is twice the value of the leading block size.

The signature format produced by ssdeep consists of a header followed by one hash per line. The content of the header used in its latest versions is the following:

ssdeep,1.1--blocksize:hash:hash,filename,

where the element ssdeep identifies the file type, 1.1 informs about the version of the file format (not to be confused with the version of the program), -- acts as a separator, and the remainder of the line identifies the elements displayed below the header (block size, primary hash, secondary hash, and filename).

In Information Theory, the edit distance between two strings of characters generally refers to the number of operations required to transform one string into the other. There are several ways to define the edit distance, depending on which operations are allowed. The ssdeep edit distance algorithm is based on the Damerau-Levenshtein distance between two strings [37,38].

That distance compares two strings and determines the minimum number of operations necessary to transform one string into the other. The only operations allowed in this distance comparison are insertions, deletions, substitutions of a single character, and transpositions of two adjacent characters [39,40].

In ssdeep, insertions and deletions have a weight of 1, substitutions a weight of 3, and transpositions a weight of 5. For instance, using this algorithm, the distance between the strings "Saturday" and "Sundays" is 5, as can be seen in the following sequence, where the elements in bold format represent the characters that are added or removed at each step:

S**a**turday → S**t**urday → Su**r**day → Suday → Su**n**day → Sunday**s**
In practice, a consequence of assigning the value 3 to substitutions and 5 to transpositions is that the edit distance calculated only uses insertions and deletions. Therefore, substitutions and transpositions have a weight of 2 (a deletion followed by an insertion).
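The resulting distance can be computed with a compact dynamic-programming sketch restricted to insertions and deletions of weight 1 (it equals len(a) + len(b) − 2·LCS(a, b) and reproduces the value 5 for the example above):

```
// Edit distance with only insertions and deletions, each of weight 1.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int indelDistance(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = (a[i - 1] == b[j - 1])
                          ? d[i - 1][j - 1]                         // match: no cost
                          : 1 + std::min(d[i - 1][j], d[i][j - 1]); // delete or insert
    return d[a.size()][b.size()];
}

int main() {
    std::cout << indelDistance("Saturday", "Sundays") << '\n';  // prints 5
}
```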

One of the limitations of this design is that, for a specific string, a rotated version of it is assigned many insertion and deletion operations, even though, with regard to the content, they are basically the same (i.e., the content is the same, although the order of the substrings is different). Consider for example the strings "abcd1234" and "1234abcd".

Finally, the score that ssdeep produces is normalised in the range [0, 100], where 100 is associated to a perfect match and 0 to a complete mismatch. If the two signatures have different block sizes, then ssdeep automatically sets the score to 0 without performing further calculations.

In addition to the previous information, ssdeep defines the minimum length of the longest common substring as 7. If the longest common substring length detected during the procedure is less than that value, then ssdeep provides a score of 0.

The source code of ssdeep and implementations for both Windows and Linux are freely available [5], so interested readers can inspect the code and test the application.

#### **4. Design of SiSe**

#### *4.1. Key Elements for Improvement*

Frank Breitinger and Harald Baier presented in 2012 a list of four general properties for similarity preserving hashing functions [41], which they later extended to the following five characteristics [11]:



Once the analysis of the characteristics of ssdeep was completed, we were able to identify a list of additional practical features (which are closely related to ssdeep's limitations) that, in our opinion, any bytewise approximate matching function should provide (see Reference [16]):


As the reader may note, the previous requirements can be divided into two groups associated to the procedures of signature generation and signature comparison. For the sake of clarity, we provide below (see Table 1) the two differentiated sets of requirements as they will be referred to in the rest of the document:


**Table 1.** Requirements that should be fulfilled by any approximate matching application.

We have established a similarity definition that considers an important problem either not completely taken into account by other functions or implemented by them in an unsatisfactory way: We believe that the similarity comparison should detect the inclusion of the content of a file in another file, but at the same time it should also indicate in some way which parts of the larger file are contained in the smaller file. This feature would be very useful in many contexts, such as plagiarism detection. For example, if we consider three files extracted from the same book (one with the first chapter, another with the first ten chapters, and the last one with the whole book), the comparison should work in the following way: the result of comparing the first and the second file should be different from the one obtained when comparing the first and the third file. This outcome is possible with our similarity definition, as it provides more granularity in the results.

#### *4.2. SiSe Signature Generation Procedure*

As mentioned before, the SiSe signature generation procedure is based on the one implemented in ssdeep. For example, the generation of the initial block size in SiSe and in ssdeep is the same (i.e., the minimum length of the leading signature is half the initially expected length), and both use the same versions of the Adler-32 and FNV (Fowler-Noll-Vo) functions, which were also selected by spamsum because they performed better than cryptographic functions such as MD5 or the SHA family.

However, SiSe includes important modifications in order to be aligned with the requirements described above. One of the more obvious differences between SiSe and ssdeep is the generation of two characters per trigger point instead of only one, using the last 12 bits of the FNV output instead of the last 6 bits. The main drawback of this decision is that SiSe signatures require twice as many characters as ssdeep signatures for the same file and block size. Nevertheless, in order to keep a low percentage of false positives, from our perspective it is essential to increase the number of characters generated from the output of the FNV function.
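A minimal sketch of this step, assuming the two Base64 characters are taken from the low 12 bits of the FNV output in big-endian order (the exact bit layout is an implementation detail of SiSe):

```
// Translate the last 12 bits of the FNV output into two Base64 characters.
#include <cstdint>
#include <string>

static const char B64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

std::string codify(std::uint32_t fnvValue) {
    std::string pair;
    pair += B64[(fnvValue >> 6) & 0x3F];  // bits 11..6 of the hash
    pair += B64[fnvValue & 0x3F];         // bits 5..0 of the hash
    return pair;
}
```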

Even though SiSe does not impose a limit on the maximum length of the signatures when generating them, for performance reasons we have imposed in its implementation an arbitrary limit of 2560 double characters when comparing signatures. That value has been selected so that most signatures do not exceed it, as the expected number of double characters in SiSe (and of single characters in ssdeep) is 64.

Another difference with respect to ssdeep is that SiSe computes three signatures in each execution loop, with block sizes *b*, *b*/2 and *b*/4 (in Reference [12] there is a similar technique, but Chen and Wang use *b*, 2*b*, 4*b*, and 8*b* instead, so our algorithm implements the opposite approach). The reason behind this decision is that the experimental work described in Reference [9] (which employed 15,036 files of several types totalling 3.84 gigabytes) demonstrated that 67.3% of the tested files needed the main loop of the algorithm (i.e., the loop that scans the file at byte level and processes its full content) to be executed once, while 26.2% needed to run the main loop twice, and the rest of the files needed at least three loop executions. In the case of our design, it is possible to choose as leading and secondary signature the ones related to block sizes *b* and *b*/2, or to block sizes *b*/2, and *b*/4, after any execution of the main loop. With this, it is not necessary to perform any additional loop execution and, consequently, the running time will decrease significantly in some cases. Figure 1 illustrates the difference between ssdeep and SiSe regarding the number of main loop executions for each case.

**Figure 1.** Percentage of files that need a different number of executions of the main loop.

While, in theory, calculating a third signature increments the complexity of the implementation and, in principle, would increment the running time, the difference is negligible in practice, since our algorithm checks if a certain byte is a trigger point when using block size *b* only if that byte is a trigger point for both *b*/4 and *b*/2. Besides, our approach allows the completion of only one pass most of the times (more precisely, in more than 90% of the cases) in comparison to ssdeep, where two or more passes are necessary in almost one third of the cases.

In our implementation we have decided to use *b* and *b*/2 (or *b*/2 and *b*/4) instead of *b* and 2*b* as the relationships between the block sizes of the leading and secondary signatures because, in this way, the length of both signatures surpasses the minimum length (32 double characters). This fact becomes important when using the secondary signature in a comparison: If, for the sake of precision, it is important for the length of the leading signature not to have less than 32 double characters, then the secondary signature should satisfy the same requirement. In our case, any signature selected for a comparison will surpass the minimum length, improving consequently the accuracy of the comparisons. It is obvious that the disadvantage of this design is that, in most files, the length of the secondary signature is larger than the length of the leading signature (the smaller the block size, the higher the number of trigger points that are likely to be detected), incrementing the size of the file containing both signatures.

Algorithm 1 presents the details of SiSe's signature generation procedure [16]. The meaning of the different functions included in the algorithm is as follows:

• length(*x*): returns the size of *x* (in bytes for the input file, in characters for a signature string).
• subArray(*a*, *i*, *j*): returns the portion of the byte array *a* from position *i* (inclusive) to position *j* (exclusive).
• updateWindow(*currentByte*): inserts the current byte into the 7-byte sliding window and returns the updated window.
• Adler∗(*baWindow*): computes the Adler-32-based rolling hash of the sliding window.
• FNV∗(*baBlock*): computes the FNV hash of the block of bytes delimited by two consecutive trigger points.
• codify(*resultFNV*): translates the last 12 bits of the FNV hash value into a pair of Base64 characters.
• append(*sig*, *pairChars*): concatenates the pair of characters to the corresponding signature string.
#### **Algorithm 1** SiSe signature generation.

```
1: inputSize ← length(inputFile)
2: blockSize ← 3
3: dec ← 1
4: quotient ← floor(inputSize/64)
5: while (blockSize < quotient) do
6: blockSize ← 2·blockSize
7: end while
8: sig1Len, sig2Len ← 0
9: blockSize1 ← blockSize
10: blockSize2 ← blockSize1/2
11: blockSize4 ← blockSize2/2
12: while ((sig1Len < 64) and (sig2Len < 64)) do
13: index ← 0
14: last1, last2, last4 ← -1
15: while (index < inputSize) do
16: currentByte ← subArray(baInput,index,index+1)
17: baWindow ← updateWindow(currentByte)
18: resultAdler ← Adler∗(baWindow)
19: if ((resultAdler % blockSize4)=(blockSize4 - dec)) then
20: baBlock4 ← subArray(baInput,last4+1,index+1)
21: resultFNV ← FNV∗(baBlock4)
22: pairChars ← codify(resultFNV)
23: sig4 ← append(sig4, pairChars)
24: last4 ← index
25: if ((resultAdler % blockSize2)=(blockSize2 - dec)) then
26: baBlock2 ← subArray(baInput,last2+1,index+1)
27: resultFNV ← FNV∗(baBlock2)
28: pairChars ← codify(resultFNV)
29: sig2 ← append(sig2, pairChars)
30: last2 ← index
31: if ((resultAdler % blockSize1)=(blockSize1 - dec)) then
32: baBlock1 ← subArray(baInput,last1+1,index+1)
33: resultFNV ← FNV∗(baBlock1)
34: pairChars ← codify(resultFNV)
35: sig1 ← append(sig1, pairChars)
36: last1 ← index
37: end if
38: end if
39: end if
40: end while
41: sig1Len ← length(sig1)
42: sig2Len ← length(sig2)
43: if (sig1Len < 64) and (sig2Len < 64) then
44: blockSize1 ← blockSize1/4
45: blockSize2 ← blockSize1/2
46: blockSize4 ← blockSize2/2
47: else
48: if (sig1Len ≥ 64) then
49: return sig1,sig2
50: else
51: return sig2,sig4
52: end if
53: end if
54: end while
```
#### *4.3. SiSe Signature Comparison*

In addition to the changes in the signature generation process explained above and the multi-threading design, in this contribution we have implemented a new signature comparison algorithm which compares two signature strings.

Our work is related to other edit distance algorithms that allow moves (e.g., References [42–45]). However, those contributions are mainly focused on the performance of such algorithms, and in many cases they impose important restrictions during the comparison process. For example, in Reference [42] the strings under comparison must be of equal size and must contain exactly the same characters; the study in Reference [43] assumes that each letter occurs at most *k* times in the input strings; in Reference [44] the original position (*p*<sub>1</sub>) and the final position (*p*<sub>2</sub>) of the substring to be moved must satisfy a limiting rule (namely, if *l* is the length of the substring, it is necessary that *p*<sub>2</sub> is either smaller than *p*<sub>1</sub> or larger than *p*<sub>1</sub> + *l*). Regarding Reference [45], it uses a different technique based on embedding the strings into a vector space and using a parsing tree.

Compared to the previously mentioned contributions, our algorithm manages text units comprised of two characters, while the rest of related algorithms, to the best of our knowledge, manage single characters, which allows us to improve the detection rate, as the probability of two pairs of characters being the same is 1/4096 (the two Base64 characters of a pair use 12 bits) instead of 1/64, as it is the case when using only one Base64 character (and the 6 bits associated to it).

Algorithm 2 shows all the calculations performed to numerically calculate the similarity of the strings sig1 (with length sig1Len) and sig2 (with length sig2Len), where the first step consists in creating a two-dimensional array of size (sig1Len + 1) × (sig2Len + 1). That array, called arrComp, initially contains the value 0 in all of its positions, but is updated whenever a double character of the first string matches a double character of the second string (only if both double characters are located at an even position, considering that the first position of a string is 0). The arrays identified as arrx and arry mark if a character has already been taken into account during the computation of the score (a value 0 means that the character has not been considered for the score up to that point, whilst a value 1 means that the character belongs to a substring included in both input strings that has already been used during the calculation of the score).

#### **Algorithm 2** SiSe signature comparison.

```
1: for all i such that 1 ≤ i ≤ sig1Len do
2: for all j such that 1 ≤ j ≤ sig2Len do
3: if sig1[i − 1] = sig2[j − 1] then
4: if (i%2 = 1) and (j%2 = 1) then
5: if sig1[i]=sig2[j] then
6: arrComp[i][j] = arrComp[i − 1][j − 1] + 1
7: end if
8: else
9: if (i%2 = 0) and (j%2 = 0) then
10: if arrComp[i − 1][j − 1] > 0 then
11: arrComp[i][j] = arrComp[i − 1][j − 1] + 1
12: if i <sig1Len and j <sig2Len then
13: if sig1[i] = sig2[j] then
14: list.add((arrComp[i][j],i,j))
15: end if
16: else
17: list.add((arrComp[i][j],i,j))
18: end if
19: end if
20: end if
21: end if
22: end if
23: end for
24: end for
25: index1 ← 0, index2 ← 0
26: for all item in list do
27: if arrx[item.posRow]= 0 and arry[item.posCol]= 0 then
28: points = points + item.val
29: for all index1 such that (item.posRow - item.val + 1) ≤ index1 ≤ item.posRow do
30: for all index2 such that (item.posCol - item.val + 1) ≤ index2 ≤ item.posCol do
31: if ((arrx[index1]= 1) or (arry[index2]= 1)) then
32: points = points - 1
33: end if
34: arrx[index1] = 1
35: arry[index2] = 1
36: end for
37: end for
38: else
39: dec ← item.val
40: for all index1 such that item.posRow ≥ index1 ≥ (item.posRow - item.val + 1) do
41: for all index2 such that item.posCol ≥ index2 ≥ (item.posCol - item.val + 1) do
42: if arrx[index1]= 1 or arry[index2]= 1 then
43: dec ← dec - 1
44: else
45: break
46: end if
47: end for
48: end for
49: if dec > 0 then
50: points = points + dec
51: for all index1 such that (item.posRow - item.val + 1) ≤ index1 ≤ (item.posRow - (item.val - dec)) do
52: for all index2 such that (item.posCol - item.val + 1) ≤ index2 ≤ (item.posCol - (item.val - dec)) do
53: if arrx[index1]= 1 or arry[index2]= 1 then
54: points ← points - 1
55: end if
56: arrx[index1] ← 1
57: arry[index2] ← 1
58: end for
59: end for
60: end if
61: end if
62: end for
63: return (points*100)/max(sig1Len,sig2Len)
```
Table 2 shows an example of such an array when comparing the strings A1B2C3D4F8 and 1A1BC3D4F7A1. As it can be observed, the substring 1B is not credited any point because in one string it starts at an odd position while in the other string it starts at an even position, so they do not produce a match.


**Table 2.** Comparison array example.

Once all the characters are compared and an array such as the one shown in Table 2 is completed, a list is created with elements that include three values: The highest value assigned to a series (displayed in bold in the table), and the row and column associated to that value. In the example, two elements would be added to the list, (4, 8, 8) and (2, 12, 2). Those elements, representing all the sequences of double characters located at valid positions, are then processed by the second part of the algorithm in order to increase the number of points assigned to the comparison and to remove those substrings from the comparison, so they are not taken into account more than once.

In the example shown, the final score would be 50, as 50% of the string 1A1BC3D4F7A1 can be found in the string A1B2C3D4F8 (more specifically, the double characters A1, C3, and D4 are shared by both strings, while the larger string contains three additional double characters, 1A, 1B, and F7, that are not common).

#### **5. SiSe Implementation**

#### *5.1. Interface*

We have implemented our design of SiSe as a C++ multi-threaded command-line application that uses the features of the C++17 standard [46] and the OpenMP library [47]. The application implements a dual input command format: One adapted to the FRASH tests, and a proprietary one. Regarding the input format compatible with FRASH, SiSe can be used without modifiers with one or several files separated by spaces. This indicates to SiSe that it must treat those files as content data and generate their signatures individually. In turn, if modifier -x is used along with the name of a file, it indicates to the application that the file contains multiple signatures and that it must compare all the signatures against the rest. Finally, if modifier -r is employed together with the path of a directory, SiSe will compute the signature of all the files contained in that folder.
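For illustration, the three FRASH-compatible invocation modes just described would look as follows (the file and directory names are hypothetical):

> sise report.docx image.bmp

> sise -x signatures.txt

> sise -r /evidence/documents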

In comparison, the proprietary input command format used by SiSe for creating a new signature is sise -i input\_file, which allows the following additional optional elements:





Independently of the input command format used, the signature procedure creates a text string containing a header and two signatures (the leading and the secondary signatures). The header of the initial version is SiSe-1.0--7:1--blocksize:hash,filename, where the elements 7 and 1 are replaced by other numbers when the user selects non-default values for the window size and the decremental value, respectively.

The first signature that follows the header is the leading signature. Both signatures start with the block size and the value used in the modulus comparison to identify the trigger points (i.e., the block size minus the decremental value), the latter enclosed in parentheses and followed by a colon. The next item that appears is the actual signature string, corresponding to the previously stated block size. A double colon :: separates both signatures. Finally, the complete path of the file that has been processed (written between quotation marks) appears separated from the previous information by a comma.
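Given that layout, a signature line can be split into its components as in the following C++ sketch (an illustration based on the description above, not part of SiSe's code base; error handling is omitted):

```
// Splits a line of the form leading::secondary,"path" into its components,
// following the layout described in the text.
#include <string>

struct SignatureLine {
    std::string leading;    // e.g., 6144(6143):k2C7...
    std::string secondary;  // e.g., 3072(3071):Q7BE...
    std::string filename;   // quoted file path at the end of the line
};

SignatureLine parseLine(const std::string& line) {
    SignatureLine parts;
    std::size_t sep = line.find("::");      // double colon between signatures
    std::size_t comma = line.rfind(",\"");  // comma preceding the quoted path
    parts.leading = line.substr(0, sep);
    parts.secondary = line.substr(sep + 2, comma - sep - 2);
    parts.filename = line.substr(comma + 1);
    return parts;
}
```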

In order to illustrate the signature generation process, Table 3 shows the output provided by ssdeep and SiSe when processing the plain text file containing Robert Louis Stevenson's Treasure Island [48]. As it can be noted, the character that appears in the second position of each pair in the secondary signature of SiSe (the one with a block size of 3072) matches the corresponding character of the leading signature of ssdeep. This happens because the block size, window size, and decremental value are the same for those two signatures. The coincidence in the series of characters ends at the 63rd character of the ssdeep signature, as at that point ssdeep computes its final character using the rest of the document, independently of the length of that remaining part and the number of trigger points included in it. In contrast, SiSe continues computing all the characters derived from the presence of additional trigger points.

**Table 3.** ssdeep and SiSe signature examples.


> sise pg27780.txt

> ssdeep pg27780.txt

SiSe-1.0--7:1--blocksize(modvalue):hash,filename
6144(6143):k2C7lYMp6KJw/hiKodceZoVDD//JfVrjOQVTT1GJA+Ir8O+SrfGiEtMpq4UpucU7::3072(3071):Q7BEUlUbaUe+1TInKgvluXnXkzMpDfLu12J/j8ZFfbwou0YAxNXtXDU3NxiKodceLSf/gGxVCcvOi/PbSLDeD/EsZVuczd+Jyq/ZC/M7fyH0aOG3JFaF1hc1Kgh+VMZVbrXJR/HrT1GJA+xYR6cvz7a58OIfvpEBjAVAA+x3Tu0B+gTTWOEtn3pfQ2wLkqx9lFhovSu4uc1ePk,"/media/victor/USB/pg27780.txt"

The proprietary format of the command for comparing signatures and/or files is sise -c input\_file\_1 input\_file\_2, which allows these combinations:

• *Two signature files*: The comparison between signatures can be done only if they have at least one block size value in common. Additionally, the decremental value and the window size must be equal in both signature files. If the leading signatures use the same block size value, and the secondary signatures also employ the same block size value (but different from the block size of the leading signatures) at the same time, SiSe is able to apply the edit distance algorithm (see Algorithm 2 in Section 4.3) in both cases, and returns the highest score calculated.


#### *5.2. Multi-Thread Design*

The C++ implementation of SiSe uses the OpenMP library [47] both during the signature creation and the signature comparison procedures. Multi-threading is employed in the comparison phase when comparing digests using the -x option, that is, when requesting SiSe to compare all the signatures included in a file with the rest of signatures of that file. In that scenario, SiSe distributes the work between a number of threads that equals the number of logical cores of the processor, where the number of logical cores is the number of physical cores multiplied by the number of threads that can run on each core through the use of hyperthreading.
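As a sketch of this distribution of work (compareSignatures is a placeholder standing in for the edit distance of Algorithm 2):

```
// Sketch of the -x style all-against-all comparison parallelised with OpenMP.
#include <cstdio>
#include <string>
#include <vector>
#include <omp.h>

// Placeholder for Algorithm 2; here it only flags identical signatures.
int compareSignatures(const std::string& a, const std::string& b) {
    return a == b ? 100 : 0;
}

void compareAll(const std::vector<std::string>& sigs) {
    const int n = static_cast<int>(sigs.size());
    // By default OpenMP creates one thread per logical core; dynamic
    // scheduling balances the uneven work across rows of the triangle.
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            int score = compareSignatures(sigs[i], sigs[j]);
            #pragma omp critical
            std::printf("%d vs %d: %d\n", i, j, score);
        }
}
```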

Multi-threading is also used during the signature generation process in two different ways. For any file larger than 1 megabyte, the strategy consists in dividing the file into several parts so that each thread is assigned a portion of the file. Given the nature of the operations that each thread must perform, the multi-thread design tries to minimize the number of bytes that are processed twice. Nevertheless, the double processing of some portions of the file cannot be avoided, as it will become clear with the description included below and Figure 2. Besides, when a command requests to process more than one file, SiSe creates a list of the files whose size is less than 1 megabyte and assigns one thread per file up to the thread limit.

**Figure 2.** Representation of the multi-thread design.

Coming back to the multi-threading scheme for a file larger than 1 megabyte, when using *n* threads that file is divided into *n* parts, so each thread starts processing bytes at a different offset inside the file. If the working thread is thread #1, the first trigger point found by the thread provokes the computation of the first pair of characters of the signature. However, if the working thread is not thread #1, when the thread detects the first trigger point it adds no pair of characters to the signature, as at that point the thread lacks the knowledge of where the previous trigger point was located. From that moment on, all the threads work as expected and, whenever they find a trigger point, they compute the corresponding two-character element of the signature. Threads continue to work in that way and, except the last thread (which finishes its execution when it reaches the end of the file, computing at that moment the pair of characters associated to the end of the signature), they only stop when finding the first trigger point located after the byte that marks the theoretical end point of their respective file partitions.

This scheme brings as a consequence a certain overlapping of parts of the file processed both by threads *i* − 1 and *i*, where the overlapped portion is read by thread *i* − 1 at the end of its execution and by thread *i* at the beginning of its execution. The extent of the overlap depends on the content of the file, its size, and the number of threads. Figure 2 provides an example that illustrates the multi-thread scheme when four threads are used.

As the detection of trigger points depends on the value of the byte being processed and the value of the previous 7 bytes, before processing the first byte each thread fills in the sliding window array with the proper 7 bytes, so no false trigger points appear and thus the signatures are guaranteed to be the same independently of the number of threads used during the process.
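That seeding step can be sketched as follows (illustrative names, not SiSe's actual internals):

```
// Each thread seeds its 7-byte sliding window with the bytes that precede its
// partition, so the rolling hash values match those of a single-threaded run.
#include <cstdint>
#include <cstring>
#include <vector>

struct ThreadState {
    std::uint8_t window[7];  // sliding window contents at the partition start
    std::size_t begin;       // first byte this thread is responsible for
};

ThreadState seedThread(const std::vector<std::uint8_t>& file,
                       std::size_t partitionStart) {
    ThreadState state{};
    state.begin = partitionStart;
    // Thread #1 starts at offset 0 with an empty window; every other thread
    // copies the 7 bytes located just before its partition.
    std::size_t available = partitionStart < 7 ? partitionStart : 7;
    std::memcpy(state.window + (7 - available),
                file.data() + (partitionStart - available), available);
    return state;
}
```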

Regarding the optimal number of threads to be employed by the application, after several tests it was decided to use one thread for files whose size is less than 1 megabyte and a number of threads that equals the number of logical cores in the rest of the cases. The reason for this is that, for small files, the cost of setting up the threads and aggregating their results is not compensated by the benefit of a reduced workload per thread. As for the buffer size used when reading the files, during the tests a buffer of one megabyte proved to be the best option.

#### **6. Ad-Hoc Tests**

This section shows the results of the comparisons performed with SiSe, ssdeep, sdhash, and LZJD in different scenarios that we designed to verify the features associated with the characteristics described in Section 4.1.

We have classified the tests in four categories: Similarity tests, dissimilarity tests, special signatures tests, and suitability tests. For the first two groups, we have tested plain text files, Word documents, and BMP images. The tests with special signatures are intended to check the behaviour of the four algorithms in some special cases. The suitability tests try to determine the practicability of the applications in real-world scenarios by measuring the performance and signature size using files whose length span from less than 1 megabyte to several gigabytes.

As mentioned in Section 4, after performing several tests modifying the values of the block sizes, window sizes, and decremental values, we concluded that the results did not show appreciable differences. For this reason, we decided to use the default parameters (sliding window of 7 bytes, block size computed by the algorithm, and decremental value equal to 1) in all the tests that follow. Moreover, as they are the same values that ssdeep uses, it is easier to derive conclusions from the results.

It is important to notice that, in all the tables related to tests in which two files are employed, the identifier linked to the row represents the first input argument and the identifier linked to the column represents the second input argument.

Tables associated to SiSe include four values in each cell. Those values are obtained when comparing two hash files, a content file and a hash file (in that order), a hash file and a content file (in that order), and two content files, respectively.

The values associated with the tests with ssdeep can be obtained either by comparing two signature files by means of the command ssdeep -ak sigfile1 sigfile2 or by comparing a signature file and a content file (in that order) using the command ssdeep -am sigfile contentfile. As in all the tests both methods provide the same numerical value, in the tables related to ssdeep we have included each value once, even though all the tests have been performed separately with both commands.

Results can be obtained with sdhash and LZJD through two procedures. In the first one, a signature file with the hashes of all the files has been generated with the command sdhash \* > sigfile.sdbf and, as a second step, that file has been used as input for the command sdhash -c sigfile.sdbf that compares all the hashes included in the file. In the second case, the results have been directly generated by using the command sdhash -g \*. When using LZJD, the command sdhash must be replaced with the command LZJD.jar, as the authors of LZJD recommend using the Java version of the algorithm [49]. In the tables with the sdhash results only one value appears in each cell, as the result obtained with the two procedures is exactly the same. In comparison, LZJD provides different values when using the two procedures with the same files, so in that case we have preferred to include both values in the corresponding table.

It is interesting to note that, while the ssdeep matrices are always symmetric, that is not always the case with the matrices related to SiSe. The reason for that fact is that the second score of each cell represents a test where a signature file (pertaining to the file identified by the row of the table) is compared against the content of a data file (whose identifier is at the top of the column that contains the cell with the score). According to its predefined behaviour, SiSe imposes the leading block size of the file associated to the row when processing the signature of the file associated to the column. If the leading block size of the file associated to the row is smaller than the theoretical leading block size of the file associated to the column, a signature (which theoretically is longer than the original signature that would have been generated with SiSe) is produced by the application. On the other hand, when the leading block size of the file associated to the row is larger than the theoretical leading block size of the file associated to the column, it could happen that, when that block size is forced on the file associated to the column, no signature of the minimum length could be generated. In this case, SiSe is not able to perform the comparison. In the rest of the cases there might be small differences in the outcome produced by SiSe for the second test (i.e., the second number displayed in every cell) in both comparisons (i.e., file A compared to file B, and file B compared to file A), depending on the leading block size used in each case. The same situation appears in the tests associated to the third score. However, when using two signature files or two data files (i.e., the first and fourth scenarios) the results are always symmetric for each comparison.

As a final comment, for better legibility in the tables, we have discarded the results related to the comparison of any file with itself.

#### *6.1. Similarity Tests*

#### 6.1.1. Plain Text Documents

In this test, plain text files with the first 20 chapters of Miguel de Cervantes' Don Quijote, in the version offered by The Project Gutenberg [50], have been used. These files are the following: Q01.txt (10,887 bytes, Quijote's chapter 1), Q02.txt (23,877 bytes, chapters 1, 2), Q03.txt (37,342 bytes, chapters 1–3), Q04.txt (51,374 bytes, chapters 1–4), Q05.txt (60,526 bytes, chapters 1–5), Q10.txt (125,217 bytes, chapters 1–10), Q15.txt (204,198 bytes, chapters 1–15), and Q20.txt (305,527 bytes, chapters 1–20). Table 4 shows the percentage of the smaller file as a part of the larger file using the byte size as the comparison value, as in this test the smaller files are totally contained in the larger ones.


**Table 4.** Percentage of the larger file representing the smaller file with content in plain text format.

Tables 5–9 present the outcomes of the tests using SiSe, ssdeep, sdhash, and LZJD. As it can be seen, SiSe is the application that offers results closer to the percentages included in Table 4, providing values better adapted to the similarity definition presented in Section 4.1. In addition to that, it is important to note that SiSe provides results (with two of its four operation modes) in several cases where ssdeep and LZJD return a value of 0, as for example in the comparison between Q01.txt and Q04.txt, the comparison between Q02.txt and Q10.txt, or the comparison between Q05.txt and Q15.txt. The values generated by sdhash in those cases are 95, 99, and 100, respectively, which indicate that the smaller file is definitely included in the larger file but do not provide details about how much content of the larger file is replicated in the smaller file.


**Table 5.** Test results for similar plain text files with SiSe, part 1 (hash vs. hash/file vs. hash/hash vs. file/file vs. file).


**Table 6.** Test results for similar plain text files with SiSe, part 2 (hash vs. hash/file vs. hash/hash vs. file/file vs. file).

**Table 7.** Test results for similar plain text files with ssdeep.


**Table 8.** Test results for similar plain text files with sdhash.



**Table 9.** Test results for similar plain text files with LZJD (signature comparison/file comparison).

There are two situations in which SiSe does not provide a comparison result. The first one happens when comparing two files with incompatible signatures. The second situation appears when comparing a small content file with the hash of a large file, as in that case the block size used in the comparison is too large for generating a valid signature in the smaller file. When the two other situations appear (comparing two content files or a large content file with the hash of a small file), valid results are always produced.

#### 6.1.2. Word Documents

In this second test, the same book as in the previous test was used, although the chapters of Don Quijote have been saved as Microsoft Word 2016 documents with the identifiers Q01.docx (17,637 bytes), Q02.docx (24,140 bytes), Q03.docx (30,596 bytes), Q04.docx (37,422 bytes), Q05.docx (41,895 bytes), Q10.docx (68,691 bytes), Q15.docx (105,228 bytes), and Q20.docx (150,969 bytes).

Table 10 shows the percentage of the smaller file as a part of the larger file using the byte size as the comparison value, as in this test the smaller files are totally contained in the larger ones. The values included in this table are slightly higher than the values presented in Table 4 due to the internal format used by Microsoft Word.

**Table 10.** Percentage of the larger file representing the smaller file with content in Microsoft Word format.


Tables 11–15 present the results obtained with SiSe (Tables 11 and 12), ssdeep, sdhash, and LZJD. With the information included in those tables, it is obvious that SiSe can compare more files than ssdeep. The values provided by SiSe in this test are lower than the values generated in the previous test, which is once again due to the internal structure of Word documents and the metadata contained in each file.

In the case of sdhash, it can be said that the values shown in Table 14 do not reflect the trend that can be visualised with SiSe where, for a specific file (e.g., Q01.docx), the comparison score decreases as the other file used in the comparison contains a larger portion of the book (Q02.docx, Q05.docx, Q10.docx, etc.), as the content of the initial file is diluted in the larger files.

Finally, it is worth mentioning that LZJD fails to produce results above zero in many cases.

**Table 11.** Test results for similar Word documents with SiSe, part 1 (hash vs. hash/file vs. hash/hash vs. file/file vs. file).


**Table 12.** Test results for similar Word documents with SiSe, part 2 (hash vs. hash/file vs. hash/hash vs. file/file vs. file).



**Table 13.** Test results for similar Word documents with ssdeep.

**Table 14.** Test results for similar Word documents with sdhash.


**Table 15.** Test results for similar Word documents with LZJD (signature comparison/file comparison).


#### 6.1.3. BMP Images

This test uses the three images shown in Figure 3. The first image is the classic Lenna greyscale test image [51]. The second and third images contain the same picture, but with half of it rearranged horizontally and vertically, respectively. The three images are bitmaps of the same resolution, and therefore their size is the same (786,486 bytes). Readers should note that, in these individual BMP images, each pixel is represented by 24 bits, and that the pixel content is stored sequentially in the file. Pixels are read processing rows from top to bottom and, for any given row, from left to right (which means that the first pixel stored in the file is the one located at the upper left corner of the image, and the last one is the pixel located at the bottom right corner of the image).

**Figure 3.** BMP test images.

The results obtained with the four applications are displayed in Tables 16–19, where we have used the identifiers BMP 1, BMP 2, and BMP 3 for the three images.

**Table 16.** Test results for similar BMP images with SiSe (hash vs. hash/file vs. hash/hash vs. file/file vs. file).

**Table 17.** Test results for similar BMP images with ssdeep.

**Table 18.** Test results for similar BMP images with sdhash.



**Table 19.** Test results for similar BMP images with LZJD (signature comparison/file comparison).

Using the results obtained, we can conclude that ssdeep is not able to match the vertically rearranged image with the two other images. In contrast, SiSe obtains results in both cases. Additionally, SiSe assigns a larger similarity percentage when comparing the first and second images (98%) than ssdeep (52%), which, in our opinion, is closer to reality given our similarity definition and the content of the files. On the other hand, the results obtained when processing the third image with sdhash are better adapted to our similarity definition, but sdhash fails to identify the first two images as having basically the same content. Regarding LZJD, its results follow the trend of sdhash but with scores significantly lower than those of sdhash. Finally, the reason why SiSe provides a lower result when comparing the third image to the other ones than when comparing the first and second images is that, due to the way a BMP file stores its pixels, many more trigger points are shared by the first and second images.

#### *6.2. Dissimilarity Tests*

The aim of this group of tests is to detect undesirable false positives by comparing files of the same format but with different content. Given the results obtained, no file is interpreted as strongly related to the other files included in the same test, as the maximum value obtained through those tests is 10 (produced by sdhash).

#### 6.2.1. Plain Text Documents

The plain text files used in this test contain the following books as offered by The Project Gutenberg: H. G. Wells' The Time Machine [52] (201,875 bytes), Miguel de Cervantes' Don Quijote [50] (2,198,927 bytes), Robert Louis Stevenson's Treasure Island [48] (397,415 bytes), and Jules Verne's Voyage au Centre de la Terre [53] (460,559 bytes).

In this test ssdeep and LZJD did not provide any positive results when comparing the different files. The results obtained with SiSe and sdhash are offered in Tables 20 and 21.


**Table 20.** Test results for dissimilar plain text files with SiSe (hash vs. hash/file vs. hash/hash vs. file/file vs. file).


**Table 21.** Test results for dissimilar plain text files with sdhash.

SiSe obtains a maximum value of 5 for dissimilar files, which implies that a threshold could be established in order to discard false positives. In the case of ssdeep, it returns a value of 0 for every comparison without processing its edit distance algorithm. The reason for that is that the corresponding signatures do not have a common substring of at least 7 characters, and hence it directly returns a value of 0. The comparisons performed by sdhash show values that are slightly higher than the ones provided by SiSe, but without a significant difference.

#### 6.2.2. Word Documents

For this test, new Microsoft Word 2016 (Microsoft Corporation, Redmond, WA, USA) files have been created with the same content of the books used in the previous test. The size in bytes of each file is 118,169 (Machine.docx), 1,918,754 (Quijote.docx), 372,941 (Treasure.docx), and 263,316 (Voyage.docx).

While ssdeep and LZJD did not provide any positive results when comparing the different files, SiSe and sdhash returned the residual values included in Tables 22 and 23.

**Table 22.** Test results for dissimilar Word files with SiSe (hash vs. hash/file vs. hash/hash vs. file/file vs. file).

**Table 23.** Test results for dissimilar Word files with sdhash.

As in the case of dissimilar plain text files, the values returned by both SiSe and sdhash are very low and could be avoided by establishing a small threshold.

#### 6.2.3. BMP Images

In this test, three well-known colour test images obtained from Reference [54] have been used. The first one is Lenna's portrait (lenna.bmp), the second one is the photograph of a combat jet (airplane.bmp), and the third one displays several vegetables (pepper.bmp), as can be seen in Figure 4. The resolution of the three files is 512 × 512 pixels, and they use 24 bits per pixel, so their size is 786,486 bytes.

**Figure 4.** Color BMP test images.

While ssdeep and LZJD did not provide any positive results, sdhash returned a value of 1 when comparing pepper.bmp to the other two files and SiSe also returned a value of 1, but only when comparing lenna.bmp and pepper.bmp. Thus, the results are aligned with the expected behaviour, as the content of those files is clearly different.

#### *6.3. Special Signatures*

The aim of this test is to verify the behaviour of SiSe when it has to process some special signatures which represent cases that cannot be directly translated to regular input data files. Even so, we believe that it would be worthwhile to test SiSe and ssdeep in this scenario, as it represents different degrees of content modification. This test was not carried out with sdhash and LZJD as it was impossible to verify if the forged signatures truly reflected the nature of the tests, since their signature generation process is completely different.

Once again, it is important to point out that these special signatures do not represent actual files; instead, they have been developed ad-hoc, taking into account that the minimum length had to be 32 characters so they could be used with ssdeep. The special signatures for ssdeep are as follows:

S01: ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
S02: ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZ
S03: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
S04: 12345678901234567890123456ABCDEFGHIJKLMNOPQRSTUVWXYZ
S05: BADCFEHGJILKNMPORQTSVUXWZYbadcfehgjilknmporqtsvuxwzy
S06: CDABGHEFKLIJOPMNSTQRWXUVabYZefcdijghmnklqropuvstyzwx
S07: EFGHABCDMNOPIJKLUVWXQRSTcdefYZabklmnghijstuvopqrwxyz
S08: IJKLMNOPABCDEFGHYZabcdefQRSTUVWXopqrstuvghijklmnwxyz
S09: QRSTUVWXYZabcdefABCDEFGHIJKLMNOPwxyzghijklmnopqrstuv
S10: ghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZabcdef

The base element for this set is the first signature, S01. The second signature, S02, duplicates the first half of S01 in the second half. Signature S03 swaps the two halves of S01. In addition to this change, signature S04 replaces the first half of S03 with a string of digits. The remaining signatures, from S05 to S10, have been created by taking S01 and then applying transpositions of blocks whose sizes in characters are 1, 2, 4, 8, 16, and 32, respectively.

Bearing in mind that SiSe signatures use two characters per trigger point, the special signatures of ssdeep have been adapted accordingly. In that sense, each character has been doubled (e.g., A and a have been transformed into AA and aa, respectively) in the signatures employed with SiSe.

The results obtained when comparing these special signatures are included in Tables 24 and 25.


**Table 24.** Test results with ad-hoc signatures and SiSe.

**Table 25.** Test results with ad-hoc signatures and ssdeep.


From the results obtained, it can be concluded that SiSe provides meaningful results in all the comparisons, while this is not the case for ssdeep. For instance, the comparison between S01 and S07, which produces a score of 0 in ssdeep, is considered to have a similarity degree of 100% by SiSe. A similar situation occurs when comparing, for example, S02 and S06, where the results provided by SiSe and ssdeep are 50 and 0, respectively.

Besides, the results provided by SiSe are significantly better adapted to our similarity definition. For instance, when comparing S01 to S03 and S04, it is clear that S03 is almost the same string as S01, whilst S04 only shares with S01 half of its content. However, ssdeep is not able to detect that difference and assigns a value of 50% in both cases. In comparison, SiSe identifies the similarity degree as 100% and 50%, respectively.

#### *6.4. Suitability*

The goal of this test is to compare the running time and the signature size of the four applications when processing the following files:


The previous files were selected in order to cover a wide range of file size values, ranging from less than a megabyte to several gigabytes. Table 26 shows the running time in seconds when computing the signature of those files in a desktop with Ubuntu 18.04 operating system, an Intel i7-4790 processor (Intel Corporation, Santa Clara, CA, USA) at 3.60 GHz, and 16 gigabytes of RAM memory. All the applications have been executed using the Linux command time, and the output has been redirected to a file in all the cases in order not to penalize sdhash, as its signatures are very long and would provoke an important delay if they were to be printed on the screen.


**Table 26.** Signature generation time in seconds with the set of eight files tested.

As can be observed in Table 26, SiSe is only slightly slower than ssdeep with very small files, and clearly faster than the other applications with medium and large-sized files. It was not possible to process the larger files with LZJD, which suggests that the application is not optimized for files of that size.

Regarding signature length, Table 27 shows the size of the files that store the signatures, where each signature file contains only the hashes for the corresponding input file.


**Table 27.** Signature size in bytes with the set of eight files tested.

It is clear that ssdeep provides the shortest signatures, followed by SiSe. Both applications generate signatures of less than 1 kilobyte. On the other hand, in most cases, the size of the signature files of sdhash is several orders of magnitude larger, which could lead to a storage problem when processing volumes with large files. This is due to the proportional relationship between the signature length and the input size in sdhash [41], something that does not happen in the case of ssdeep and SiSe. In a test performed by Breitinger and Baier, the sdhash signature size was on average 3.3% of the size of the original file (under that ratio, a 1-gigabyte input would produce a signature of roughly 33 megabytes). Regarding LZJD, the signatures generated by that application always have the same size, but the application was not able to process the three larger files.

Finally, it should also be noted that the complete file path is included in the signatures of SiSe and ssdeep, while sdhash and LZJD only include the file name in their signatures.

#### **7. FRASH Tests**

As mentioned in Reference [4], one of the major difficulties when comparing approximate matching algorithms is the diversity of approaches and files tested by each proposal. With that problem in mind, Breitinger et al. designed and implemented a test framework, called FRASH, that could be used for comparing different proposals.

In order to complement the previous results with another set of tests used by other authors, we have included in this section the results obtained with FRASH and the t5 dataset [63], a well-known set of 4457 files (1.8 gigabytes) derived from the GovDocs corpus and designed to help test approximate matching algorithms [13].

The FRASH framework is implemented in Ruby 1.9.3 and requires a Linux environment to be run. In our evaluation of SiSe, ssdeep, sdhash, and LZJD we have used the same testing environment as in Section 6 (a desktop with Ubuntu 18.04 operating system, an Intel i7-4790 processor at 3.60 GHz, and 16 gigabytes of RAM memory). Instructions for installing the FRASH framework and compiling the four applications are included in SiSe's GitHub page [17].

FRASH allows us to perform five types of tests: efficiency tests, single-common-block correlation tests, fragment detection tests, alignment robustness tests, and random-noise-resistance tests. Information on how to launch those tests can be found in References [4,17].

FRASH includes the Ruby files associated with ssdeep and sdhash. In order to include SiSe and LZJD in the tests, it is necessary to extend FRASH so that it supports both applications. The files necessary for that task are available in Reference [17]. Those files include the headers used by SiSe and LZJD, as well as the methods for identifying the signature strings and retrieving the numerical result of a comparison.

Of all the tests implemented by FRASH, it has only been possible to perform the tests related to efficiency, alignment, and fragmentation with the four applications exactly as they are provided. The obfuscation test could only be completed with SiSe, ssdeep, and sdhash, due to the excessive time taken by LZJD (more than 3 days for processing just the first file). The single-common-block test could not be completed with any application except ssdeep, as the other algorithms remained stuck for days processing the first files of the batch being tested. For that reason, we have included the limited comparison regarding the obfuscation test, but have decided not to include the single-common-block test in our results.

#### *7.1. Efficiency Tests*

This test is composed of three parts, named runtime efficiency, fingerprint comparison, and compression [4]. Runtime efficiency measures the time needed by each application to read the input files and generate the corresponding fingerprints. Those fingerprints are stored by FRASH in a temporary file so they can be used in the fingerprint comparison, which measures the time needed by the algorithms to complete an all-against-all comparison of the fingerprints. Finally, compression measures the ratio between the input and the output of each algorithm and returns a percentage value.

Table 28 shows the summary of the efficiency test when using all the files of the t5 corpus as input (the screenshot and the text file with the results can be found in Reference [17]). This table contains the digest generation and all-pairs comparison times as well as the average digest length.


**Table 28.** Efficiency test results.

As the data included in the table show, SiSe is the fastest application for digest generation thanks to its double multi-threading capabilities. As expected due to its design (usage of two-character elements, a secondary digest larger than the primary digest, and lack of digest limits), the average length of the digests generated by SiSe is approximately six times that of ssdeep digests. It must be noted that FRASH does not compute the size of the digest as the size of the generated file: It removes both the header and the path of the file, and applies a factor of 3/4 to the Base64 string that forms each signature (a different approach from the one used in the tests described in Section 6.4, where the total length of the file was taken into account).
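
As an illustration, the following sketch reflects our reading of that normalization for an ssdeep-style signature line of the form `blocksize:hash1:hash2,"path"`; the parsing is deliberately simplified and the function name is ours, not part of FRASH:

```python
def effective_digest_length(signature_line: str) -> float:
    """Approximate the binary size of a digest the way FRASH reports it:
    drop the header (block size) and the file path, then scale the
    remaining Base64 payload by 3/4 (4 Base64 characters encode 3 bytes)."""
    payload = signature_line.rsplit(",", 1)[0]  # drop the quoted file path
    digest = payload.split(":", 1)[1]           # drop the block-size header
    return len(digest) * 3 / 4                  # the inner ':' is ignored here
```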

Finally, the all-pairs comparison shows that LZJD clearly outperforms the other algorithms, especially sdhash and SiSe. In the case of SiSe, the fingerprint comparison is slow because of the complex method used for comparing digests and its current implementation, though it is still approximately three times faster than sdhash.

#### *7.2. Alignment Tests*

This test analyses the impact of inserting byte sequences at the beginning of an input by means of fixed and percentage blocks [4]. Regarding the interpretation of the results in the case of percentage addition, a test associated with the value 100% means that new content of a size equal to the original has been added to the file. In that case, the fraction of the larger file (i.e., the modified file) that contains data included in the smaller file (i.e., the original file) is one half, so the result of the comparison should be around 50 following our similarity definition. Likewise, adding 300% of new content means that the result of the comparison should be roughly 25 using the same definition of similarity.
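
Under our similarity definition, the expected score after adding p% of new content reduces to a one-line helper (the function name is ours, not part of FRASH):

```python
def expected_alignment_score(p: float) -> float:
    """Expected similarity after adding p% of new content: the original
    data makes up 100 / (1 + p/100) percent of the modified file."""
    return 100.0 / (1.0 + p / 100.0)

assert expected_alignment_score(100) == 50.0  # doubling the file -> ~50
assert expected_alignment_score(300) == 25.0  # adding 300% of content -> ~25
```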

Table 29 includes an extract of the alignment test performed by FRASH over a random selection of 100 files taken from the t5 corpus (the list of files tested and the screenshots and text file with the results of the test can be found in Reference [17]). In each cell, the first element represents the average score and the second one the number of matches (where the maximum is 100, the number of files tested).


**Table 29.** Alignment robustness test with percentage blocks showing the score (first value) and the number of matches (second value) when testing 100 files taken from the t5 corpus.

While sdhash is able to provide a match in all the cases, its results are located in the narrow 70–80% band, which does not make it possible to differentiate between the testing scenarios. LZJD is also able to match all the cases, but it produces values significantly lower than expected in all the tests (e.g., in the 100% test, the result is 14.74%). Both SiSe and ssdeep provide results better adapted to those expected according to our definition of similarity, but SiSe clearly outperforms ssdeep in the number of matches.

#### *7.3. Fragment Detection Tests*

Fragment detection identifies the minimum correlation between an input and a fragment (i.e., it determines the smallest fragment for which the similarity tool reliably correlates the fragment and the original file). It sequentially cuts a certain percentage (which varies along the test) of the original input length and generates the matching score. This test includes two modes, sketched below: random cutting (the framework randomly decides whether to start cutting at the beginning or at the end of the input and then continues cutting randomly at either side) and end-side cutting (it only cuts blocks at the end of the input).
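
The two cutting modes can be approximated with the following sketch; the actual FRASH implementation differs in details such as the step sizes used, so the helpers below are only illustrative:

```python
import random

def end_side_cut(data: bytes, percentage: float) -> bytes:
    """Remove `percentage` percent of the input, always from the end."""
    keep = len(data) - int(len(data) * percentage / 100)
    return data[:keep]

def random_cut(data: bytes, percentage: float) -> bytes:
    """Remove `percentage` percent of the input, taking each slice at
    random from the beginning or the end of what remains."""
    to_cut = int(len(data) * percentage / 100)
    remaining = data
    while to_cut > 0:
        step = min(to_cut, max(1, len(data) // 100))  # illustrative step size
        if random.random() < 0.5:
            remaining = remaining[step:]   # cut a block at the beginning
        else:
            remaining = remaining[:-step]  # cut a block at the end
        to_cut -= step
    return remaining
```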

Table 30 includes an extract of the fragmentation tests performed by FRASH over the same random selection of 100 files belonging to the t5 corpus (the complete results can be found in Reference [17]). While SiSe and LZJD are able to match all the files, the scores of LZJD deviate significantly more from what is expected in several tests. In comparison to SiSe, both ssdeep and sdhash provide higher outputs, which from our point of view makes SiSe the best option for providing a value according to our similarity definition.


**Table 30.** Fragment detection test with two cutting modes (random start/end only) and number of matches in each case.

#### *7.4. Single-Common-Block and Obfuscation Tests*

Unfortunately, it has not been possible to complete the single-common-block and obfuscation tests in their original form with the four applications. Due to the design of the tests and the characteristics of the four algorithms evaluated, FRASH gets stuck in a loop when processing most of the files used in the previous sections.

Both tests evaluate the algorithms for each of the files entered as an argument, changing the files so that the score gradually decreases until it reaches the value 0 (which is the halting condition). However, even with files whose size is less than 1 megabyte, LZJD remains in that loop for several days with just one file, and the same happens (though to a lesser extent) with sdhash and SiSe. The result is that, after several days, there was no progress in the tests.

As an alternative, we were able to perform the obfuscation test with SiSe, ssdeep, and sdhash using a selection of seven files (each of a different file type) of minimum size from the t5 batch. The obfuscation test tries to determine the maximum number of changes that can be made while the match score remains no lower than a certain value, which allows an estimation of how many bytes need to be changed all over the input to produce a non-match (a sketch of the idea is given below). The results of that test are offered in Table 31.
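
Before turning to those results, the principle behind the test can be illustrated with the third-party ssdeep Python bindings (`ssdeep.hash` and `ssdeep.compare`); FRASH applies its changes and thresholds differently, so this is only a sketch of the idea:

```python
import os
import random

import ssdeep  # third-party Python bindings for ssdeep

def changes_until_non_match(data: bytes, threshold: int = 1) -> int:
    """Overwrite random single bytes until the ssdeep score against
    the original input drops below `threshold` (0 means a non-match)."""
    original_digest = ssdeep.hash(data)
    mutated = bytearray(data)
    changes = 0
    while ssdeep.compare(original_digest, ssdeep.hash(bytes(mutated))) >= threshold:
        mutated[random.randrange(len(mutated))] = random.randrange(256)
        changes += 1
    return changes

if __name__ == "__main__":
    sample = os.urandom(64 * 1024)  # 64 KiB of random content
    print(changes_until_non_match(sample))
```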


**Table 31.** Obfuscation resistance test showing the score (first value) and the number of matches (second value) when testing 7 files.

The results of this test show that sdhash is the most resistant of the three algorithms. While SiSe and ssdeep offer similar values for the first three cases displayed in Table 31, ssdeep does not produce any output for the remaining cases, whereas SiSe is able to provide a match in all of them.

#### **8. Conclusions**

In this contribution, a signature generation procedure based on ssdeep and a new algorithm for comparing two file signatures have been presented. We have implemented our tool, SiSe, to fulfil the requirements that any bytewise approximate matching application should satisfy, as indicated in Section 4.1:



In the implementation of SiSe we have limited the maximum signature length to 2560 double characters for performance reasons. However, given that the algorithm computes the initial block size with the goal of obtaining a leading signature of 64 double characters, the number of files potentially affected by that limitation is practically negligible.

Our algorithm, which can be applied to forensic operations in different types of digital media, shows results appropriately adapted to the similarity definition that we have used throughout this contribution. Our implementation is slower than ssdeep only when managing very small files, and faster than the other three algorithms in the rest of the cases, especially when computing the signature of large files. Additionally, the signatures generated by our algorithm are much smaller than those of sdhash and, to a lesser extent, LZJD, making it more suitable for processing sets of large files.

From a practical standpoint, SiSe is able to compare files without imposing any limitation on their respective sizes, which consequently allows us to use it in many different situations (e.g., detecting plagiarism, searching for a list of items inside a much larger document, looking for malware hidden in executable files, etc.). Besides, it is able to detect resemblance in some special cases where the other algorithms considered in this contribution fail.

In our opinion, according to the previously mentioned features, our proposal is a good alternative to the most common tools used nowadays (i.e., ssdeep, sdhash, and LZJD) for evaluating files, as it produces a clear value representing the similarity degree, especially when the goal is to obtain accurate information about the percentage of one file included in another.

**Author Contributions:** Elaboration, validation, reviewing, editing, and resource allocation, V.G.M., F.H.-Á., L.H.E. Funding acquisition, V.G.M., L.H.E. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the Ministerio de Economía, Industria y Competitividad (MINECO), in part by the Agencia Estatal de Investigación (AEI), in part by the Fondo Europeo de Desarrollo Regional (FEDER, UE) under Project COPCIS, Grant TIN2017-84844-C2-1-R, and in part by the Comunidad de Madrid (Spain) under Project reference P2018/TCS-4566-CM (CYNAMON), also cofunded by European Union FEDER funds. Víctor Gayoso Martínez would like to thank CSIC Project CASP2/201850E114 for its support.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


TLSH Trend Micro Locality Sensitive Hash

#### **References**





© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
