3.2.2. Approximate Matching

Approximate matching is a relatively new technique for determining the similarity between two digital objects. Many approximate matching techniques developed to address contemporary problems in digital forensics rest on the ability to describe objects as sets of features, which reduces the similarity problem to the well-defined domain of set operations [25]. Eight approximate matching algorithms are well known: ssdeep, sdhash, mrsh-v2, bbHash, mvHash-B, SimHash, saHash, and TLSH. The first three continue to be extended and remain relevant, whereas the next four are less promising for digital forensics for a variety of reasons, including recall and accuracy rates, runtime efficiency, and detection capabilities. The final algorithm, TLSH, is less powerful than sdhash and mrsh-v2 for cross-correlation [25]. Although ssdeep is the best-known context-triggered piecewise hashing (CTPH) algorithm in use today, another method, Multi-Resolution Similarity Hashing version 2 (MRSH-v2), has been proposed that builds on the same principles and enhances the original ssdeep algorithm [26].
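
To illustrate how a set-based representation turns similarity into a set operation, the following minimal Python sketch compares two byte strings by the Jaccard index of their feature sets. The choice of overlapping 4-byte substrings as features, and the function names `features` and `jaccard`, are assumptions made for illustration only; none of the algorithms listed above is claimed to work exactly this way.

```python
def features(data: bytes, n: int = 4) -> set[bytes]:
    """Return the set of all overlapping n-byte substrings.

    The n-gram feature choice is hypothetical, used only for illustration.
    """
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set[bytes], b: set[bytes]) -> float:
    """Jaccard index |A & B| / |A | B|; defined here as 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc1 = b"the quick brown fox jumps over the lazy dog"
doc2 = b"the quick brown fox leaps over the lazy dog"

# Two inputs that differ in one word share most, but not all, features.
score = jaccard(features(doc1), features(doc2))
print(f"similarity: {score:.2f}")
```

A score of 1.0 means the two feature sets are identical and 0.0 means they are disjoint; values in between reflect partial overlap.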

In ssdeep, the similarity of two files is computed from their signatures during the comparison process [26]. ssdeep takes the two signature strings and calculates the minimum number of operations required to convert one into the other, using an edit distance based on the Levenshtein distance. While ssdeep is very effective at detecting similarities between text files, it has a poor detection rate for images and is open to exploitation by an active adversary [23]. In comparison, sdhash (similarity digest hash) selects features that have a low empirical probability of occurring by chance and encodes their hashes in Bloom filters. Its result is a similarity score, calculated by computing the normalized entropy of the digests, that ranges from 0 to 100, with 0 indicating a mismatch and 100 a perfect or near-perfect match. The sole drawback reported for sdhash was its execution time [23].
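
The edit-distance idea underlying the ssdeep comparison can be sketched as follows. This is the plain, unweighted Levenshtein distance computed by dynamic programming; the function name `levenshtein` is chosen for illustration, and the way ssdeep itself weights operations and rescales the distance into a match score is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # prints 3
```

Because only two rows of the dynamic-programming table are kept, memory use grows with the length of one string rather than with the product of both lengths.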

MRSH-v2 overcomes the limitations of ssdeep and is faster than sdhash [25]. Its main objective is to compress any byte sequence into a similarity digest. Similarity digests are constructed so that they can be compared with one another to produce a similarity score, and each digest is composed of Bloom filters. To generate a digest, MRSH-v2 divides the input into chunks of roughly 160 bytes, using a seven-byte window that slides across the input byte by byte to determine the chunk boundaries. Each chunk is hashed with FNV (Fowler/Noll/Vo), a fast non-cryptographic hash function, and the resulting sub hash determines the five bits to set in the Bloom filter. Approximate matching is then performed by comparing similarity digests. Comparing two sets of files requires a pairwise comparison of their digests [27,28].
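
The digest construction described above can be sketched in Python as follows. The window size (7), target chunk size (160), and number of bits per chunk (5) come from the text; everything else is an assumption made for illustration: the FNV-1a 64-bit variant, the boundary rule (window hash modulo 160), the 2048-bit filter size, the use of a single Bloom filter per input, and the simplified similarity score. It should be read as a sketch of the general technique, not as the MRSH-v2 reference implementation.

```python
WINDOW = 7         # sliding-window size in bytes (from the text)
BLOCK = 160        # target average chunk size in bytes (from the text)
K = 5              # bits set per chunk (from the text)
BLOOM_BITS = 2048  # assumed Bloom filter size, a power of two

FNV_OFFSET = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001B3        # FNV-1a 64-bit prime


def fnv1a_64(data: bytes) -> int:
    """FNV-1a 64-bit hash (the exact FNV variant is an assumption)."""
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h


def chunks(data: bytes):
    """Yield content-defined chunks. A boundary is declared whenever the hash
    of the last WINDOW bytes is congruent to BLOCK - 1 modulo BLOCK
    (hypothetical rule; the window hash is recomputed from scratch for clarity
    instead of being updated incrementally)."""
    start = 0
    for end in range(WINDOW, len(data) + 1):
        if fnv1a_64(data[end - WINDOW:end]) % BLOCK == BLOCK - 1:
            yield data[start:end]
            start = end
    if start < len(data):
        yield data[start:]  # trailing bytes after the last boundary


def bit_positions(chunk: bytes) -> list[int]:
    """Split the chunk's sub hash into K indices into the Bloom filter."""
    h = fnv1a_64(chunk)
    width = BLOOM_BITS.bit_length() - 1  # 11 bits per index for 2048
    return [(h >> (k * width)) & (BLOOM_BITS - 1) for k in range(K)]


def digest(data: bytes) -> int:
    """Return one Bloom filter, stored as an integer bit set, for the input."""
    bloom = 0
    for chunk in chunks(data):
        for pos in bit_positions(chunk):
            bloom |= 1 << pos
    return bloom


def similarity(d1: int, d2: int) -> float:
    """Simplified score: shared set bits relative to the sparser digest."""
    smaller = min(bin(d1).count("1"), bin(d2).count("1"))
    if smaller == 0:
        return 0.0
    return bin(d1 & d2).count("1") / smaller
```

Because chunk boundaries depend only on the content inside the window, inserting or deleting bytes disturbs the chunks near the change while later chunks tend to realign, which is what allows two partially identical inputs to share Bloom filter bits.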

In a hierarchical Bloom filter tree, the root node is a Bloom filter that represents the whole collection. When searching for an image, a match at the root of the tree means that the tree's child nodes can then be searched. Determining whether a file matches a Bloom filter node works much like adding a file to the tree: rather than inserting each sub hash into the node, the sub hashes are tested against the Bloom filter to see whether they are already contained in it. If a node contains a certain number of consecutive sub hashes, it is considered a match [27].
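
A minimal Python sketch of such a tree lookup is given below. It assumes a binary tree with one file per leaf, a 2048-bit filter stored as an integer, SHA-256 as a stand-in for deriving bit positions from a sub hash, and a threshold `min_run` for the number of consecutive sub hashes; all of these choices, and the names `Node`, `build`, and `search`, are hypothetical and are not taken from [27].

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BLOOM_BITS = 2048  # assumed filter size


def bit_positions(sub_hash: bytes, k: int = 5) -> List[int]:
    """Derive k filter indices from a sub hash (SHA-256 is a stand-in)."""
    d = hashlib.sha256(sub_hash).digest()
    return [int.from_bytes(d[2 * i:2 * i + 2], "big") % BLOOM_BITS
            for i in range(k)]


@dataclass
class Node:
    bloom: int = 0                        # Bloom filter as an integer bit set
    children: List["Node"] = field(default_factory=list)
    label: Optional[str] = None           # file name, set on leaves only

    def insert(self, sub_hashes: List[bytes]) -> None:
        for s in sub_hashes:
            for pos in bit_positions(s):
                self.bloom |= 1 << pos

    def contains(self, sub_hash: bytes) -> bool:
        return all((self.bloom >> pos) & 1 for pos in bit_positions(sub_hash))

    def matches(self, sub_hashes: List[bytes], min_run: int) -> bool:
        """True if at least min_run consecutive sub hashes are in the filter."""
        run = 0
        for s in sub_hashes:
            run = run + 1 if self.contains(s) else 0
            if run >= min_run:
                return True
        return False


def build(files: List[Tuple[str, List[bytes]]]) -> Node:
    """Build the tree top-down; every node covers all files beneath it."""
    node = Node()
    for _, sub_hashes in files:
        node.insert(sub_hashes)
    if len(files) == 1:
        node.label = files[0][0]
    else:
        mid = len(files) // 2
        node.children = [build(files[:mid]), build(files[mid:])]
    return node


def search(node: Node, sub_hashes: List[bytes], min_run: int = 2) -> List[str]:
    """Descend only into subtrees whose filter matches the query."""
    if not node.matches(sub_hashes, min_run):
        return []
    if node.label is not None:
        return [node.label]
    return [hit for child in node.children
            for hit in search(child, sub_hashes, min_run)]
```

A query that fails at the root is rejected after a single filter test, so the cost of searching a collection grows with the depth of the tree for the matching branches rather than with the number of stored files. As with any Bloom filter, false positives are possible but false negatives are not.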
