1. Introduction
Third-party libraries (TPLs) have been widely adopted to boost production efficiency and reduce artificial costs during software development.
However, TPL reuse may introduce vulnerable or malicious code and expose the software, which reuses them to potential risks. According to the report of Sonatype [
1], open source vulnerabilities are prevalent among popular projects, 29% of which contain at least one known vulnerability. For example, the remote code execution vulnerability (CVE-2021-30116) was found in the Kaseya VSA. The REvil hacker group used this vulnerability to deploy the ransomware in local clients and launch a large-scale extortion attack against related companies [
2]. In addition, attackers no longer wait for the public disclosure of vulnerabilities to exploit but actively inject new vulnerabilities into open source projects that support the global supply chain. For example, the popular NPM UA-Parser-JS library was hijacked and planted with malicious code by hackers, resulting in numerous Windows and Linux devices infected with cryptocurrency mining software and password-stealing trojans [
3]. In addition, this library is part of the supply chain for many well-known software and enterprises. It will therefore be used in user-facing end-use applications, which will spread to millions of computers. Generally, researchers have no access to the program source code containing the TPLs and are limited to analyzing the binary executables. Therefore, there is an urgent need to discover the TLPs in binary executables.
Researchers have recently proposed different methods to detect TPLs [
4,
5,
6,
7,
8,
9,
10,
11]. For example, OSSPolice [
5] extracts both string literals and exported function names to detect TPLs reused by the target binary program, and B2SFinder [
7] extracts seven kinds of code features to carry out TPL detection. ModX [
11] extracts string literals, function call graph (FCG), and functions accessing the same data to perform TPL detection.
However, we identify three major limitations of the existing TPL detection. First, at present, the third-party detection methods mainly rely on syntactic features (function names [
5,
6], strings [
4], constants [
7] etc.). On the one hand, some methods need function name information, which can be stripped away in binaries. On the other hand, with the increase of the third-party component database, the syntactic characters (e.g., strings and constants) increase, and the uniqueness and validity of the characters decrease, resulting in low accuracy of recognition. Moreover, obfuscation techniques (such as string encryption) can discount the detection performance for syntactic-based methods. Second, in addition to syntactic features, a few studies (such as ModX [
11]) focus on the function call graph (FCG) and some functions’ semantic features (accessing the same data) to detect third-party libraries, but the comparison of FCG causes a lot of time consumption and the functions selected for detection are based on syntactic features, causing performance degradation when the syntactic features are obfuscated. Third, most existing TPL detection methods [
4,
5,
6,
7,
9,
10] focus on the source code form of libraries. However, many third-party libraries have difficulty downloading their source code from the Internet (closed source libraries). In addition, some code in the source code will not be compiled into the binary executable file, such as the “text” part of the source code; thus, features extracted from the uncompiled source code will cause feature interference.
To tackle the above problems, we propose and implement a more precise and scalable TPL detection method, named BBDetector, which focuses on the binary form of third-party libraries. To solve the problem of TPL detection considering only syntactic features, we also consider the rich semantic features of functions contained in the libraries for more accurate matching. Moreover, in terms of efficiency, we do not consider the expensive FCG matching but consider all the function semantics. However, the function semantic matching may introduce the problem of large time consumption with the expansion of third-party libraries. To deal with fast and scalable function semantic matching, we first learn the rich semantic information of functions using the presentation learning model [
12] and form a feature vector for each function. Then, we design a scalable function vector similarity search method, which can support fast vector similarity retrieval when facing a large-scale function vector database. Moreover, we use the function vector similarity search method to identify function pairs (i.e., anchor functions) that are highly similar and likely to have the same functionalities and the candidate libraries. Then, we carry out TPL detection based on these anchor functions and the candidate libraries.
To demonstrate the efficacy of BBDetector in TPL detection, we collect two types of datasets: Dataset I and Dataset II. Dataset I is a binary target program dataset with known TPL reused relationships, which includes programs built by the nix [
13] package manager and a set of manually building binaries on Ubuntu 20.04. Dataset II is a TPL dataset, which is used to build third-party library databases. In addition, we use effectiveness (including precision, recall, and F1-score), efficiency, and code obfuscation-resilient capability as the evaluation metrics. We compare BBDetector with the state-of-the-art binary-to-binary TPL detection method ModX [
11] and binary-to-source TPL detection method B2SFinder [
7]. The experiment results show that BBDetector outperforms B2SFinder and ModX in terms of effectiveness, efficiency, and obfuscation-resilient capability. For the nix binaries, the F1-score of BBDetector is 1.11% and 11.21% higher than that of ModX and B2SFinder, respectively. In addition, for the Ubuntu binaries, the F1-score of BBDetector is 1.32% and 14.93% higher than that of ModX and B2SFinder, respectively. Moreover, in terms of efficiency, the detection time of BBDetector is only 30.02% of ModX. Furthermore, for the obfuscation-resilient capability, BBDetector is much stronger than B2SFinder. BBDetector achieves an F1-score of 71%, slightly lower than that of 77% achieved with the non-obfuscated binary programs. However, B2SFinder only achieves an F1-score of 28%, much lower than that of 67% achieved with the non-obfuscated binary programs.
In summary, our contributions are summarized as follows: Current TPL detection methods mainly rely on syntactic features, which results in low recognition accuracy and significantly discounted detection performance by obfuscation techniques. A few semantic-based approaches (such as FCG graph) face the efficiency problem. We propose a more precise and scalable TPL detection method to resolve these problems. To improve accuracy, in addition to syntactic features, we also consider the rich function-level semantic features and form a feature vector for each function. Moreover, in terms of efficiency, we design a scalable function vector similarity search method to identify anchor functions and the candidate libraries. Then, we perform TPL detection based on these anchor functions and the candidate libraries. We implement and evaluate BBDetector with binary target programs with the known TPL reused relationships and TPL dataset. Moreover, the experiment results show that BBDetector outperforms B2SFinder and ModX in terms of three metrics: effectiveness, efficiency, and obfuscation-resilient capability.