Searching Open-Source Vulnerability Function Based on Software Modularization

Guo, Xixi; Cai, Ruijie; Yin, Xiaokang; Shao, Wenqiang; Liu, Shengli

doi:10.3390/app13020701


by Xixi Guo, Ruijie Cai *, Xiaokang Yin, Wenqiang Shao and Shengli Liu

State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(2), 701; https://doi.org/10.3390/app13020701

Submission received: 1 December 2022 / Revised: 28 December 2022 / Accepted: 30 December 2022 / Published: 4 January 2023


Abstract

Reusing vulnerable open-source components can lead to security problems. At present, open-source component detection for binary programs can only reveal whether vulnerable open-source components are reused; it cannot determine the specific location of the vulnerabilities. To address this problem, we propose BMVul, an open-source vulnerability function detection method for binary programs based on software modularization. BMVul performs binary modularization with DBM, an overlapping clustering method based on directed graphs, and then uses feature comparison to carry out modular software component analysis. After creating the vulnerability function set of an open-source component through function signatures, BMVul detects vulnerability functions in the binary modules that reuse open-source components. The experimental results show that, compared with component detection based on Louvain modularization and with B2SFinder, BMVul improves the precision by 3.16% and 59.57%, respectively. Moreover, the precision of unique binary module matching is improved by 39.43% compared with the Louvain method, and the F1 score is improved by 8.45% compared with B2SFinder. Module-level detection narrows the search space of vulnerability functions, thereby reducing the workload of open-source vulnerability detection, which is of great significance for software security analysis.

Keywords: open-source vulnerability; binary-to-source matching; program modularization; community detection; software security

1. Introduction

Open-source code reuse can speed up program development, and programs increasingly rely on open-source components. Studies [1,2,3,4] indicate that commercial software also reuses a large number of open-source components, such as libraries in Internet of Things firmware and the Linux kernel. However, the reuse of vulnerable open-source components causes security issues. For example, Heartbleed [5] (CVE-2014-0160) was discovered in version 1.0.1 of OpenSSL [6] and affected millions of software products (e.g., LibreOffice [7], VMware [8]) and devices (e.g., routers, switches, firewalls). VULDEFF [9] proposes a vulnerability-detection method based on function fingerprints and code differences. VGRAPH [10] designs an accurate approximate matching algorithm that is capable of detecting modified vulnerable code clones and differentiating them from their patched counterparts. Similarly, VELVET [11] and Bgnn4vd [12] find vulnerable code reuse in source code.

To detect the reuse of open-source components, researchers have proposed a series of methods based on binary-to-source matching. OSSPolice [1] identifies open-source components through strings and exported function names and checks whether the components have certain types of vulnerabilities, in order to determine whether the components reused in the program are vulnerable versions. B2SFinder [2] matches string, integer, and control-flow features with different weighting algorithms to detect the reuse of open-source components in commercial software and to analyze potential vulnerabilities of the reused components. B2SMatcher [3] uses program-level features for rough matching to distinguish single-version from multiversion reuse, then uses function-level features for exact matching. Some applications have been found to reuse vulnerable versions of open-source components. Nevertheless, these methods only provide component-level information (e.g., which vulnerable versions of open-source components are reused) and do not give the specific location of the vulnerable function in the binary.

Usually, it is hard to detect vulnerability in binary executables. Consequently, few methods detect reused binary vulnerability functions. With the source code of the vulnerable open-source components, researchers [13,14] can extract source-level information in open-source components to assist vulnerability detection in binary executables. According to the features of vulnerable functions in open-source components, they search for the reused vulnerable binary functions. However, all binary functions need to be compared with source vulnerability functions to determine the binary vulnerability functions, which is time consuming.

To solve the above problems, we propose BMVul, a binary modularization-based vulnerability function detection method for open-source vulnerabilities. First, we extract the directed call graph of the binary functions. Then, we cluster the functions with an overlapping community detection technique based on the statistical-significance method OSLOM [15], which integrates modularity and information-theoretic algorithms. On top of the resulting modules, we carry out feature-based software component analysis. According to the modular software component analysis results, we detect vulnerability types and specific vulnerability functions in open-source components through function signatures, then search for the corresponding binary vulnerability functions in the modules that reuse those components.

To evaluate the effectiveness of BMVul, we collect a dataset from ISRD [16] and ModX [17]. The experimental results show that BMVul significantly outperforms existing methods. The precision of BMVul is 3.16% and 59.57% higher than that of component detection based on Louvain [18] and of B2SFinder, respectively. Moreover, the precision of matching a unique binary module is improved by 39.43% compared with Louvain detection. The F1 score increases by 8.45% over B2SFinder. Module-level function matching greatly reduces the workload of open-source vulnerability function detection and finds vulnerability functions reused in the binary program.

Overall, this paper makes the following major contributions.

We propose a binary function clustering method for directed graphs, named binary modularization based on directed graph (DBM), which divides binary modules based on overlapping community detection ideas of OSLOM. Then we carry out modular software component analysis through matching features between binary modules and source code.
We accurately locate specific source vulnerability functions in open-source components through function signature technology, then match vulnerability functions in binary modules reusing components, which narrows the scope of open-source vulnerability detection.
We implement the open-source vulnerability detection prototype BMVul. The results show that BMVul is superior to current detection methods: it can detect open-source vulnerability functions in binary modules, which is of great significance to software security work.

The rest of this paper is organized as follows. Section 2 describes the research status in related fields. Section 3 introduces the proposed model BMVul. Section 4 evaluates the performance of BMVul in open-source component detection and vulnerability function detection against state-of-the-art work. Section 5 concludes the paper.

2. Related Work

2.1. Software Modularization

Software modularization is the clustering of software entities, such as classes, modules, and files, which divides entities with similar characteristics or the same functionality into the same module (cluster). Modules have the characteristics of high cohesion and low coupling, which is convenient for the research and analysis of software structure and functionality realization. Software information recovery, software reconstruction, and software component identification are the top three applications of software modularization. Table 1 summarizes these methods.

2.1.1. Software Information Recovery

Mohammadi et al. [19] propose a neighborhood tree algorithm, which creates a tree based on the available neighbors in the dependency graph and then clusters the nodes. This method is superior to search-based and hierarchical algorithms because of its high stability. ATLBO [20] solves the software module clustering problem with fuzzy adaptive teaching-learning-based optimization (ATLBO). Because this method adaptively selects search operators, it achieves better performance. Sun et al. [21] propose a software module clustering algorithm based on probability selection that establishes a network model of the software system structure, which transforms software module clustering into graph clustering. Furthermore, it enlarges the search space through multipath and iterative operations. The method achieves good clustering quality, convergence speed, and stability.

2.1.2. Software Restructuring

Hatami et al. [22] divide dependent modules into the same cluster through the ant colony optimization (ACO) algorithm, which has better stability and a higher convergence effect. Similarly, Varghese et al. [23] traverse the software structure based on the ACO algorithm to modularize the software.

2.1.3. Software Component Identification

Psarras et al. [24] use a semantic clustering algorithm optimized based on extracting topic purity scores, which reduces the developer’s need for complex parameter allocation. A postprocessing technique is used to incorporate extracted topics into classes to help developers make decisions. Third-party library reuse has become crucial in modern software development. Therefore, LibCUP [25] implements automated detection for multilayer libraries by using a mode based on a variant of the standard clustering algorithm.

2.2. Open-Source Vulnerability Detection

Open-source components are core components of software systems. Software systems may contain security vulnerabilities because they rely on open-source components. Software weaknesses are hidden in plain sight and are easily attacked. The number of open-source vulnerabilities discovered and disclosed is growing as the reuse of open-source components increases. The research objects are mainly Android applications and commercial software in binary form.

Brahmastra [26] is an application automation tool designed to help security researchers test third-party components in mobile applications at run time. Brahmastra is a passive approach that uses dynamic drivers to trigger bug code given a particular vulnerable version of an open-source component. OSSPolice [1] is a fully automated tool that detects third-party component cloning in Android applications. It can quickly analyze application binaries and introduces a novel layered indexing scheme to identify potential software license violations and the use of known vulnerable component versions. LibScout [27] is a library detection technique that resists code obfuscation and can pinpoint the library version used in an application. This approach is the first to quantify the security impact of third-party libraries on the Android ecosystem. B2SMatcher [3] is a method for fine-grained version identification of open-source software in binary files. It uses program-level features for rough matching to identify the reuse type, namely single-version or multiversion reuse, then uses function-level features for exact matching. It extracts source code features from compile-related source files with machine learning methods such as K-means clustering and decision trees. B2SMatcher finds that some popular applications, such as Zoom and TeamViewer, reuse vulnerable open-source component versions. FOSSIL [28] is a resilient and efficient system that uses a Bayesian network to achieve greater resilience to code obfuscation and can find open-source packages in malware binaries that match those listed in security and reverse-engineering reports. FIBER [13] proposes a fine-grained patch presence test, which takes similarity-based bug search to a new level. However, FIBER does not take open-source component variants into account, which may require compiling and searching a large number of binaries. The open-source vulnerability detection methods are summarized in Table 2.

In general, developers use outdated open-source components and are less aware of potential risks. Therefore, the study of open-source vulnerability can provide practical insights for the sustainable improvement of software systems.

Therefore, we propose a directed overlapping modularization method for binary code. Up to now, software module clustering has mainly targeted source code, and work on binary modularization is scarce. BCD [29] divides binary modules with FN [30], a fast community detection method based on a greedy algorithm, and achieves optimal detection results by measuring modularity. ModX [17] clusters binary functions with the modularity-based Louvain algorithm. This work assumes that binary modules do not overlap. However, a function can be called by several different functions, so a function may belong to more than one module; moreover, the Louvain algorithm only works on undirected graphs. In addition, matching the set of potential source vulnerability functions within a binary module reduces the search scope in software security analysis compared with vulnerability function detection over the whole binary program.

3. Methodology

BMVul is a method that detects open-source vulnerability functions based on software modularization. We divide BMVul into the following two stages (Figure 1):

Software component analysis based on program modularization. In the software component analysis stage, we extract the directed function call information of the binary program and represent the information in the form of a graph. Then, we cluster functions based on the overlapping community detection algorithm OSLOM. Next, we extract string-type features and function complex branch sequences of binary modules and open-source components to match each module of a binary program with open-source components. It identifies the corresponding relationship between binary modules and open-source components and narrows the positioning range of components.
Open-source vulnerability detection. In the open-source vulnerability detection stage, we create an open-source component vulnerability function set through function signature, which uses function hash and code normalization techniques. Then we search vulnerability functions in the binary module that reuses open-source components.

3.1. Software Component Analysis Based on Program Modularization

3.1.1. Program Modularization Method Based on Directed Graph

Software modularization aims at function clustering. It performs a cluster analysis on the set of functions so that functions defined in proximity to one another and functions that frequently call one another belong to the same cluster.

Modularization refers to the process of dividing a program into relatively self-contained components or modules. Typically, each module encapsulates a fundamental set of related functionalities. However, this modular structure is lost during compilation because the compiler merges all functions into one binary file. The resulting large, flat set of functions is inconvenient for binary code analysis. Therefore, we propose a binary code modularization method to narrow the search scope in software security analysis.

The current binary modularization work divides modules for undirected graphs without considering overlapping communities. Therefore, given these limitations, we propose an OSLOM-based DBM for directed graphs and overlapping modules. Unlike most binary analysis work, which only requires local analysis, binary program modularization is a global understanding of the program. Therefore, we extract the function call graph (FCG), which is defined as Equation (1),

$$G = (V, E, W) \tag{1}$$

$$E = \{(a, b) \mid a, b \in V\} \tag{2}$$

$$W = \{W_{ab} \mid a, b \in V\}, \tag{3}$$

where V is the set of all functions, E is the set of calling edges between functions, defined in Equation (2) and pointing from the caller to the callee, and W is the set of edge weights, defined in Equation (3). The more often one function calls another, the greater the probability that they implement the same functionality, that is, that they belong to the same module.

The directed edge call weight from a to b expresses call times from function a to function b, which is defined as Equation (4),

$$W_{ab} = \begin{cases} n_{ab}, & \text{if } (a, b) \in E_{ab} \\ 0, & \text{otherwise,} \end{cases} \tag{4}$$

where $W_{ab}$ is the edge weight from function a to function b, $n_{ab}$ is the number of calls from a to b, and $E_{ab}$ is the set of calling edges from a to b.
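As a minimal sketch of Equations (1) to (4), the weighted directed call graph can be built by counting call sites. The input format (a list of caller/callee pairs) and the names `build_call_graph` and `edge_weight` are assumptions made for illustration; the paper does not specify how call sites are extracted from the binary.

```python
from collections import Counter

def build_call_graph(call_sites):
    """Build G = (V, E, W) from an iterable of (caller, callee) call sites.

    The (caller, callee) input format is an assumption for this sketch.
    """
    weights = Counter(call_sites)                    # W: (a, b) -> n_ab
    edges = set(weights)                             # E: directed edges caller -> callee
    vertices = {f for edge in edges for f in edge}   # V: every function seen
    return vertices, edges, weights

def edge_weight(weights, a, b):
    """Equation (4): n_ab if (a, b) is a calling edge, otherwise 0."""
    return weights.get((a, b), 0)

# Hypothetical call sites: main calls parse twice, parse calls read once.
call_sites = [("main", "parse"), ("main", "parse"), ("parse", "read"), ("main", "log")]
V, E, W = build_call_graph(call_sites)
print(edge_weight(W, "main", "parse"))  # 2
print(edge_weight(W, "parse", "main"))  # 0, because the graph is directed
```

The second lookup returns 0 because the edge direction matters, which is the property an undirected formulation would lose.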

We take the directed function call graph as the input to module partitioning and use the DBM method, based on the OSLOM idea, to divide binary modules. The OSLOM algorithm uses significance [31] as the measure for evaluating a cluster (module). Significance is defined as the probability of finding the cluster in a random null model, that is, in a class of graphs without community structure. This is the same null model as the one used in modularity optimization [32], and the score indicates how likely the community is to emerge in a randomized network.

The realization of the OSLOM-based DBM method is divided into three steps, as shown in Figure 2.

Step 1: We use a significance score to detect important modules until they converge. The initial node is a single function. We calculate the probability of adding adjacent nodes to the node, then delete unimportant nodes. We set the convergence threshold to 0.1, which can achieve the best performance.

Step 2: Based on the set of modules in Step 1, we detect the internal structure of modules or possible mergers between modules to find the minimum clustering result.

Step 3: Detecting the hierarchical structure of modules. The above steps form fundamental function clustering results, in which each module becomes a new node. If there are edges between two nodes, a new edge is formed between them, and the edge weight is the sum of the edge weights between them. A new supernetwork emerges again, by that analogy, until the process no longer produces new modules.
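The collapsing operation in Step 3 can be sketched as follows. This is only an illustration of how modules become super-nodes with summed edge weights, not the OSLOM implementation; the function name `collapse` is hypothetical, and letting an overlapping function contribute to every module it belongs to is an assumption.

```python
from collections import defaultdict

def collapse(weights, modules):
    """Collapse a weighted directed graph into a super-graph of modules.

    weights: dict mapping (caller, callee) -> call count
    modules: dict mapping module name -> set of functions (sets may overlap)
    Assumption: a function in several modules contributes to each of them.
    """
    membership = defaultdict(set)
    for name, funcs in modules.items():
        for f in funcs:
            membership[f].add(name)

    super_weights = defaultdict(int)
    for (a, b), w in weights.items():
        for ma in membership[a]:
            for mb in membership[b]:
                if ma != mb:  # keep only edges between distinct modules
                    super_weights[(ma, mb)] += w
    return dict(super_weights)

# Hypothetical four-function graph split into two modules.
weights = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 3, ("a", "d"): 1}
modules = {"M1": {"a", "b"}, "M2": {"c", "d"}}
print(collapse(weights, modules))  # {('M1', 'M2'): 2}
```

Repeating this step on the resulting super-graph until no new modules appear gives the hierarchy described above.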

The implementation process of OSLOM can integrate various community detection technologies, such as the heuristic method Infomap [33] based on the random walk, the overlapping community detection method Copra [34] based on label propagation, and the Louvain algorithm based on the concept of modularity. The module output of one or more of the above algorithms can be used as input information for DBM to perform subsequent module partitions. The more algorithms the process integrates, the better the final module partitioning.

3.1.2. Software Component Identification

Binary software component identification matches binary code with source code to determine whether the binary reuses open-source components. At present, there are two kinds of comparison. One is to detect the similarity between binary code and source code directly. The other is to determine the compilation provenance of the binary program (e.g., optimization level, architecture, compiler), then compile the source code into binary form and reduce the problem to binary similarity comparison. However, the latter faces many possible combinations of compilation configurations, so it is difficult to detect the compilation provenance accurately. In addition, it is hard to implement because the success rate of automatic compilation is low. Therefore, we carry out a feature-based comparison between binary and source code, relying on features that exist in both binary and source code and are not easily affected by compilation optimization. We select string-type features (e.g., strings, exported function names) and complex branching sequences in functions (e.g., if/else, switch/case).

In the similarity detection phase between a binary module and source code, string-type features are considered equivalent when the binary module feature is identical to the source code feature. We match if/else features by the length of the longest common subsequence; they are equivalent when the length of the longest common subsequence exceeds the threshold. For switch/case, we compare the switch/case in the module with the unordered list of switch/case labels, including the default branch, in the source code. The thresholds involved in feature matching are determined empirically. When the matching score of the features exceeds the corresponding threshold, the module is determined to reuse the component.
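The two branch-feature comparisons can be sketched as below. The representation of a branch sequence as a list of tokens, the normalization by the source sequence length, and the threshold value 0.75 are all assumptions for illustration; the paper only states that the thresholds are chosen empirically.

```python
from collections import Counter

def lcs_length(xs, ys):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(ys) + 1)
    for x in xs:
        curr = [0]
        for j, y in enumerate(ys, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def branches_match(binary_seq, source_seq, threshold=0.75):
    """if/else match: the LCS ratio must exceed a (hypothetical) threshold."""
    if not source_seq:
        return False
    return lcs_length(binary_seq, source_seq) / len(source_seq) > threshold

def switch_match(binary_cases, source_cases):
    """switch/case match: compare the labels as unordered collections."""
    return Counter(binary_cases) == Counter(source_cases)

# Hypothetical branch sequences that differ in one element.
print(lcs_length([1, 2, 3, 4, 5], [1, 2, 9, 4, 5]))                  # 4
print(branches_match([1, 2, 3, 4, 5], [1, 2, 9, 4, 5]))              # True
print(switch_match([3, 1, 2, "default"], [1, 2, 3, "default"]))      # True
```

Using the longest common subsequence rather than exact equality tolerates branches that the compiler removed or reordered locally, while the unordered comparison for switch/case reflects that case order is not preserved in the binary.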

3.2. Open-Source Vulnerability Detection

3.2.1. Open-Source Component Vulnerability Function Detection

Source code preprocessing based on normalization. Most software developers reuse open-source components with code or structural changes, and open-source components are constantly updated to provide better functionality. However, internal and external open-source component changes can lead to syntactic diversity of vulnerable code. We can address the syntactic diversity problem of vulnerability code clones by using the signature database in Movery [35]. Consequently, this paper uses the vulnerability signature database to perform vulnerability function detection for open-source components.

The signature database is generated by the key techniques of function collation and core code line extraction, and it includes vulnerability signatures and patch signatures. During the generation of the signature database, essential and dependent vulnerable code lines are extracted to generate extensible vulnerability signatures that address the syntax diversity caused by internal open-source component modifications. Then, critical lines of code, dependent lines of code, and control flow lines of code are extracted from vulnerability and patch functions to address the syntax diversity caused by external open-source component changes. Finally, the vulnerability signature and patch signature are generated based on the extracted contents.

Before vulnerability detection, we generate the signature for the target component through function hash and code normalization. Hash values and the path information of functions will be stored in the function hash file. Then we remove white spaces and comments and convert upper-case characters to lower-case to parse all function code lines. Unnormalized and normalized code line forms are stored in the function signature.
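A minimal sketch of the normalization and hashing step is shown below. The choice of SHA-256, the decision to hash the normalized text, and the names `normalize_line` and `function_signature` are assumptions; the paper does not name the hash function. The comment handling is deliberately simple (it ignores multi-line comments and comment markers inside string literals).

```python
import hashlib
import re

def normalize_line(line):
    """Strip comments and all whitespace, then lower-case the line."""
    line = re.sub(r"/\*.*?\*/", "", line)  # single-line /* ... */ comments
    line = re.sub(r"//.*", "", line)       # // comments
    return re.sub(r"\s+", "", line).lower()

def function_signature(body, path="<unknown>"):
    """Return the hash, path, raw lines, and normalized lines of a function."""
    raw = body.splitlines()
    normalized = [n for n in map(normalize_line, raw) if n]
    # Assumption: the hash is taken over the normalized text (SHA-256 here).
    digest = hashlib.sha256("".join(normalized).encode()).hexdigest()
    return {"hash": digest, "path": path, "raw": raw, "normalized": normalized}

# Two hypothetical variants that differ only in spacing, case, and comments.
a = "int Foo(int A) {\n    return A + 1; /* inc */\n}"
b = "int foo(int a){\n  return a+1;\n}\n"
print(normalize_line("  int  X = 0; // init"))                        # intx=0;
print(function_signature(a)["hash"] == function_signature(b)["hash"])  # True
```

Because both variants normalize to the same lines, they receive the same hash, which is what lets the signature survive cosmetic edits to the reused code.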

Vulnerability function detection. We compare the target open-source component signature with the vulnerability signature database to detect the vulnerability code clone in the component to determine the vulnerability function. As is shown in Equation (5), when all codes of the target function are included in the vulnerability signature, we calculate the similarity of syntax between the target function and the function in the vulnerability signature through the Jaccard similarity coefficient. If it reaches the threshold (0.5), the target function code is the vulnerability code clone. We have

$$\mathrm{sim}(f, f_v) = \frac{|f \cap f_v|}{|f \cup f_v|}. \tag{5}$$
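Equation (5) can be illustrated with a short sketch that treats each function as a set of normalized code lines. The set-of-lines representation and the names `jaccard` and `is_vulnerable_clone` are assumptions for illustration; only the Jaccard coefficient and the 0.5 threshold come from the text.

```python
def jaccard(f, f_v):
    """Jaccard similarity |f ∩ f_v| / |f ∪ f_v| between two line collections."""
    a, b = set(f), set(f_v)
    if not a and not b:
        return 0.0  # convention chosen here for two empty functions
    return len(a & b) / len(a | b)

def is_vulnerable_clone(target_lines, vuln_lines, threshold=0.5):
    """A target function is flagged when the similarity reaches the threshold."""
    return jaccard(target_lines, vuln_lines) >= threshold

# Hypothetical normalized lines: two of four distinct lines are shared.
target = ["x=read();", "buf[x]=0;", "return0;"]
vuln = ["buf[x]=0;", "return0;", "free(buf);"]
print(jaccard(target, vuln))              # 0.5
print(is_vulnerable_clone(target, vuln))  # True
```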

3.2.2. Binary Vulnerability Function Detection

Given the source vulnerability functions, we search for the binary vulnerability functions in the binary module that reuses open-source components with the binaryAI [36] engine. Under normal circumstances, the source code functions need to be compared with all binary functions of the program. With software component analysis based on binary program modularization, we only need to match the vulnerability function within the specific binary module, which greatly reduces the analysis range of the binary vulnerability function detection task.

The binaryAI engine works mainly by embedding immediate values, strings, symbols, pseudocode, and the control flow graph, and obtaining the matching function through similarity search. Because our goal is to detect vulnerability functions in the binary modules that reuse vulnerable open-source components, we do not use the public function set provided by the binaryAI engine for matching. Before the function search, we use the engine to create the vulnerability function set of the open-source components reused by the binary program, including the function code, the source file path, the function features obtained by the feature extraction library, and other relevant information. We then obtain the binary vulnerability functions that match the source vulnerability function set through vector similarity comparison.
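The effect of restricting the search to one module can be sketched with plain cosine similarity over precomputed embedding vectors. This does not use or reproduce the binaryAI API; the vectors, the function names, and the 0.9 threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def search_module(vuln_vectors, module_vectors, threshold=0.9):
    """Match each source vulnerability function against one module only."""
    matches = {}
    for src_name, src_vec in vuln_vectors.items():
        best = max(
            module_vectors,
            key=lambda name: cosine(src_vec, module_vectors[name]),
            default=None,
        )
        if best is not None and cosine(src_vec, module_vectors[best]) >= threshold:
            matches[src_name] = best
    return matches

# Hypothetical embeddings: one source function, two binary functions in the module.
vuln_vectors = {"src_vuln_func": [1.0, 0.0, 1.0]}
module_vectors = {"sub_401000": [0.9, 0.1, 1.1], "sub_402000": [0.0, 1.0, 0.0]}
print(search_module(vuln_vectors, module_vectors))  # {'src_vuln_func': 'sub_401000'}
```

The number of similarity computations is the number of source vulnerability functions times the number of functions in the module, rather than times the number of functions in the whole binary.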

4. Results and Analysis

In this section, we evaluate the effectiveness of BMVul in open-source component detection. In addition, we carry out vulnerability function detection for the binary modules that reuse open-source components.

4.1. Datasets

We collect two datasets (dataset I and dataset II) for the evaluation of BMVul.

Dataset I. The ground-truth binary programs obtained according to source file analysis and the partial dataset from ISRD [16] and ModX [17] are shown in Table 3. We obtain the source code of components from GitHub.

Dataset II. We collect the top 10 frequently reused components involved in B2SFinder from GitHub to obtain the source code of each component in the past three years. The description of the top 10 components is shown in Table 4 (the order in the table does not represent the ranking of reuse frequency).

4.2. Compared Approaches

We compare BMVul with B2SFinder [2] and Louvain detection [18] (component detection based on the Louvain algorithm). Our prototype BMVul carries out program modularization with DBM, a directed and overlapping binary function clustering technique that integrates modularity and information-theoretic algorithms. Then, we perform feature-based software component analysis for the binary modules. According to the modular software component analysis results, we detect specific vulnerability functions in the binary modules that reuse components. Louvain detection is a Louvain-based component-identification method. It differs from BMVul in the modularization phase: it uses only the modularity algorithm to divide binary modules, without considering the direction of the function call graph or overlapping clustering. B2SFinder is a file-level, feature-based method for comparing binary and source code. It does not perform binary program modularization and only compares the entire binary program with open-source components.

4.3. Evaluation Metrics

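In the experiments, we select precision and F1 score as evaluation metrics. They are defined in Equations (6) and (7):

$$P = \frac{TP}{TP + FP} \tag{6}$$

$$F_1 = \frac{2 \cdot P \cdot R}{P + R}, \tag{7}$$

where TP and FP are the numbers of true positives and false positives, and R is the recall.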
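As a small worked sketch of Equations (6) and (7), the metrics can be computed from raw counts. The recall formula TP / (TP + FN) is the standard definition and is not spelled out in the text; the counts in the example are illustrative and are not taken from the experiments.

```python
def precision(tp, fp):
    """Equation (6): TP / (TP + FP)."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Standard recall, TP / (TP + FN); assumed here since R is used in Equation (7)."""
    return tp / (tp + fn) if tp + fn else 0.0

def f1_score(p, r):
    """Equation (7): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Illustrative counts only: 3 true positives, 1 false positive, 3 false negatives.
p = precision(3, 1)
r = recall(3, 3)
print(p, r, round(f1_score(p, r), 3))  # 0.75 0.5 0.6
```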

4.4. Effectiveness of Component Detection

We evaluate the effectiveness on Dataset I to compare BMVul with B2SFinder and Louvain detection. Table 5 reports the performance of BMVul in terms of efficacy in detecting open-source components.

P represents the precision of identifying components in binary modules, and $P_{1m}$ represents the precision of identifying components in a unique binary module. As can be seen from the table, the precision of B2SFinder is only 47%, and that of Louvain detection is 72.7%. BMVul reaches 75%, a relative improvement of 59.57% over B2SFinder and 3.16% over Louvain detection. B2SFinder produces more false positives, which lowers its precision. Detection based on software modularization reduces the number of false positives. Analyzing the $P_{1m}$ values shows that BMVul improves by about 39.43% over Louvain detection, because BMVul divides software modules with DBM, which, unlike the Louvain algorithm, takes the directed property of the function call graph into account. The Louvain algorithm can only assign a function to a single module and considers neither edge direction nor overlapping modules, which reduces the precision of module-based component detection. Therefore, because some functions may belong to multiple modules at the same time, we apply overlapping detection for function clustering to improve $P_{1m}$ and P. The F1 score of BMVul reaches 56.5%, which outperforms B2SFinder by 8.45%, so BMVul is better than the current file-level component detection.
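The precision figures reported above are consistent with reading the improvements as relative rather than absolute differences. Under that reading (the subscripted names are labels introduced here for the three precision values), the arithmetic is:

$$\frac{P_{\mathrm{BMVul}} - P_{\mathrm{Louvain}}}{P_{\mathrm{Louvain}}} = \frac{75 - 72.7}{72.7} \approx 3.16\%, \qquad \frac{P_{\mathrm{BMVul}} - P_{\mathrm{B2SFinder}}}{P_{\mathrm{B2SFinder}}} = \frac{75 - 47}{47} \approx 59.57\%.$$

In absolute terms, the same results correspond to gains of 2.3 and 28 percentage points, respectively.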

4.5. Evaluation of Matching Unique Module

We express the precision with which a reused component matches a unique module as $TP_{1m}$. The $TP_{1m}$ of BMVul and Louvain detection for each binary program is shown in Figure 3.

As can be seen from the figure, the $TP_{1m}$ values of BMVul are greater than or equal to those of Louvain detection, and in five cases they exceed those of Louvain detection. The ratio between $TP_{1m}$ and $TP$ is denoted $1M_{Ratio}$ and is defined in Equation (8):

$$1M_{Ratio} = \frac{TP_{1m}}{TP}. \tag{8}$$

The $1M_{Ratio}$ comparison is shown in Table 6. As can be seen from the table, the $1M_{Ratio}$ of BMVul reaches 87.5%, an improvement of 31.18% over Louvain detection. BMVul therefore clusters functions well through overlapping detection on directed call graphs. Functions that cooperate to implement the same functionality are clustered into the same module, which improves the accuracy of matching binary modules to reused components.

4.6. Open-Source Vulnerability Function Detection and Analysis of Binary Modules

Open-source components reuse may introduce security vulnerabilities into the binary program. The vulnerability function detection results for the top 10 frequently reused components in dataset II are shown in Figure 4.

As seen from the figure, only Zlib, Libpng, and unrar do not contain vulnerabilities in the last three years (Libtiff: 2015–2017). The number of vulnerability functions is relatively high in FreeType and SQLite, reaching 19 and 11, respectively. In software development, once component code containing a large number of vulnerability functions is reused, the program faces potential security risks.

Therefore, it is necessary to detect vulnerabilities caused by the reuse of vulnerable components. We detect the vulnerability functions of binary programs that reuse vulnerable components on Dataset I. The vulnerability function results are shown in Table 7. The fourth and fifth columns are the source code and binary vulnerability functions detected by BMVul. A “✓” means the vulnerability function is reused in the binary.

Based on the software component analysis, we search for vulnerability functions in the binary modules that reuse components containing vulnerabilities. The results in the table show that we find specific reused vulnerability functions in OpenVPN, Lzbench, and Redis-server, so the analysis is no longer limited to detecting potential vulnerabilities. In addition, compared with detection over the whole binary program file, vulnerability function detection within the module reduces the amount of function matching and avoids comparing binary functions that do not belong to the reused components.

5. Conclusions

In this paper, we propose BMVul, a binary modularization-based open-source vulnerability function detection method. BMVul performs open-source component identification at the binary module level and then detects vulnerability functions in the located binary module. The experimental results show that BMVul outperforms the state-of-the-art methods B2SFinder and Louvain detection in effectiveness and performs well in binary vulnerability function detection. The precision of BMVul outperforms B2SFinder by 59.57%. Moreover, compared with Louvain detection, the precision of matching a unique binary module increases by 39.43%. At present, most detection methods are limited to file granularity, which makes vulnerability function searching time-consuming. In contrast, BMVul finds reused components at module granularity and, with the achieved accuracy, identifies the binary modules that reuse components. In the vulnerability function analysis stage, instead of performing extensive global analysis, we only need to analyze specific binary modules. Binary modularization-based detection thus greatly reduces the search scope of open-source vulnerability functions in binary programs, which is of great significance to software security analysis. However, there may be other properties that could improve the modularization results. Moreover, other features that are more suitable for binary-to-source comparison may yield more accurate results. Although the current technique has not reached the ideal result, we expect it to improve with more in-depth research.

Author Contributions

Conceptualization, X.G.; data curation, R.C. and W.S.; methodology, X.G.; software, X.G. and S.L.; formal analysis, X.Y.; writing—original draft preparation, X.G.; writing—review and editing, X.G. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Foundation Strengthening Key Project of the Science & Technology Commission (2019-JCJQ-ZD-113).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The calculated data presented in this work are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments on our paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

Duan, R.; Bijlani, A.; Xu, M.; Kim, T.; Lee, W. Identifying open-source license violation and 1-day security risk at large scale. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS’17, Dallas, TX, USA, 30 October–3 November 2017.
Yuan, Z.; Xu, J.; Piao, A.; Xue, J.; Huo, W.; Feng, M.; Li, F.; Ban, G.; Xiao, Y.; Wang, S.; et al. B2SFinder: Detecting open-source software reuse in COTS software. In Proceedings of the 2019 34th IEEE/ACM International Conference on Automated Software Engineering, San Diego, CA, USA, 11–15 November 2019.
Ban, G.; Xu, L.; Xiao, Y.; Li, X.; Yuan, Z.; Huo, W. B2SMatcher: Fine-Grained version identification of open-Source software in binary files. Cybersecurity 2021, 4, 21.
Hemel, A.; Kalleberg, K.T.; Vermaas, R.; Dolstra, E. BAT Finding software license violations through binary code clone detection. In Proceedings of the 33rd International Conference on Software Engineering, Honolulu, HI, USA, 21–22 May 2011.
Heartbleed. Available online: https://en.wikipedia.org/wiki/Heartbleed (accessed on 24 November 2022).
OpenSSL. Version 1.0.1, OpenSSL Technical Committee, Canada. Available online: https://www.openssl.org/ (accessed on 24 November 2022).
Libreoffice. Version 4.2.0, The Document Foundation, Germany. Available online: https://www.libreoffice.org/ (accessed on 24 November 2022).
VMware. Version 10.0, VMware, Palo Alto, CA, USA. Available online: https://www.vmware.com/ (accessed on 24 November 2022).
Zhao, Q.; Huang, C.; Dai, L. VULDEFF: Vulnerability detection method based on function fingerprints and code differences. Knowl.-Based Syst. 2021, 260, 1101391.
Bowman, B.; Huang, H.H. VGRAPH: A Robust Vulnerable Code Clone Detection System Using Code Property Triplets. In Proceedings of the 2020 IEEE European Symposium on Security and Privacy, Genoa, Italy, 7–11 September 2020.
Ding, Y.; Suneja, S.; Zheng, Y.; Laredo, J.; Morari, A.; Kaiser, G.; Ray, B. VELVET: A noVel Ensemble Learning approach to automatically locate VulnErable sTatements. In Proceedings of the 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering, Honolulu, HI, USA, 15–18 March 2022.
Cao, S.; Sun, X.; Bo, L.; Wei, Y.; Li, B. Bgnn4vd: Constructing bidirectional graph neural-network for vulnerability detection. Inf. Softw. Technol. 2021, 136, 106576.
Zhang, H.; Qian, Z. Precise and accurate patch presence test for binaries. In Proceedings of the 27th USENIX Security Symposium, Baltimore, MD, USA, 15–17 August 2018.
Duan, R.; Bijlani, A.; Ji, Y.; Alrawi, O.; Xiong, Y.; Ike, M.; Saltaformaggio, B.; Lee, W. Automating Patching of Vulnerable Open-Source Software Versions in Application Binaries. In Proceedings of the 2019 Network and Distributed System Security Symposium (NDSS), San Diego, CA, USA, 24–27 February 2019.
Lancichinetti, A.; Radicchi, F.; Ramasco, J.J.; Fortunato, S. Finding statistically significant communities in networks. PLoS ONE 2011, 6, e18961.
Xu, X.; Zheng, Q.; Yan, Z.; Fan, M.; Jia, A.; Liu, T. Interpretation-enabled software reuse detection based on a multi-level birthmark model. In Proceedings of the 2021 43rd IEEE/ACM International Conference on Software Engineering, ICSE’21, Madrid, Spain, 22–30 May 2021.
Yang, C.; Xu, Z.; Chen, H.; Liu, Y.; Gong, X.; Liu, B. ModX: Binary Level Partially Imported Third-Party Library Detection via Program Modularization and Semantic Matching. In Proceedings of the 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 25–27 May 2022.
Blondel, V.D.; Guillaume, J.L.; Lambiotte, R.; Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008, 10, P10008.
Mohammadi, S.; Izadkhah, H. A new algorithm for software clustering considering the knowledge of dependency between artifacts in the source code. Inf. Softw. Technol. 2019, 105, 252–256.
Zamli, K.Z.; Din, F.; Ramli, N.; Ahmed, B.S. Software Module Clustering Based on the Fuzzy Adaptive Teaching Learning Based Optimization Algorithm. In Proceedings of the Intelligent and Interactive Computing, IIC’18, Turin, Italy, 10–14 September 2018.
Sun, J.; Ling, B. Software Module Clustering Algorithm Using Probability Selection. Wuhan Univ. J. Nat. Sci. 2018, 23, 93–102.
Hatami, E.; Arasteh, B. An efficient and stable method to cluster software modules using ant colony optimization algorithm. Supercomputing 2019, 76, 6786–6808.
Varghese, R.B.G.; Raimond, K.; Lovesum, J. A novel approach for automatic remodularization of software systems using extended ant colony optimization algorithm. Inf. Softw. Technol. 2019, 114, 107–120.
Psarras, C.; Diamantopoulos, T.; Symeonidis, A. A Mechanism for Automatically Summarizing Software Functionality from Source Code. In Proceedings of the IEEE 19th International Conference on Software Quality, Reliability and Security, QRS’19, Sofia, Bulgaria, 22–26 July 2019.
Saied, M.A.; Ouni, A.; Sahraoui, H.; Kula, R.G.; Inoue, K.; Lo, D. Improving reusability of software libraries through usage pattern mining. J. Syst. Softw. 2018, 145, 164–179.
Bhoraskar, R.; Han, S.; Jeon, J.; Azim, T.; Chen, S.; Jung, J.; Nath, S.; Wang, R.; Wetherall, D. Brahmastra: Driving Apps to Test the Security of Third-Party Components. In Proceedings of the 23rd USENIX Security Symposium, San Diego, CA, USA, 20–22 August 2014.
Backes, M.; Bugiel, S.; Derr, E. Reliable third-party library detection in android and its security applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16, Vienna, Austria, 24–28 October 2016.
Alrabaee, S.; Shirani, P.; Wang, L. FOSSIL: A resilient and efficient system for identifying FOSS functions in Malware binaries. ACM Trans. Priv. Secur. 2018, 21, 1–34.
Karande, V.; Caballero, J.; Chandra, S.; Khan, L.; Lin, Z.; Hamlen, K. BCD: Decomposing binary code into components using graph-based clustering. In Proceedings of the 2018 ACM Asia Conference on Computer and Communications Security, ASIA CCS’18, Incheon, Republic of Korea, 4–8 June 2018.
Newman, M.E.J. Fast algorithm for detecting community structure in networks. Phys. Rev. E 2004, 69, 066133.
Lancichinetti, A.; Radicchi, F.; Ramasco, J.J. Statistical significance of communities in networks. Phys. Rev. E 2010, 81, 046110.
Newman, M.E.J. Modularity and community structure in networks. Proc. Natl. Acad. Sci. USA 2006, 103, 8577–8582.
Rosvall, M.; Bergstrom, C.T. Maps of random walks on complex networks reveal community structure. Proc. Natl. Acad. Sci. USA 2008, 105, 1118–1123.
Gregory, S. Finding overlapping communities in networks by label propagation. New J. Phys. 2010, 12, 103018.
Woo, S.; Hong, H.; Choi, E.; Lee, H. Movery: A Precise Approach for Modified Vulnerable Code Clone Discovery from Modified Open-Source Software Components. In Proceedings of the 31st USENIX Security Symposium, Boston, MA, USA, 10–12 August 2022.
BinaryAI: The Neural Search Engine for Binaries. Available online: https://binaryai.readthedocs.io/en/latest/ (accessed on 18 November 2022).

Figure 1. Workflow of BMVul.

Figure 2. Workflow of DBM.

Figure 3. The comparison of $TP_{1m}$.

Figure 4. Frequently reused components vulnerabilities.

Table 1. Applications of software modularization.

Type	Approach	Year	Venue	Method
Information recovery	Mohammadi	2019	Information and Software Technology	neighborhood tree algorithm
Information recovery	ATLBO	2018	IIC	fuzzy adaptive teaching learning based optimization
Information recovery	Sun	2018	Wuhan Univ. J. Nat. Sci.	probability selection
Restructuring	Hatami	2019	Supercomputing	ACO algorithm
Restructuring	Varghese, R. B. G.	2019	Information and Software Technology	ACO algorithm
Component identification	Psarras	2019	QRS	semantic clustering
Component identification	LibCUP	2018	Journal of Systems and Software	a variant of standard clustering

Table 2. Open-source vulnerability detection.

Approach	Venue	Method	Application
Brahmastra	USENIX Security	dynamic drivers	Android
OSSPolice	CCS	layered indexing scheme	Android
LibScout	CCS	resilience to code obfuscation	Android
B2SMatcher	Cybersecurity	rough matching and exact matching	binary files
FOSSIL	ACM Transactions on Privacy and Security	Bayesian networks with greater resilience to code obfuscation	binary files
FIBER	USENIX Security	fine-grained patch presence test	binary files

Table 3. Dataset I used for evaluation.

Binary Name	#Func	Size (KB)	Reused Components
redis-server	4287	10,803	hiredis, lua
libblosc	1005	936	zstd, zlib-ng, lz4
ssldump	1823	1461	libpcap
tcpdump	3788	2884	libcap-ng, libpcap
openvpn	7560	16,124	lzo, openssl
git	13,045	6743	zlib
screen	1146	660	pam, libxcrypt
dpkg-deb	914	498	zlib, bzip, xz
curl	6357	3908	zlib, openssl, libssh2, krb5, keyutils, nghttp2
wget	5536	3945	zlib, openssl, libunistring, libidn, pcre2
rsync	1622	1774	zlib, openssl, xxhash, zstd, acl, attr, popt
precomp	2082	2152	zlib, minizip
lzbench	3256	3008	bzip, lzo, xz, zlib, zstd, brotli, libdeflate, c-blosc
turbobench	3027	3201	zstd, bzip2, lzo, zlib, brotli, libdeflate, zlib-ng

Table 4. Dataset II used for evaluation.

Name	Description
openssl	encryption and decryption
sqlite	database
libsndfile	sound/video processing
libjpeg-turbo, libpng	image processing
freetype, libtiff	font processing
unrar, zlib	compression
expat	parser

Table 5. Comparison of BMVul with other methods in terms of effectiveness.

Method	Precision	$P_{1m}$	F1-Score
BMVul	75.00%	77.80%	56.50%
Louvain detection	72.70%	69.60%	55.80%
B2SFinder	47.00%	-	52.10%
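The improvement figures quoted in the abstract and conclusions appear to be relative improvements rather than percentage-point differences. The short check below recomputes them from the precision and F1 values in Table 5. The helper `relative_improvement` is a hypothetical name introduced only for this illustration.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Values copied from Table 5 (in percent).
precision = {"BMVul": 75.00, "Louvain detection": 72.70, "B2SFinder": 47.00}
f1_score = {"BMVul": 56.50, "Louvain detection": 55.80, "B2SFinder": 52.10}

prec_vs_louvain = relative_improvement(precision["BMVul"], precision["Louvain detection"])
prec_vs_b2sfinder = relative_improvement(precision["BMVul"], precision["B2SFinder"])
f1_vs_b2sfinder = relative_improvement(f1_score["BMVul"], f1_score["B2SFinder"])

print(f"Precision vs. Louvain detection: {prec_vs_louvain:.2f}%")  # 3.16%
print(f"Precision vs. B2SFinder: {prec_vs_b2sfinder:.2f}%")        # 59.57%
print(f"F1 score vs. B2SFinder: {f1_vs_b2sfinder:.2f}%")           # 8.45%
```

Under this reading, the 59.57% figure corresponds to a gap of 28 percentage points in precision between BMVul and B2SFinder.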

Table 6. Evaluation of $1M_{Ratio}$.

Method	BMVul	Louvain-Detection
$1M_{Ratio}$	87.50%	66.70%

Table 7. Open-source vulnerability function detection results.

Binary	Vulnerable Component	CVE	Vulnerable Function	Reuse
openvpn	openssl	CVE-2014-3571	ssl3_read_n	✓
		CVE-2013-0166	ASN1_item_verify	✓
		CVE-2014-8275	ASN1_verify	×
		CVE-2013-4353	ssl3_take_mac	×
			tls_construct_finished	×
		CVE-2015-0288	X509_to_X509_REQ	✓
		CVE-2017-3730	ssl_ctx_make_profiles	✓
		CVE-2014-3513	tls_construct_cke_dhe	×
		CVE-2017-3731	chacha20_poly1305_ctrl	×
		CVE-2017-3733	tls1_change_cipher_state	✓
lzbench	zstd	CVE-2019-11922	ZSTD_encodeSequences	✓
			ZSTD_compressSequences_internal	×
turbobench			ZSTD_encodeSequences_body	×
			ZSTD_encodeSequences_default	×
libblosc			writeSequences	×
redis-server	lua	CVE-2018-11218	mp_encode_lua_table_as_map	✓
			mp_decode_to_lua_hash	✓
			mp_pack	✓
			mp_encode_lua_table_as_array	✓
			mp_decode_to_lua_array	✓
			mp_unpack_full	✓
	hiredis	CVE-2021-21309	sdsMakeRoomFor	×
			sdsnewlen	×
ssldump	libpcap	CVE-2019-15164	Daemon_msg_open_req	×
tcpdump				×
curl	nghttp2	CVE-2020-11080	nghttp2_strerror	×
			nghttp2_session_mem_recv	×
			session_new	×
			nghttp2_session_upgrade_internal	×
		CVE-2018-11743	gc_gray_mark	×
			gc_mark_children	×
			init_copy	×
			mrb_obj_id	×
			obj_free	×
			obj_iv_p	×
		CVE-2017-9527	mark_context_stack	×
		CVE-2018-10199	mrb_io_initialize_copy	×

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
