HDL-ODPRs: A Hybrid Deep Learning Technique Based Optimal Duplication Detection for Pull-Requests in Open-Source Repositories

Alotaibi, Saud S.

doi:10.3390/app122412594

Department of Information Systems, College of Computer and Information Systems, Umm Al-Qura University, Makkah 24382, Saudi Arabia

Appl. Sci. 2022, 12(24), 12594; https://doi.org/10.3390/app122412594

Submission received: 4 October 2022 / Revised: 30 October 2022 / Accepted: 16 November 2022 / Published: 8 December 2022

(This article belongs to the Special Issue Advances in Applied Deep Learning Based Methods and Architectures for Data Analytics)

Abstract

Recently, open-source repositories have grown rapidly due to volunteer contributions worldwide. Collaboration software platforms have gained popularity as thousands of external contributors have contributed to open-source repositories. Although data de-duplication decreases the size of backup workloads, it causes poor data locality (fragmentation) and redundant review time and effort. Deep learning and machine learning techniques have recently been applied to identify complex bugs and duplicate issue reports. It is difficult to use, but it increases the risk of developers submitting duplicate pull requests, resulting in additional maintenance costs. In this work, we propose a hybrid deep learning technique for optimal duplication detection of pull requests (HDL-ODPRs) in open-source repositories. The hybrid leader-based optimization (HLBO) algorithm is used to extract textual data from pull requests, which increases the accuracy of duplicate detection. Following that, we compute the similarities between pull requests by utilizing the multiobjective alpine skiing optimization (MASO) algorithm, which provides textual, file-change, and code-change similarities. For pull-request duplicate detection, a hybrid deep learning technique (named GAN-GS) is introduced, in which the global search (GS) algorithm is used to optimize the design metrics of the generative adversarial network (GAN). The proposed HDL-ODPR model is validated against public standard benchmark datasets, namely DupPR-basic and DupPR-complementary. According to the simulation results, the proposed HDL-ODPR model achieves promising results in comparison with existing state-of-the-art models.

Keywords: duplicate pull requests; deep learning; textual extraction; similarity computation; duplicate detection

1. Introduction

Open-source software (OSS) has been used since the earliest stages of computer and software development [1]. Programmers and developers routinely shared their software for free in the early days of software development. The advent of firms in software development with a profit motive hampered the culture of sharing software source code. In recent years, the software industry has witnessed the development of open-source software [2]. Such software is now used more frequently in library settings. Libraries nowadays must balance the distribution of resources between old and modern technologies, combine established and developing forms, and develop information management policies and procedures [3]. A logical arrangement and quick access to normally enormous amounts of digital data are provided by modern digital libraries, which offer an integrated collection of services for information gathering, classifying, storing, searching, protecting, and retrieval. Contributors frequently utilize pull-request techniques to submit their code changes to reviewers or repository owners on social coding platforms [4] like GitHub. Volunteers from all over the world who are geographically separated from one another work on a repository invisibly. At the core of all OSS repositories are the necessary communication patterns that ensure that everyone is on the same page. The dawn of GitHub as a distributed version control system allowed enterprises to rethink the way they communicate about the state of and updates to large application code bases. Pull requests serve as nervous systems to communicate about changes in these projects. Over time, enterprise developers have adopted a variety of approaches for pull-request workflows to facilitate better communication about code status updates without overwhelming developers. Each approach has its benefits and drawbacks, and developers may have strong opinions about what works best [5,6]. 
Pull requests promote the incorporation and enhancement of contributions in distributed OSS repositories, particularly within open-source communities. Maintainers and reviewers evaluate the submission.

The initial review and collaborative improvement process is fraught with disagreements, opinions, and feelings. It has a significant impact on the community’s mood, demotivates and repels contributors, and fails to protect code quality [7,8], putting the community’s existence at risk. Deep learning and machine learning approaches are employed to explain the mechanics of evaluating pull requests in varied OSS communities from the developers’ and maintainers’ viewpoints. In parallel and distributed development, it is natural for numerous contributors to submit pull requests for similar or identical development tasks [9]. Similar pull requests generally include completely identical or partially copied pull requests, pull requests with common terms in their titles and descriptions, and pull requests that alter files in similar ways [10]. Similar pull requests are prevalent in popular repositories that attract authors from all around the world. Unfortunately, each contributor works alone, and the GitHub infrastructure does not facilitate such collaborative efforts. This results in duplicative development efforts [11,12].

In the pull-based development approach, contributors submit their work to open-source projects through pull requests, which reviewer teams may approve or decline [13,14,15,16,17,18,19,20]. Contributors may revise their code, so the code review process can go through multiple iterations, which takes time for both contributors and reviewers. According to GitHub research on inactive but available assignees, certain projects have a high number of inactive assignees [13]. Developers become inactive assignees because they work for organizations and are automatically assigned to specific projects inside those organizations. A thorough mapping study [14] identifies and analyzes the key studies on OSSECOs published in academic journals. The pull-request contribution behavior of external and internal developers in OSS projects is evaluated through a multicase study [15]. An empirical study [16] examines how and why developers fork on GitHub, and from whom they fork. Developers fork repositories in order to submit pull requests, fix bugs, and add new functionality [16]. They discover repositories to fork from multiple sources, including search engines as well as external websites. An investigation of the metadata and data of OSS projects is conducted in two steps, using questionnaires to gather further evidence [17]. The GHTorrent repository is utilized to obtain a large sample of data on active OSS projects and to calculate their long-term and short-term issue closing rates [18]. Measures of labor centralization, measures of internal project workflows, and measures of developer social networks on issue closure rates are only a few examples of how OSS projects and teams affect the optimization of design metrics. The two-stage hybrid classification (TSHC) algorithm [19] categorizes review comments by combining machine learning and rule-based techniques. In order to identify the factors, such as acceptance and delay, that have an impact on the pull-based development strategy in the context of continuous integration, a quantitative analysis [20] is carried out.

Our contributions include an optimal pull-request duplication detection method. It uses a hybrid deep learning technique (HDL-ODPRs) for OSS repositories. The HDL-ODPRs method has the following key contributions:

A hybrid leader-based optimization (HLBO) algorithm is used to extract textual data from pull requests, which ensures the detection accuracy.
A multi-objective alpine skiing optimization (MASO) algorithm is used to compute the similarities between pull requests.
A generative adversarial network (GAN) is merged with the global search (GS) algorithm to compute the duplicate contributions in pull requests.
Finally, our proposed model is validated against the public standard benchmark datasets such as DupPR-basic and DupPR-complementary data [21].

The remainder of this work is structured as follows. Section 2 describes recent literature on the detection of pull-request duplication in OSS. Section 3 discusses the problem methodology as well as the system design of our proposed HDL-ODPR model. Section 4 discusses the step-by-step approach of our HDL-ODPR model. Section 5 reports on the simulation results and findings of pull-request duplication-detection algorithms. Section 6 concludes the paper.

2. Related Work

To facilitate comprehension of our research, we provide a brief summary of background information and relevant studies. The research gap related to the pull requests duplication detection in OSS is given in Table 1.

Cheng et al. [22] proposed an approach to recommend technical and managerial reviewers for GitHub pull requests. Target pull requests are evaluated by utilizing a hybrid recommendation method. They conducted trials on two major GitHub projects and evaluated the strategy with pull requests. Their recommendation model performed better and differentiated participation types well.

Hu et al. [23] investigated the effects of multi-reviewing on the resolution of pull requests in GitHub open-source projects. They considered how many pull requests developers choose to review, and examined how the results vary with that number. Their simulation findings demonstrated that increased review switching may cause a pull request’s resolution time to increase.

Zhang et al. [24] proposed combining TF-IDF with deep learning techniques (word and document embedding). TF-IDF takes into account term frequency characteristics over the whole corpus of issue documents. Word embedding and document embedding measure the deep semantic relations among words and among documents, respectively. Then, depending on the total score created by combining the three similarity ratings, they suggest connected topics.

Yang et al. [25] proposed the Repo Like technique, which recommends repositories to developers on the basis of a learning-to-rank methodology and linear combination. The personalized suggestion process was carried out by using the learning-to-rank-based methodology. The findings demonstrate that when offering 20 candidates, the Repo Like technique has a hit ratio over 25%, demonstrating that it may suggest repositories closely similar to those of social developers.

The CROSSSIM framework is proposed by Nguyen et al. [26] to calculate OSS project similarities. They employed the graph model to account for the entire developer community, as well as OSS projects, libraries, and various artefacts, as well as their mutual interactions. SimRank is utilized in the CROSSSIM design to compute graph similarity.

Yang et al. [27] created a model to define the features of many factors relating to software development concerning developers. Many OSS platforms are then analyzed for the purpose of building the portrait by using a process that makes use of code analysis, web data analysis, and text analysis approaches. Created feature images are clearly exhibited online to aid in the quick understanding of developers and the improvement of decision-making during collaborative software development. The developer images were applied to two issues, such as programming task and code recommendation assignment, using a prototype process that displayed the created feature images.

To forecast which pull requests on GitHub will be approved, Jing et al. [28] presented the CTCPPre method. In addition to the code characteristics of the submitted changes, CTCPPre considers contributor features of previous developer actions, project aspects of the development environment, and the text features of pull-request descriptions. The efficacy of CTCPPre was tested on 28 projects with a total of 221,096 pull requests.

Machine learning methods were applied to software development data from GitHub by Eluri et al. [29] in order to predict whether a contributor will become an LTC of a project. The study demonstrates that random forest obtains an AUC better than 0.90. They were able to do this by employing 50% fewer features and by predicting LTC immediately after a new contributor joined, rather than waiting a month.

To forecast accepted pull requests, Jiang et al. [30] suggested XGPredict, which is based on the XGBoost classifier. XGPredict calculates acceptance and rejection probabilities for pull requests, assisting integrators in making choices during code review. The XGPredict technique was used to forecast the success of pull requests, enhancing the code review procedure.

Golzadeh et al. [31] offered a ground-truth dataset based on a manual examination of pull-request and issue comments in 5000 different GitHub accounts, 527 of which were found to be bots. The number of empty and nonempty comments from each account is utilized as the primary characteristic in an automated classification technique used to identify bots. They attained a weighted average precision, recall, and F1 score of 0.98 on a test set with 40% of the data.

3. Problem Methodology and System Design

3.1. Problem Methodology

Li et al. [21] presented a mechanism that uses text information to automatically detect duplicate pull requests on GitHub. They assess how similar the text of a new pull request is to that of the pull requests already in existence before returning a candidate list of the most comparable ones. This model was tested on three well-known GitHub projects: Rails, Elasticsearch, and AngularJS. The results indicated that there is a 55.3% to 71.0% chance of finding duplicates based on the combination of title and description similarities. The authors have also devised a mechanism for automatically detecting duplicate contributions in the pull-based paradigm [32] that compares textual and change similarity. They begin by calculating the similarity between a new contribution and previously submitted contributions. Finally, this method returns a list of the existing contributions that are most similar to the new one as possible duplicates, by comparing the change and textual similarities. When textual and change similarity are used together, 83.4% of the duplicates are found on average, compared to 54.8% when textual similarity is used alone and 78.2% when change similarity is used alone, according to simulation data. Duplicate pull requests were further examined in [33], which studied the situations that lead to multiple pull requests and how the integrator preferentially handles them. The sustainability of the various OSS initiatives depends on the continuance of their development communities. Communities can keep developers from leaving and get them to come back if they know how and why developers stop working or take breaks. Duplicate efforts have generally been explored in the literature, including duplicate queries, duplicate pull requests, and duplicate bug reports. Studies have mostly concentrated on outlining the dangers in duplicates and suggesting ways to find and get rid of them. The indiscriminate use of pull requests in OSS repositories can result in a variety of problems, including duplicated, verbose, and incomplete descriptions, which impede the project’s growth and upkeep.

However, it is unclear whether the pull request template has been widely used on GitHub and what effects it might have. Such situations can reduce the effectiveness of cooperative development. Duplicate pull requests result in wasted time and effort, both during the review and update processes that follow submission as well as during the contributors’ initial work before submission. Consequently, it is important to help the project team and keep core developers and their knowledge from leaving. However, not much is known about the stages developers go through when they stop working on a project or how they move from one stage to the next. Integrators have the discretion to add new code and accept pull requests in the pull-based development approach. Pull requests should ideally be allocated to integrators and examined as soon as possible after being submitted. However, integrators often have trouble getting through pull requests in a timely manner because there are so many of them in popular projects. Additionally, the analysis of such existing works [21,32,33] takes a lot of time, which results in a rising number of pull requests. Therefore, the automatic duplicate detection method is needed to save time and effort in the review process. To solve those problems, the optimal duplicate pull requests detection method is proposed, using hybrid deep learning (HDL-ODPRs). The key objectives of HDL-ODPR method include the following:

To introduce an optimal duplicate pull-request detection method that automatically detects duplicates and improves accuracy.
To propose an optimization algorithm to extract textual data from pull requests in OSS repositories.
To develop hybrid deep learning techniques for the similarity computation (textual, file-change, and code-change similarities) to further enhance the duplicate detection accuracy.

3.2. System Design

Figure 1 depicts the general design of our proposed approach. Contributors need to update the main repository with new features or with fixes for existing bugs. Instead of changing the original repository directly, contributors first make a copy of it and set up branches. This helps keep changes organized and separate from the original source. The contributors commit their changes and then test them safely in their own branch. As part of the changes or commits, files may be created, changed, renamed, moved, or deleted. When the modifications are ready, contributors open pull requests for all of their changes so that other team members can examine and discuss them. Because pull requests (PRs) allow contributors to share their changes and extra information with other team members, the changes are examined and discussed before they are accepted, given feedback, or rejected. When a reviewer provides feedback, the contributor addresses it and modifies the pull request. The review procedure is then repeated in order to go over the updates. Once approved, the changes are merged into the primary repository.

4. Proposed Methodology

We outline the workings of our proposed HDL-ODPR approach in this section. We begin with the extraction of textual data from pull requests. The next subsections describe the steps of the strategy in more depth.

4.1. Textual Data Extraction from Pull Requests

This section discusses how pull-request data are extracted and handled. We extract the title, description, and modified files for each pull request. A title, description, and list of the files that have been modified are required for each pull request in a GitHub project. These data implicitly carry relevant information about the context and goals of pull requests. To specifically identify and describe the buggy code, the description could include a snippet of code. Each pull request also comes with a list of the changed files in which a developer made a change or added something. We believe that there is a significant likelihood that two or more pull requests that contain the same file paths are connected and working on the same functionality or problem. The textual data gathered from each pull request are processed by using the hybrid leader-based optimization (HLBO) technique. The processing steps include stop word removal, tokenization, stemming, and the elimination of punctuation and special characters. In population-based algorithms, relying too much on certain members when updating the population can cause the algorithm to converge quickly on a local best solution, making it hard for the algorithm to find the main best area in the search space and to explore new regions. In the HLBO algorithm, each member of the algorithm population in the search space is kept up to date and led by a unique hybrid leader. This hybrid leader is chosen based on three different members: the best member, a random member, and the corresponding member itself. HLBO is like other population-based algorithms that can be modelled mathematically with a matrix as follows:

Y = \left[ \begin{array}{c} Y_{1} \\ \vdots \\ Y_{j} \\ \vdots \\ Y_{n} \end{array} \right]_{n \times M} = \left[ \begin{array}{ccccc} y_{1,1} & \cdots & y_{1,i} & \cdots & y_{1,M} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ y_{j,1} & \cdots & y_{j,i} & \cdots & y_{j,M} \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ y_{n,1} & \cdots & y_{n,i} & \cdots & y_{n,M} \end{array} \right]_{n \times M},

(1)

where $y_{j,i}$ is the value of the ith variable determined by the jth candidate solution. In addition, M is the number of problem variables, n is the size of the HLBO population, $Y_{j}$ is the jth candidate solution, and Y is the HLBO population. Taking into account the limitations of the problem variables, each member of the population Y starts out in a random place,

y_{j,i} = la_{i} + R \cdot (ua_{i} - la_{i}), \quad j = 1, 2, \dots, n, \quad i = 1, 2, \dots, M,

(2)

where R is a random real number drawn from the range [0, 1], $ua_{i}$ is the upper bound, and $la_{i}$ is the lower bound of the ith problem variable. Each member of the population Y is a vector that represents one candidate solution.
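The random initialization in Eq. (2) and the evaluation of the objective function for every candidate solution can be illustrated with a minimal sketch. It assumes that the bounds are indexed by problem variable, uses NumPy, and takes a sum-of-squares objective purely as a placeholder; the function names `init_population` and `evaluate` are hypothetical and do not come from the paper.

```python
import numpy as np


def init_population(lower, upper, n, rng):
    """Eq. (2): y[j, i] = la[i] + R * (ua[i] - la[i]), with R drawn from [0, 1)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # One independent random number per member j and per variable i.
    R = rng.random((n, lower.size))
    return lower + R * (upper - lower)


def evaluate(population, objective):
    """Objective value of each candidate solution (one value per row)."""
    return np.array([objective(row) for row in population])


# Illustrative setup: n = 6 members, M = 3 problem variables.
rng = np.random.default_rng(0)
lower_bounds = [-5.0, 0.0, 10.0]
upper_bounds = [5.0, 1.0, 20.0]
Y = init_population(lower_bounds, upper_bounds, n=6, rng=rng)

# Placeholder objective (assumption): sum of squares of the variables.
f = evaluate(Y, lambda y: float(np.sum(y ** 2)))
```

Each row of `Y` is one candidate solution, and `f` holds one objective value per row, which is the form used in the remainder of this subsection.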

Based on these candidate solutions, the objective function of the problem is evaluated,

f = \left[ \begin{array}{c} f_{1} \\ \vdots \\ f_{j} \\ \vdots \\ f_{n} \end{array} \right]_{n \times 1} = \left[ \begin{array}{c} f(Y_{1}) \\ \vdots \\ f(Y_{j}) \\ \vdots \\ f(Y_{n}) \end{array} \right]_{n \times 1},

(3)

where $f_{j}$ represents the objective function value obtained from the jth candidate solution and f represents the objective function vector. The candidate solutions’ quality is gauged by the values found for the function. The best member ($Y_{best}$) is the one which gives the biggest value, and the worst member ($Y_{worst}$) is the one which gives the lowest value. The ability of each of the three members to provide a bigger value determines how much it contributes to the creation of the hybrid leader. The following formula is used to calculate the quality of each population member’s candidate solution:

p_{j} = \frac{f_{j} - f_{worst}}{\sum_{i = 1}^{n} (f_{i} - f_{worst})}, \quad j \in \{1, 2, \dots, n\}.

(4)

The equations below determine each member’s participation coefficients:

QC_{j} = \frac{p_{j}}{p_{j} + p_{best} + p_{K}},

(5)

QC_{best} = \frac{p_{best}}{p_{j} + p_{best} + p_{K}},

(6)

QC_{K} = \frac{p_{K}}{p_{j} + p_{best} + p_{K}},

(7)

where $QC_{j}$, $QC_{best}$, and $QC_{K}$ are the participation coefficients of the jth member, the best member, and the Kth member, respectively, during hybrid leader formation (K is an integer determined randomly from the set {1, 2, …, n}). In addition, j, K ∈ {1, 2, …, n}, K ≠ j, $p_{j}$ is the quality of the jth candidate solution, and $f_{worst}$ is the value of the objective function of the worst candidate solution. For each member of the population, the hybrid leader is constructed after the participation coefficients have been determined. We have

Gl_{j} = QC_{j} \cdot Y_{j} + QC_{best} \cdot Y_{best} + QC_{K} \cdot Y_{K},

(8)

where $Y_{K}$ is a randomly chosen population member, whose row number in the population matrix is represented by the index K, and $Gl_{j}$ is the jth member’s hybrid leader. The HLBO algorithm guides the calculation of each population member’s new position within the search space. If the objective function value at the new position improves on that of the prior position, the corresponding member accepts the new location; otherwise, it remains in the previous place. We have

y_{j,i}^{new,Q1} = \begin{cases} y_{j,i} + R \cdot (Gl_{j,i} + I \cdot y_{j,i}), & f_{Gl_{j}} < f_{j}; \\ y_{j,i} + R \cdot (y_{j,i} - Gl_{j,i}), & \text{else}, \end{cases}

(9)

Y_{j} = \begin{cases} Y_{j}^{new,Q1}, & f_{j}^{new,Q1} < f_{j}; \\ Y_{j}, & \text{else}, \end{cases}

(10)

where $Y_{j}^{new,Q1}$ denotes the jth member’s new position obtained in the first phase of HLBO, $y_{j,i}^{new,Q1}$ is its ith dimension, and $f_{j}^{new,Q1}$ is its objective function value. In addition, R is a random real number chosen from the range [0, 1], $f_{Gl_{j}}$ represents the objective function value produced from the jth member’s hybrid leader, and I is an integer selected randomly from the set {1, 2}. Each member of the population has a neighborhood that is taken into account by the HLBO algorithm. Each member can switch positions in this neighborhood by doing local searches to find a place with a better value for the objective function. This local search is meant to improve the exploitation ability of the HLBO algorithm. We have

y_{j,i}^{new,Q2} = y_{j,i} + (1 - 2r) \cdot R \cdot \left( 1 - \frac{s}{S} \right) \cdot y_{j,i},

(11)

Y_{j} = \begin{cases} Y_{j}^{new,Q2}, & f_{j}^{new,Q2} < f_{j}; \\ Y_{j}, & \text{else}, \end{cases}

(12)

where, in the second phase of HLBO, R is a constant equal to 0.2, r is a random real number from the range [0, 1], S is the maximum number of iterations, s is the iteration counter, $Y_{j}^{new,Q2}$ is the new position of the jth member, $f_{j}^{new,Q2}$ is its objective function value, and $y_{j,i}^{new,Q2}$ is its ith dimension. Algorithm 1 depicts the working process of textual data extraction from pull requests using HLBO.

Algorithm 1. Extraction of Textual Information Using HLBO
Input: $la_{i}$ and $ua_{i}$
Output: $Y_{j}^{new,Q2}$
1.	Initialize the random population
2.	Adjust n and s.
3.	Set constraints of the problem variables: $y_{j,i} = la_{i} + R \cdot (ua_{i} - la_{i})$
4.	If j = 0 and i = 1
5.	Compute the candidate solution quality: $p_{j} = \frac{f_{j} - f_{worst}}{\sum_{i = 1}^{n} (f_{i} - f_{worst})}$
6.	Define the hybrid leader: $Gl_{j} = QC_{j} \cdot Y_{j} + QC_{best} \cdot Y_{best} + QC_{K} \cdot Y_{K}$
7.	Update the position using $y_{j,i}^{new,Q1} = \begin{cases} y_{j,i} + R \cdot (Gl_{j,i} + I \cdot y_{j,i}), & f_{Gl_{j}} < f_{j}; \\ y_{j,i} + R \cdot (y_{j,i} - Gl_{j,i}), & \text{else} \end{cases}$
8.	Update the final values
9.	End if
10.	End
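The construction of the hybrid leader in Eqs. (4)–(8) can be sketched in a few lines. This is an illustration under stated assumptions rather than the author's implementation: the objective is treated as a minimization problem (matching the "<" acceptance tests in Eqs. (10) and (12)), so the worst member is the one with the largest objective value; the function names `quality` and `hybrid_leader` are hypothetical; and equal quality is assigned to all members when every objective value is identical, a case the text does not cover.

```python
import numpy as np


def quality(f):
    """Eq. (4): p_j = (f_j - f_worst) / sum_i (f_i - f_worst)."""
    f = np.asarray(f, dtype=float)
    f_worst = f.max()  # assumption: minimization, so the worst value is the largest
    diff = f - f_worst
    total = diff.sum()
    if total == 0.0:
        # Assumption: all members are equally good, so share quality equally.
        return np.full(f.shape, 1.0 / f.size)
    return diff / total


def hybrid_leader(Y, f, j, K):
    """Eqs. (5)-(8): blend member j, the best member, and a random member K."""
    p = quality(f)
    best = int(np.argmin(f))  # assumption: minimization
    denom = p[j] + p[best] + p[K]
    qc_j, qc_best, qc_K = p[j] / denom, p[best] / denom, p[K] / denom
    leader = qc_j * Y[j] + qc_best * Y[best] + qc_K * Y[K]
    return leader, (qc_j, qc_best, qc_K)


# Toy population of n = 4 members with M = 2 variables and a placeholder objective.
Y = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [4.0, 4.0]])
f = (Y ** 2).sum(axis=1)
leader, coefficients = hybrid_leader(Y, f, j=2, K=1)
```

Because the three participation coefficients are non-negative and sum to one, the hybrid leader is a weighted average of the three members involved, with better members receiving larger weights.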

4.2. Compute Similarities

When describing the same subject, different people may use different words and expressions. This is especially true in the context of collaborative software development, which involves programmers from all over the world. Consequently, it is not possible to determine whether two pull requests are identical by utilizing solely natural language content. The knowledge of changes may be more useful in certain circumstances. It makes sense that developers would change comparable or identical files to complete the same objective, such as fixing a bug or adding a feature. As a result, in addition to text similarity, we also examine change similarity, which includes similarity in file changes and code changes. In this section, we compute pull-request similarities using a multiobjective alpine skiing optimization (MASO) algorithm. The behavior of skiers fighting for the championship in competitive tournaments serves as the primary source of motivation for ASO. The trajectory optimization task is to find, among all acceptable trajectories, the one with the best value of the performance measure. We employ the MASO algorithm to further improve the ASO’s fitness. The system performance measure definitions are used to initialize the MASO algorithm,

I = g(y(s_{f}), s_{f}) + \int_{s_{0}}^{s_{f}} h(y(s), u(s), s) \, ds,

(13)

where the initial and final times are $s_{0}$ and $s_{f}$, respectively, and g and h are scalar functions. Consider a skier descending a slope inclined at an angle $\alpha$. We assume that every turn is fully carved. Applying Newton’s second law in the directions $\xi_{1}$ and $\xi_{2}$ gives

\begin{cases} M \ddot{\xi}_{1} = M h \sin \alpha - f_{fd} \frac{\dot{\xi}_{1}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} - f_{R} \frac{\dot{\xi}_{2}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} \operatorname{sgn} \dot{\varphi}, \\ M \ddot{\xi}_{2} = - f_{fd} \frac{\dot{\xi}_{2}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} + f_{R} \frac{\dot{\xi}_{1}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} \operatorname{sgn} \dot{\varphi}, \end{cases}

(14)

f_{fd} = f_{f} + f_{d},

(15)

where $f_{fd}$ is the sum of the friction, $f_{f}$, and the air resistance (drag), $f_{d}$, which are expressed as follows:

f_{f} = \mu M h \cos \alpha,

(16)

f_{d} = K_{1} v^{2} = K_{1} (\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}) = M K (\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}),

(17)

and

f_{R} = M (\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}) |K| + M h \sin \alpha \frac{\dot{\xi}_{2}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} \operatorname{sgn} \dot{\varphi}.

(18)

The snow’s perpendicular force on the skis has a component in the $\xi_{1}$–$\xi_{2}$ plane. Dividing both sides of the equations by M gives

\begin{cases} \ddot{\xi}_{1} = h \sin \alpha - f_{fd1} \frac{\dot{\xi}_{1}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} - f_{R1} \frac{\dot{\xi}_{2}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} \operatorname{sgn} \dot{\varphi}, \\ \ddot{\xi}_{2} = - f_{fd1} \frac{\dot{\xi}_{2}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} + f_{R1} \frac{\dot{\xi}_{1}}{\sqrt{\dot{\xi}_{1}^{2} + \dot{\xi}_{2}^{2}}} \operatorname{sgn} \dot{\varphi}, \end{cases}

(19)

where

f_{fd1} = \frac{f_{fd}}{M}, \quad f_{f1} = \frac{f_{f}}{M},

(20)

f_{d1} = \frac{f_{d}}{M}, \quad f_{R1} = \frac{f_{R}}{M}.

(21)

If we introduce

y = (y_{1}, y_{2}, y_{3}, y_{4})^{\mathsf{T}} = (\xi_{1}, \xi_{2}, \dot{\xi}_{1}, \dot{\xi}_{2})^{\mathsf{T}}

(22)

as the vector of the system state variables (or just the states) at time s, and

u = (u_{1}) = (|K|)

(23)

as the system control input vector at time s, the system is defined as follows:

\dot{y}(s) = b(y(s), u(s)),

(24)

b = \left( \begin{array}{c} y_{3} \\ y_{4} \\ h \sin \alpha - y_{3} f_{fd}(y_{3}, y_{4}) - y_{4} f_{R}(y_{3}, y_{4}, u_{1}) \\ - y_{4} f_{fd}(y_{3}, y_{4}) + y_{3} f_{R}(y_{3}, y_{4}, u_{1}) \end{array} \right),

(25)

with

f_{fd} = \frac{\mu h \cos \alpha + K (y_{3}^{2} + y_{4}^{2})}{\sqrt{y_{3}^{2} + y_{4}^{2}}},

(26)

f_{R} = \sqrt{y_{3}^{2} + y_{4}^{2}} \, u_{1} + h \sin \alpha \frac{y_{4}}{y_{3}^{2} + y_{4}^{2}} \operatorname{sgn} \dot{\varphi}.

(27)
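To make the state equation concrete, the following sketch integrates Eqs. (24)–(27) with a plain forward Euler scheme. It is only an illustration under several assumptions that are not taken from the paper: h is treated as the gravitational acceleration (9.81 m/s²), the values of μ, K, the slope angle, and the step size are hypothetical, the last term of the fourth component of b is read with a plus sign as in Eq. (19), the argument `turn_sign` stands in for sgn φ̇, and the skier starts with a small nonzero speed because Eqs. (26) and (27) divide by the speed.

```python
import math


def state_derivative(y, u1, alpha, h=9.81, mu=0.05, K=0.003, turn_sign=1.0):
    """Right-hand side b(y, u) of Eq. (24), using Eqs. (25)-(27)."""
    _, _, y3, y4 = y
    speed_sq = y3 * y3 + y4 * y4
    speed = math.sqrt(speed_sq)
    # Eq. (26): friction plus drag, already divided by the speed.
    f_fd = (mu * h * math.cos(alpha) + K * speed_sq) / speed
    # Eq. (27): lateral reaction term; turn_sign plays the role of sgn(phi_dot).
    f_R = speed * u1 + h * math.sin(alpha) * y4 / speed_sq * turn_sign
    return (
        y3,
        y4,
        h * math.sin(alpha) - y3 * f_fd - y4 * f_R,
        -y4 * f_fd + y3 * f_R,
    )


def simulate(y0, u1, alpha, dt, steps, **params):
    """Forward Euler integration of the state equation with a constant control u1."""
    y = tuple(y0)
    for _ in range(steps):
        dy = state_derivative(y, u1, alpha, **params)
        y = tuple(a + dt * b for a, b in zip(y, dy))
    return y


# Straight run down the fall line (u1 = 0) on a hypothetical 20-degree slope.
alpha = math.radians(20.0)
final_state = simulate((0.0, 0.0, 0.1, 0.0), u1=0.0, alpha=alpha, dt=0.01, steps=20000)

# For a straight run the speed should settle where gravity balances friction and drag.
terminal_speed = math.sqrt(
    (9.81 * math.sin(alpha) - 0.05 * 9.81 * math.cos(alpha)) / 0.003
)
```

In this straight-run case the lateral velocity stays at zero and the downhill speed approaches `terminal_speed`, which gives a simple sanity check of the transcribed equations before any trajectory optimization is attempted.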

Let $s_{0}$ denote the moment the skier leaves the starting gate and $s_{f}$ the moment the skier crosses the finish line. In terms of the new state variables, the skier begins at the start gate at point $T(y_{1T}, y_{2T})$, $T \in l_{TH}$,

y_{1}(s_{0}) = y_{1T}, \quad y_{2}(s_{0}) = y_{2T}.

(28)

Assuming that the starting speed is zero, we have

y_{3}(s_{0}) = v_{01T} = 0, \quad y_{4}(s_{0}) = v_{02T} = 0.

(29)

At the finish line, the skier’s position is $f(y_{1f}, y_{2f})$, $f \in l_{fH}$, which is the only boundary requirement,

y_{1}(s_{f}) = y_{1f}, \quad y_{2}(s_{f}) = y_{2f}.

(30)

Note that the start and finish gate lines are $l_{TH}$ and $l_{fH}$, respectively. Algorithm 2 depicts the working process of similarity computation using MASO.

Algorithm 2. Alpine Skiing Optimization
Input: $s_{0}$ and $s_{f}$ time values
Output: $l_{T H}$ and $l_{f H}$
1.	Begin
2.	For all mesh nodes in the 0-th row
3.	For all j in [0..C) Do
4.	For all mesh nodes in the 1-st row
5.	End for
6.	Define iteration function $I = g (y (s_{f}), s_{f}) + \int_{s_{0}}^{s_{f}} h (y (s), u (s), s) \, d s$
7.	Set $f_{d} = K_{1} v^{2} = K_{1} ({\dot{ξ}}_{1}^{2} + {\dot{ξ}}_{2}^{2}) = M K ({\dot{ξ}}_{1}^{2} + {\dot{ξ}}_{2}^{2})$
8.	Define new state variables $y_{1} (s_{0}) = y_{1 T}, y_{2} (s_{0}) = y_{2 T}$
9.	Compute skier’s position $y_{1} (s_{f}) = y_{1 f}, y_{2} (s_{f}) = y_{2 f}$
10.	Update the final values
11.	End
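Step 6 of Algorithm 2 defines a function I consisting of a terminal term g and an integral of a running term h along the trajectory. A minimal sketch of how such a quantity could be approximated is given below. It assumes forward-Euler integration of ẏ = b(y, u) and a left Riemann sum for the integral; the names `performance_index`, `rhs`, `control`, `terminal_cost`, and `running_cost` are hypothetical and do not come from this work.

```python
def performance_index(rhs, y0, control, s0, sf,
                      terminal_cost, running_cost, steps=1000):
    """Approximate I = g(y(sf), sf) + integral of h(y(s), u(s), s) ds.

    rhs(y, u)             -- returns dy/ds as a sequence (plays the role of b)
    control(s)            -- returns the control input u at time s
    terminal_cost(y, s)   -- plays the role of g
    running_cost(y, u, s) -- plays the role of h
    Forward Euler with a fixed step is an assumption made for brevity.
    """
    ds = (sf - s0) / steps
    y = list(y0)
    s = s0
    integral = 0.0
    for _ in range(steps):
        u = control(s)
        integral += running_cost(y, u, s) * ds  # left Riemann sum
        dy = rhs(y, u)
        y = [yi + ds * dyi for yi, dyi in zip(y, dy)]  # Euler step
        s += ds
    return terminal_cost(y, sf) + integral, y
```

With a zero terminal term and a running term equal to 1, the returned value is simply the elapsed time s_f − s_0, which is the natural choice if the descent time is the quantity to be minimized (an assumption, since the text does not fix g and h).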

4.3. Pull Requests’ Duplicate Detection

Duplicate pull requests go through the standard evaluation procedure until the reviewers recognize their duplicate relation. According to earlier studies, duplicate pull requests are common, and they reduce the efficiency of the review process. After computing the combined similarities between the new pull request and the historical pull requests, we rank the historical pull requests by combined similarity. To determine whether a new pull request duplicates a previously submitted one, we suggest the top k items in the ranked list as candidate duplicates. Here, we introduce a hybrid deep learning technique, GAN-GS, for pull-request duplicate detection. A GAN consists of two main parts. The generator approximates the distribution of genuine data in order to create fake samples that can confuse the other network, whereas the discriminator determines whether a sample was drawn from the fake or the real data. If the generator reproduces the real data distribution exactly, it becomes difficult to tell whether a sample is fake or real. GANs are trained by repeatedly solving a two-player minimax game. The following loss function trains the generator and discriminator,

\min_{H} \max_{d} v (H, d) = \mathbb{E}_{y \sim q_{data} (y)} [\log d (y)] + \mathbb{E}_{z \sim q_{z} (z)} [\log (1 - d (H (z)))],

(31)

where z is a latent vector drawn from a prior distribution $q_{z} (z)$, which is normal or uniform, and given to generator H to produce a data point H(z). The main objective function is used to optimize the hyperparameters that set the relative weighting of the margin loss in the enhanced adversarial loss compared to the baseline adversarial loss. The margin loss is applied to avoid mode collapse and training instability as follows:

v_{k} = \mathrm{Capsuled} (y_{k})

(32)

l_{m} = \sum_{k = 1}^{K} S_{k} \max {(0, m^{+} - \| v_{k} \|)}^{2} + λ (1 - S_{k}) \max {(0, \| v_{k} \| - m^{-})}^{2},

(33)

where $y_{k}$ represents the input to the model in our framework and $v_{k}$ represents the final layer’s output vector. Here, H tries to provide examples that most closely fuse the features of domain X, and $d_{1}$ and $d_{2}$ try to distinguish between translated and original images X. The generator has two fully connected layers and deconvolution layers, and batch normalization was used to improve learning efficiency in all layers except the output layer. Except for the last output layer, we used the global search (GS) method as an activation function for the other layers. In the final output layer, we used the hyperbolic tangent function to match the output value to the pixel range of the final image. The distance function below measures how far apart two points are; its most general form is the $l_{q}$ distance,

| | y - x | |_{q} = {(\sum_{j = 1}^{N} | y_{j} - x_{j} |^{q})}^{1 / q} .

(34)
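The distance in Equation (34) can be computed directly for finite q ≥ 1. The sketch below also covers the two limiting cases that are usually handled by convention rather than by the formula itself: q = 0 (count of non-zero differences) and q = ∞ (largest absolute difference). The function name `lq_distance` is hypothetical.

```python
import math


def lq_distance(y, x, q):
    """l_q distance between two equal-length, non-empty sequences.

    q = 0        -- number of components that differ (by convention)
    q = math.inf -- largest absolute component difference
    otherwise    -- (sum |y_j - x_j|^q)^(1/q), as in Eq. (34)
    """
    if len(y) != len(x) or len(y) == 0:
        raise ValueError("y and x must be non-empty and of equal length")
    diffs = [abs(a - b) for a, b in zip(y, x)]
    if q == 0:
        return float(sum(1 for d in diffs if d != 0))
    if math.isinf(q):
        return float(max(diffs))
    return sum(d ** q for d in diffs) ** (1.0 / q)
```

For example, for y = (1, 2, 3) and x = (1, 0, 7) the component differences are (0, 2, 4), so two components differ, the Euclidean distance is √20, and the largest change is 4.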

The $l_{q}$ distance is frequently employed with q equal to 0, 2, and ∞: $l_{0}$ counts the number of non-zero components, that is, the number of updated data points; $l_{2}$ is the Euclidean distance between the two points y and x, which equals the root mean squared error up to a factor of $\sqrt{N}$; and $l_{∞}$ is the biggest change over all data points. Routing is the process of determining the best route for sending traffic flows between source and destination nodes. Prior to formulating the routing problem, we first present the network utility function of the flow set $F^{(K)}$:

U_{K} (T) = \sum_{f \in F^{(K)}} \sum_{p \in P^{(K)}} y_{p f} (T) [α \cdot G_{p} (T) - β \cdot T_{p} (T)]

(35)

Here, $T_{p} (T)$ denotes path p’s end-to-end latency over the time slot, $G_{p} (T)$ is the throughput of path p over the same time slot, and α and β are constant coefficients determined by the traffic flow needs. This utility function simultaneously increases the overall throughput and decreases the end-to-end latency of the flows. We have

\max \sum_{T = 0}^{T} \sum_{K = 1}^{K} U_{K} (T) \quad \text{s.t.} \; (1), (2), (3) .

(36)

Because the images were normalized, we employed MSE, which ranges between 0 and 1. MSE, the squared $l_{2}$ distance divided by N, serves as the basis of the fitness function of the genetic algorithm. We define the actual fitness function by

\mathrm{MSE} = \frac{1}{N} \sum_{j = 1}^{N} {(y_{j} - x_{j})}^{2}

(37)

\mathrm{fitness} = 1 - \mathrm{MSE} .

(38)
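Equations (37) and (38) translate into a few lines of code. The sketch below assumes that both inputs are already normalized to the range [0, 1], as stated above, so that the MSE and therefore the fitness also lie in [0, 1]; the function names are hypothetical.

```python
def mse(y, x):
    """Mean squared error between two equal-length sequences, Eq. (37)."""
    if len(y) != len(x) or len(y) == 0:
        raise ValueError("y and x must be non-empty and of equal length")
    return sum((a - b) ** 2 for a, b in zip(y, x)) / len(y)


def fitness(y, x):
    """Fitness of Eq. (38): 1 means identical inputs, 0 means maximally different."""
    return 1.0 - mse(y, x)
```

Identical inputs give a fitness of 1, whereas normalized inputs that differ by 1 in every component give a fitness of 0.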

We adjusted the formula in this way so that a fitness value closer to 1 indicates similarity to the original image and a value closer to 0 indicates a difference, which makes the results easier to grasp. The workings of the GAN-GS-based duplicate pull-request detection are described in Algorithm 3.
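The candidate-selection step described at the start of this section (ranking historical pull requests by combined similarity and suggesting the top k) can be sketched as follows. The way the textual, file-change, and code-change similarities are combined is not specified here, so the equally weighted average used below is an assumption, as are the function names and the example pull-request numbers.

```python
from typing import Dict, List, Tuple


def combined_similarity(text_sim: float, file_sim: float, code_sim: float,
                        weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted combination of the three similarities (equal weights assumed)."""
    w_text, w_file, w_code = weights
    return w_text * text_sim + w_file * file_sim + w_code * code_sim


def top_k_candidates(similarities: Dict[int, Tuple[float, float, float]],
                     k: int) -> List[Tuple[int, float]]:
    """Rank historical pull requests by combined similarity and keep the top k.

    similarities maps a historical pull-request number to its
    (textual, file-change, code-change) similarity with the new pull request.
    """
    scored = [(pr, combined_similarity(*sims)) for pr, sims in similarities.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical similarities of one new pull request to three historical ones.
example = {101: (0.9, 0.8, 0.7), 102: (0.1, 0.2, 0.3), 103: (0.5, 0.5, 0.5)}
candidates = top_k_candidates(example, k=2)
```

In this hypothetical example, pull requests 101 and 103 are returned as the two candidate duplicates, in that order.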

Algorithm 3. Pull-Request Duplicate Detection Using GAN-GS
Input: $y_{k}$, $q_{z} (z)$ parameters
Output: $G_{p} (T)$
1.	Initialize the random population
2.	Train the discriminator and generator by the loss function $\min_{H} \max_{d} v (H, d) = \mathbb{E}_{y \sim q_{data} (y)} [\log d (y)] + \mathbb{E}_{z \sim q_{z} (z)} [\log (1 - d (H (z)))]$
3.	Set the mode collapse $v_{k} = \mathrm{Capsuled} (y_{k})$
4.	If j = 0 and i = 1
5.	Define distance between two places $\| y - x \|_{q} = {(\sum_{j = 1}^{N} | y_{j} - x_{j} |^{q})}^{1 / q}$
6.	Network utility function $U_{K} (T) = \sum_{f \in F^{(K)}} \sum_{p \in P^{(K)}} y_{p f} (T) [α \cdot G_{p} (T) - β \cdot T_{p} (T)]$
7.	Define routing problem $\max \sum_{T = 0}^{T} \sum_{K = 1}^{K} U_{K} (T) \; \text{s.t.} \; (1), (2), (3)$
8.	Update the final values
9.	Define actual fitness function $\mathrm{fitness} = 1 - \mathrm{MSE}$, with $\mathrm{MSE} = \frac{1}{N} \sum_{j = 1}^{N} {(y_{j} - x_{j})}^{2}$
10.	End
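To make the two loss terms used in this section concrete, the sketch below evaluates a sample estimate of the minimax objective in Equation (31) and the margin loss in Equation (33) from plain lists of numbers. It is an illustration only: the function names are hypothetical, the discriminator outputs are assumed to be probabilities in (0, 1), and the default values m⁺ = 0.9, m⁻ = 0.1, and λ = 0.5 are common choices in the capsule-network literature rather than values reported in this work.

```python
import math


def gan_value(d_real, d_fake, eps=1e-12):
    """Sample estimate of v(H, d) = E[log d(y)] + E[log(1 - d(H(z)))].

    d_real -- discriminator outputs d(y) on real samples
    d_fake -- discriminator outputs d(H(z)) on generated samples
    eps    -- assumed floor that avoids log(0)
    """
    real_term = sum(math.log(max(p, eps)) for p in d_real) / len(d_real)
    fake_term = sum(math.log(max(1.0 - p, eps)) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def margin_loss(norms, labels, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss of Eq. (33).

    norms  -- lengths ||v_k|| of the output vectors
    labels -- indicators S_k (1 if class k is present, else 0)
    """
    total = 0.0
    for v, s in zip(norms, labels):
        total += s * max(0.0, m_plus - v) ** 2
        total += lam * (1 - s) * max(0.0, v - m_minus) ** 2
    return total
```

When the discriminator cannot separate real from generated samples and outputs 0.5 everywhere, the estimated objective equals 2 log 0.5 = −log 4, which is the well-known value at the equilibrium of the minimax game.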

5. Results and Evaluation

In this section, we validate our proposed HDL-ODPR model on the publicly available benchmark datasets DupPR-basic and DupPR-complementary. The model is implemented in Python using the open-source, cross-platform Spyder environment. The simulation results of the proposed HDL-ODPR model are compared with those of state-of-the-art models in terms of saved reviewing effort (SRE), F-measure, recall, precision, and accuracy.
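For reference, four of the five measures listed above follow the standard definitions based on the confusion matrix of a binary duplicate/non-duplicate decision. The sketch below uses those textbook definitions; SRE is omitted because its formula is not given in this section, and the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-measure from confusion-matrix counts.

    tp -- duplicates correctly flagged      fp -- non-duplicates wrongly flagged
    fn -- duplicates missed                 tn -- non-duplicates correctly passed
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

The F-measure is the harmonic mean of precision and recall, so it is high only when both are high.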

5.1. Dataset Descriptions

5.1.1. DupPR-Basic Dataset

The DupPR-basic dataset includes the profiles, review comments, and duplication relations between pull requests from 26 well-known open-source projects hosted on GitHub. By examining the review comments of these projects, we create a unique dataset of more than 2000 pairs of duplicate pull requests. DupPR defines each pair of duplicates as a quaternion <proj, pr1, pr2, idtfcmt>. Item proj identifies the project to which the duplicate pull requests belong, and items pr1 and pr2 denote the tracking numbers of the two pull requests. Reviewers use item idncmt, which is a comment on pr1, on pr2, or on both, to state that pr1 and pr2 are duplicates. The dataset is designed to include accidental duplicates, in which none of the authors was aware of the other pull request when submitting their own. We have a total of 1751 tuples of duplicate pull requests. Table 2 summarizes the DupPR-basic dataset in numbers. It lists the most important metrics, such as the numbers of pull requests, contributors, reviewers, review comments, and checks.
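The quaternion described above maps naturally onto a small record type. The following sketch is illustrative only: the class name, the field name `identifying_comment` for the fourth item, and the example values are hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DuplicatePair:
    """One duplicate relation: project, two pull-request numbers, and the
    review comment in which the duplication is stated."""
    proj: str
    pr1: int
    pr2: int
    identifying_comment: str  # hypothetical name for the fourth item

    def involves(self, pr_number: int) -> bool:
        """Return True if the given pull request is part of this pair."""
        return pr_number in (self.pr1, self.pr2)


# Hypothetical example record.
pair = DuplicatePair("example/project", 10, 12, "Duplicate of #10")
```

Making the record immutable (`frozen=True`) allows pairs to be stored in sets or used as dictionary keys when grouping duplicates by project.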

5.1.2. DupPR-Complementary Data

Projects on GitHub use DevOps tools such as Code Climate for static analysis and Travis CI for continuous integration. After a pull request is created or updated, a set of DevOps tools runs automatically to check whether it can be safely merged into the code base. The GitHub limit for authenticated requests is 5000 per hour, but the rate at which events are generated is already higher. Because a single event can lead to multiple dependent requests, a single GitHub account cannot realistically mirror the whole dataset. For this reason, GHTorrent was built from the ground up to use caching extensively and to be distributed, so that multiple users can retrieve data in parallel. We give a quick look at the two ways that GHTorrent works. We mostly use the GHTorrent tables that store pull requests, pull-request issues, history, comments, and issue comments. Table 3 summarizes the DupPR-complementary dataset in numbers.
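The rate-limit argument above is simple arithmetic, sketched below. Only the limit of 5000 authenticated requests per hour comes from the text; the event rate and the number of requests per event in the example are hypothetical, as is the function name.

```python
import math

REQUESTS_PER_HOUR_LIMIT = 5000  # authenticated limit quoted in the text


def accounts_needed(events_per_hour: int, requests_per_event: int,
                    limit_per_hour: int = REQUESTS_PER_HOUR_LIMIT) -> int:
    """Minimum number of accounts needed to keep up with the event stream."""
    total_requests = events_per_hour * requests_per_event
    return math.ceil(total_requests / limit_per_hour)


# Hypothetical load: 6000 events per hour, each triggering 3 dependent requests.
needed = accounts_needed(events_per_hour=6000, requests_per_event=3)
```

Under these hypothetical numbers, 18,000 requests per hour would have to be spread over four accounts, which illustrates why a distributed, cache-heavy design is needed.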

5.2. Comparative Analysis

5.2.1. Performance Comparison with DupPR-Basic Dataset

In this scenario, we validate the proposed GAN-GS technique on the standard benchmark DupPR-basic dataset. Table 4 reports the simulation results of the proposed GAN-GS technique for the GitHub repositories used in our evaluation. As shown in the table, the evaluation covers 20 repositories and validates the model with several performance measures, including saved reviewing effort (SRE), F-measure, recall, precision, and accuracy. Table 5 presents the comparative analysis of the proposed and existing state-of-the-art techniques on the DupPR-basic dataset.

The accuracy of our proposed GAN-GS technique exceeds that of the existing PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering techniques by 5.65%, 11.29%, 16.96%, 22.58%, 28.26%, 33.87%, and 45.16%, respectively. Its precision is higher than that of PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering by 5.67%, 11.34%, 17.13%, 22.68%, 28.36%, 34.27%, 39.69%, and 45.39%, respectively. Its recall is 5.61%, 11.22%, 16.83%, 22.43%, 28.04%, 33.65%, 39.26%, and 44.87% higher than that of PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering, respectively. Its F-measure is 5.64%, 11.28%, 16.92%, 22.56%, 28.20%, 33.84%, 39.48%, and 45.12% higher than that of the existing techniques, which include support vector machines, k-nearest neighbors, random forests, hierarchical clustering, and K-means clustering. Its SRE exceeds that of the state-of-the-art PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering techniques by 5.73%, 11.45%, 17.18%, 22.90%, 28.63%, 34.35%, and 45.80%, respectively.
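The formula behind the percentages above is not stated. As a hedged reading, the first accuracy figures are consistent with the difference between the proposed and the baseline score divided by the proposed score, using the values in Table 5. The sketch below shows this assumed calculation; the function name is hypothetical.

```python
def relative_improvement(proposed: float, baseline: float) -> float:
    """Improvement of `proposed` over `baseline`, in percent of `proposed`.

    Normalizing by the proposed score (rather than by the baseline) is an
    assumption inferred from the reported figures, not a stated definition.
    """
    return (proposed - baseline) / proposed * 100.0


# Accuracy values from Table 5: proposed technique vs. PRs-text and PRs-text-similarity.
vs_prs_text = relative_improvement(92.644, 87.414)
vs_prs_text_similarity = relative_improvement(92.644, 82.184)
```

Rounded to two decimals, these two values reproduce the first two accuracy improvements quoted in the paragraph above (5.65% and 11.29%).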

5.2.2. Performance Comparison with DupPR-Complementary Data

In this case, we validate the proposed GAN-GS approach on the common benchmark DupPR-complementary data. Table 6 compares the proposed and existing state-of-the-art methodologies on the DupPR-complementary data. The accuracy of our proposed GAN-GS technique is higher than that of the existing techniques, PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering, by 9.76%, 15.16%, 20.56%, 25.96%, 31.36%, 36.76%, and 47.55%, respectively. Its precision is higher than that of PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering by 8.65%, 14.14%, 19.64%, 25.13%, 30.62%, 36.11%, and 47.10%, respectively. Its recall is 8.42%, 13.86%, 19.31%, 24.75%, 30.19%, 35.63%, 41.07%, and 46.51% higher than that of the existing techniques, which include PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering. Its F-measure exceeds that of PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering by 8.54%, 14%, 19.47%, 24.94%, 30.40%, 35.87%, and 46.80%, respectively. Its SRE exceeds that of the state-of-the-art PRs-text, PRs-text-similarity, PRs-improved, support vector machine, k-nearest neighbor, random forest, hierarchical clustering, and K-means clustering techniques by 7.96%, 13.55%, 19.14%, 24.73%, 30.32%, 35.91%, and 47.09%, respectively.

6. Conclusions and Future Work

As soon as a pull request is submitted, it should be assigned to integrators for further investigation. Because of the number of pull requests in popular projects, integrators often have difficulty processing them on time. Analyzing existing contributions also takes considerable time, which leads to a growing number of open pull requests. An automated duplicate detection method can save reviewing time and effort. In this work, an optimal pull-request duplication-detection model for OSS repositories is proposed by using a hybrid deep learning technique. A hybrid leader-based optimization (HLBO) algorithm is used to extract textual data from the pull requests. Multiobjective alpine skiing optimization (MASO) is used to compute the change similarity between pull requests. Finally, the hybrid GAN-GS technique is used for pull-request duplicate detection. The proposed HDL-ODPR model is validated against the DupPR-basic and DupPR-complementary datasets. The simulation results show that the proposed HDL-ODPR model is more effective than the existing state-of-the-art models. Its duplicate detection accuracy is 92.644% and 96.87% on the standard benchmark datasets DupPR-basic and DupPR-complementary, respectively. In the future, the proposed approach can be applied to a larger number of repositories from different social coding platforms, and the experimental results may be further improved by investigating other pull-request information or features.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Our proposed model is validated against the public standard benchmark datasets, which are DupPR-basic and DupPR-complementary data [21].

Conflicts of Interest

The author has declared that there is no conflict of interest.

References

1. Dinh-Trong, T.T.; Bieman, J.M. The FreeBSD project: A replication case study of open source development. IEEE Trans. Softw. Eng. 2005, 31, 481–494.
2. Williams, C.C.; Hollingsworth, J.K. Automatic mining of source code repositories to improve bug finding techniques. IEEE Trans. Softw. Eng. 2005, 31, 466–480.
3. Swedlow, J.R.; Kankaanpää, P.; Sarkans, U.; Goscinski, W.; Galloway, G.; Malacrida, L.; Sullivan, R.P.; Härtel, S.; Brown, C.M.; Wood, C.; et al. A global view of standards for open image data formats and repositories. Nat. Methods 2021, 18, 1440–1446.
4. Curry, P.A.; Moosdorf, N. An open source web application for distributed geospatial data exploration. Sci. Data 2019, 6, 1–7.
5. Ali, N.; Guéhéneuc, Y.G.; Antoniol, G. Trustrace: Mining software repositories to improve the accuracy of requirement traceability links. IEEE Trans. Softw. Eng. 2012, 39, 725–741.
6. Tian, Y.; Tan, H.; Lin, G. Statistical properties analysis of file modification in open-source software repositories. In Proceedings of the International Conference on Geoinformatics and Data Analysis, Prague, Czech Republic, 20–22 April 2018; pp. 62–66.
7. Lowndes, J.S.S.; Best, B.D.; Scarborough, C.; Afflerbach, J.C.; Frazier, M.R.; O’Hara, C.C.; Jiang, N.; Halpern, B.S. Our path to better science in less time using open data science tools. Nat. Ecol. Evol. 2017, 1, 1–7.
8. Padhye, R.; Mani, S.; Sinha, V.S. A study of external community contribution to open-source projects on GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories, Hyderabad, India, 31 May–1 June 2014; pp. 332–335.
9. Gousios, G.; Vasilescu, B.; Serebrenik, A.; Zaidman, A. Lean GHTorrent: GitHub data on demand. In Proceedings of the 11th Working Conference on Mining Software Repositories, Hyderabad, India, 31 May–1 June 2014; pp. 384–387.
10. Rahman, M.M.; Roy, C.K. An insight into the pull requests of github. In Proceedings of the 11th Working Conference on Mining Software Repositories, Hyderabad, India, 31 May–1 June 2014; pp. 364–367.
11. Van Der Veen, E.; Gousios, G.; Zaidman, A. Automatically prioritizing pull requests. In Proceedings of the 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, Florence, Italy, 16–17 May 2015; pp. 357–361.
12. Zampetti, F.; Ponzanelli, L.; Bavota, G.; Mocci, A.; Di Penta, M.; Lanza, M. How developers document pull requests with external references. In Proceedings of the 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC), Buenos Aires, Argentina, 22–23 May 2017; pp. 23–33.
13. Jiang, J.; Lo, D.; Ma, X.; Feng, F.; Zhang, L. Understanding inactive yet available assignees in GitHub. Inf. Softw. Technol. 2017, 91, 44–55.
14. Franco-Bedoya, O.; Ameller, D.; Costal, D.; Franch, X. Open source software ecosystems: A Systematic mapping. Inf. Softw. Technol. 2017, 91, 160–185.
15. Dias, L.F.; Steinmacher, I.; Pinto, G. Who drives company-owned OSS projects: Internal or external members? J. Braz. Comput. Soc. 2018, 24, 16.
16. Jiang, J.; Lo, D.; He, J.; Xia, X.; Kochhar, P.S.; Zhang, L. Why and how developers fork what from whom in GitHub. Empir. Softw. Eng. 2017, 22, 547–578.
17. Pinto, G.; Steinmacher, I.; Dias, L.F.; Gerosa, M. On the challenges of open-sourcing proprietary software projects. Empir. Softw. Eng. 2018, 23, 3221–3247.
18. Jarczyk, O.; Jaroszewicz, S.; Wierzbicki, A.; Pawlak, K.; Jankowski-Lorek, M. Surgical teams on GitHub: Modeling performance of GitHub project development processes. Inf. Softw. Technol. 2018, 100, 32–46.
19. Li, Z.X.; Yu, Y.; Yin, G.; Wang, T.; Wang, H.M. What are they talking about? Analyzing code reviews in pull-based development model. J. Comput. Sci. Technol. 2017, 32, 1060–1075.
20. Yu, Y.; Yin, G.; Wang, T.; Yang, C.; Wang, H. Determinants of pull-based development in the context of continuous integration. Sci. China Inf. Sci. 2016, 59, 080104.
21. Li, Z.; Yin, G.; Yu, Y.; Wang, T.; Wang, H. Detecting duplicate pull-requests in github. In Proceedings of the 9th Asia-Pacific Symposium on Internetware, Shanghai, China, 23 September 2017; pp. 1–6.
22. Yang, C.; Zhang, X.H.; Zeng, L.B.; Fan, Q.; Wang, T.; Yu, Y.; Yin, G.; Wang, H.M. RevRec: A two-layer reviewer recommendation algorithm in pull-based development model. J. Cent. South Univ. 2018, 25, 1129–1143.
23. Hu, D.; Zhang, Y.; Chang, J.; Yin, G.; Yu, Y.; Wang, T. Multi-reviewing pull-requests: An exploratory study on GitHub OSS projects. Inf. Softw. Technol. 2019, 115, 1–4.
24. Zhang, Y.; Wu, Y.; Wang, T.; Wang, H. iLinker: A novel approach for issue knowledge acquisition in GitHub projects. World Wide Web 2020, 23, 1589–1619.
25. Yang, C.; Fan, Q.; Wang, T.; Yin, G.; Zhang, X.H.; Yu, Y.; Wang, H.M. RepoLike: A multi-feature-based personalized recommendation approach for open-source repositories. Front. Inf. Technol. Electron. Eng. 2019, 20, 222–237.
26. Nguyen, P.T.; Di Rocco, J.; Rubei, R.; Di Ruscio, D. An automated approach to assess the similarity of GitHub repositories. Softw. Qual. J. 2020, 28, 595–631.
27. Yang, W.; Pan, M.; Zhou, Y.; Huang, Z. Developer portraying: A quick approach to understanding developers on OSS platforms. Inf. Softw. Technol. 2020, 125, 106336.
28. Jiang, J.; Zheng, J.T.; Yang, Y.; Zhang, L. CTCPPre: A prediction method for accepted pull requests in GitHub. J. Cent. South Univ. 2020, 27, 449–468.
29. Eluri, V.K.; Mazzuchi, T.A.; Sarkani, S. Predicting long-time contributors for GitHub projects using machine learning. Inf. Softw. Technol. 2021, 138, 106616.
30. Jiang, J.; Zheng, J.; Yang, Y.; Zhang, L.; Luo, J. Predicting accepted pull requests in GitHub. Sci. China Inf. Sci. 2021, 64, 179105.
31. Golzadeh, M.; Decan, A.; Legay, D.; Mens, T. A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments. J. Syst. Softw. 2021, 175, 110911.
32. Li, Z.X.; Yu, Y.; Wang, T.; Yin, G.; Mao, X.J.; Wang, H.M. Detecting duplicate contributions in pull-based model combining textual and change similarities. J. Comput. Sci. Technol. 2021, 36, 191–206.
33. Li, Z.; Yu, Y.; Zhou, M.; Wang, T.; Yin, G.; Lan, L.; Wang, H. Redundancy, context, and preference: An empirical study of duplicate pull requests in OSS projects. IEEE Trans. Softw. Eng. 2020, 48, 1309–1335.

Figure 1. An overview of system design of proposed HDL-ODPR method.

Table 1. Summary of research gap.

Refs.	Year	Methodology	ML/DL Technique	OSS Repositories	Remarks	Research Gap
[22]	2018	Code review, pull-request prediction	SVM classifier	GitHub: Ruby on Rails and Angular.js.	Detection accuracy	Process for real-world issues may take a long time
[23]	2019	Multi-reviewing, pull-request prediction	Linear regression	GitHub REST API	Resolution latency	Not identifying and acquiring related issues
[24]	2020	Information retrieval, word and doc. embedding	Skip-gram model, PV-DBOW	GitHub: six projects (April 2018)	F-score, W-score and D-score	Affected by data dimensionality issues
[25]	2019	Recommendation process, pull-request prediction	Learning-to-rank (LTR)	GitHub: GHTorrent	Hit ratio, mean reciprocal rank	Different personality traits, educational backgrounds, and expertise levels
[26]	2020	Pull-request prediction	-	GitHub API: Java projects, DABLUE, CLAN	Success rate, Precision	Metadata is curated from different OSS forges
[27]	2020	Text, web data and code analysis	Portrait model	GitHub: TeslaZY	Precision	Does not provide sufficient information
[28]	2020	CTCPPre, pull-request prediction	XGBoost classifier	GitHub: 28 projects	Accuracy, precision, recall	Troublesome and time-consuming
[29]	2021	Pull-request prediction, LTC prediction	Naive Bayes, kNN, decision tree, and random forest	GitHub: REST API	Precision, recall, F1-score, MCC, and AUC	Does not provide a high detection rate
[30]	2019	Pull-request prediction	XGBoost classifier	GitHub: 8 open source	Accuracy, precision	The complexity of modern distributed software development is rising
[31]	2021	Bot prediction, Pull-request prediction	Decision trees, Random forest, SVM, logistic regression, kNN	GitHub: ground truth dataset	Precision, recall and F1-score	Not taking into account the existence of bots, which lowers the detection’s precision

Table 2. Dataset description of DupPR-basic data.

Attributes	Number of Pull-Requests
Attributes	Overall	Duplicate
Number of pull-requests	333,200	3619
Number of contributors	39,776	2589
Number of reviewers	24,071	2830
Number of review comments	2,191,836	39,945
Number of checks	364,646	4413

Table 3. Dataset description of DupPR-complementary data.

Attributes	Number of Pull-Requests
Attributes	Overall	Duplicate
Number of pull-requests	1,144,251	3978
Number of contributors	29,978,291	2789
Number of reviewers	126,697	3145
Number of review comments	2,228,894	42,515
Number of issues comments	2,886,006	4878

Table 4. Simulation results of our proposed GAN-GS technique.

Repository Names	Pull-Requests	Performance Metrics (%)
Repository Names	Pull-Requests	Accuracy	Precision	Recall	F-Measure	SRE
angular/angular.js	31	92.530	92.107	93.145	92.621	91.230
facebook/react	15	92.542	92.119	93.152	92.633	91.242
twbs/bootstrap	47	92.554	92.131	93.164	92.645	91.254
symfony/symfony	33	92.566	92.143	93.176	92.657	91.266
rails/rails	25	92.578	92.155	93.188	92.669	91.278
joomla/joomla-cms	19	92.590	92.167	93.200	92.681	91.290
ansible/ansible	18	92.602	92.179	93.212	92.693	91.302
nodejs/node	15	92.614	92.191	93.224	92.705	91.314
cocos2d/cocos2d-x	3	92.626	92.203	93.236	92.717	91.326
rust-lang/rust	9	92.638	92.215	93.248	92.729	91.338
ceph/ceph	9	92.650	92.227	93.264	92.741	91.350
zendframework/zf2	9	92.662	92.239	93.272	92.753	91.362
django/django	3	92.674	92.251	93.284	92.765	91.374
pydata/pandas	3	92.686	92.263	93.296	92.777	91.386
elastic/elasticsearch	6	92.698	92.275	93.308	92.789	91.398
JuliaLang/julia	3	92.710	92.287	93.324	92.801	91.410
scikit-learn/scikit-learn	3	92.722	92.299	93.332	92.813	91.422
kubernetes/kubernetes	13	92.734	92.311	93.344	92.825	91.434
docker/docker	7	92.746	92.323	93.356	92.837	91.446
symfony/symfony-docs	19	92.758	92.335	93.368	92.849	91.458
Average		92.644	92.221	93.254	92.734	91.344

Table 5. Comparative analysis with DupPR-basic dataset.

Detection Techniques	Metrics (%)
Detection Techniques	Accuracy	Precision	Recall	F-Measure	SRE
GAN-GS	92.644	92.221	93.254	92.735	91.344
PRs-text [21]	87.414	86.991	88.024	87.505	86.114
PRs-text-similarity [32]	82.184	81.761	82.794	82.275	80.884
PRs-Improved [33]	76.954	76.531	77.564	77.045	75.654
k-nearest neighbor	66.494	66.071	67.104	66.585	65.194
Random forest	61.264	60.841	61.874	61.355	59.964
Support vector machine	71.724	71.301	72.334	71.815	70.424
Hierarchical clustering	56.034	55.611	56.644	56.125	54.734
K-means clustering	50.804	50.381	51.414	50.895	49.504

Table 6. Comparative analysis with DupPR-complementary dataset.

Detection Techniques	Metrics (%)
Detection Techniques	Accuracy	Precision	Recall	F-Measure	SRE
GAN-GS	96.87	95.23	96.12	95.6729	93.56
PRs-text [21]	87.414	86.991	88.024	87.505	86.114
PRs-text-similarity [32]	82.184	81.761	82.794	82.275	80.884
PRs-Improved [33]	76.954	76.531	77.564	77.045	75.654
Random forest	61.264	60.841	61.874	61.355	59.964
K-means clustering	50.804	50.381	51.414	50.895	49.504
Hierarchical clustering	56.034	55.611	56.644	56.125	54.734
Support vector machine	71.724	71.301	72.334	71.815	70.424
k-nearest neighbor	66.494	66.071	67.104	66.585	65.194

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
