Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics

Shaikh, Mohsin; Tunio, Irfan; Khan, Jawad; Jung, Younhyun

doi:10.3390/math12142201

Open AccessArticle

Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics

¹

Department of Computer Science, The University of Larkano, Larkana 77062, Pakistan

²

Department of Electronics Engineering, The University of Larkano, Larkana 77062, Pakistan

³

School of Computing, Gachon University, Seongnam 13120, Republic of Korea

^*

Authors to whom correspondence should be addressed.

Mathematics 2024, 12(14), 2201; https://doi.org/10.3390/math12142201

Submission received: 13 May 2024 / Revised: 8 July 2024 / Accepted: 10 July 2024 / Published: 13 July 2024

(This article belongs to the Special Issue Application of Artificial Intelligence in Decision Making)

Download

Browse Figures

Versions Notes

Abstract

:

Source code complexity of legacy object-oriented (OO) software has a trickle-down effect over the key activities of software development and maintenance. Package-based OO design is widely believed to be an effective modularization. Recently, theories and methodologies have been proposed to assess the complementary aspects of legacy OO systems through package-modularization metrics. These package-modularization metrics basically address non-API-based object-oriented principles, like encapsulation, commonality-of-goal, changeability, maintainability, and analyzability. Despite their ability to characterize package organization, their application towards cost-effective fault-proneness prediction is yet to be determined. In this paper, we present theoretical illustration and empirical perspective of non-API-based package-modularization metrics towards effort-aware fault-proneness prediction. First, we employ correlation analysis to evaluate the relationship between faults and package-level metrics. Second, we use multivariate logistic regression with effort-aware performance indicators (ranking and classification) to investigate the practical application of proposed metrics. Our experimental analysis over open-source Java software systems provides statistical evidence for fault-proneness prediction and relatively better explanatory power than traditional metrics. Consequently, these results guide developers for reliable and modular package-based software design.

Keywords:

software maintenance; package-level code analysis; fault-proneness prediction

MSC:

68T35

1. Introduction

Software design is one of the most important phases in software development. The careful implementation of design decisions is vital for the maintainability, testability, and reliability of software systems. The maintainable design of software is an inevitable objective for any organization to satisfy its ever-changing clients’ requirements and continuous maintenance. The maintenance phase typically entails the modification, addition, and removal of source code entities (classes, methods); however, such activities also incur a gradual deficit in the quality of software architecture [1,2,3]. For a long time, classes were considered fundamental architectural or design constituents in OO software systems. This approach of software design remains inadequate for mitigating challenging dependency management, increasing complexity, high maintenance cost, and operational risks due to fragile architectural modularization [4,5,6].

To provide insight, we highlight the following limitations of class-level metrics:

As explained earlier, historically coupling and cohesion-based metrics have been considered possible quality indicators at class level. It typically narrows the focus since these metrics can quantify a single class, ignoring the large architectural view of code such as packages, and while class-level metrics are useful to a specific class, they are incapable of capturing broader interactions and dependencies that exist within the same package and across the package in a software system. Therefore, class level does not help to comprehend fault-prone areas of systems completely, unlike packages.
In the context of identifying the fault-prone spots in software systems, inspecting the core issues of code base (improper modularization, cyclic dependencies, abstraction of classes) at the architectural level is only possible through package-level metrics. On the contrary, classes fail to provide such information as their structures do not provide higher-level organization views of code, unlike packages. Therefore, relying on solely class-level metrics may lead to overlooking the weakness at the architectural level.
The theme of our research is to actually detail the inadequate representation of dependency interaction formed in software systems. Software systems are composed of numerous interacting classes that span across the packages and can only be identified through package-level quantification. However, undermining these inter-package and intra-package dependencies in fault-prediction models would give inappropriate results.

It is important to understand, assess, and manage the structural components of software systems for ensuring flexible and quality-oriented design. An efficient way to address the aforementioned design issues is to develop a system with a higher level of abstraction by grouping classes into coarse-grained entities, i.e., packages. This coding norm of splitting up classes (preferably with uniform task) into separate packages improves design and development, unlike the orthodox approach of class-based OO structure [7,8,9]. Good packaging has been claimed to ease the understanding, maintenance, testing, and evolution of software systems through the proper organization of classes and allowing service-flow mechanisms among packages [10,11]. Packages are modular and organizational units of many modern object-oriented programming languages, like Java, Small-Talk, and Ada [12]. During the re-factoring process of source code, changes in architectural and functional parts of a package are quite obvious [13,14]. The entities (e.g., classes, methods, attributes) within the packages are interactive, thereby affecting its internal and external dependencies. Therefore, the addition of dependencies, classes, and replacement of entities during the restructuring process often affect package cohesion and coupling properties.

In the last decade, modularity-based source code composition and architectural strength using packages have been active research subjects in the domain of software quality [15,16,17,18]. It is worth stating that Martin’s proposed metrics suite to measure the quality of packages in terms of scalability, abstraction, cohesion, reusability, and maintainability is considered a pioneering effort [19]. Mood mapped out another well-known metric by upgrading class-level measures to package structurally [20,21,22,23]. Elish et al. appraised the statistical significance of Martin’s and Mood’s metric suites through the correlation and regression analysis of package faults found before and after the software release [24]. Sarkar et al. put forward the novel definitions of modularization metrics that characterize different aspects of software modularization such as internal/external coupling, cohesion based on segregation and functional unity, inter-package coupling, fragility of base classes present in different packages, etc. [25]. The theoretical foundation of Sarkar’s metrics is based on the theme that software components, in particular packages, interact with each other through application programming interfaces (APIs). Further, Zhao et al. recently presented an empirical analysis of Sarkar’s metrics suite for their effectiveness towards fault-proneness prediction [26]. Having recognized the application of packages for ensuring sustainable software architecture, there is a need to further evaluate software quality based on packages using quality estimation techniques, especially fault-proneness prediction.

Software systems are often vulnerable to aging effects in case their adaption is avoided due to ever-changing technological requirements. Eventually, these systems would be expensive in terms of maintenance, testing, evolution, reusability, interoperability, and flexibility. Conventional organizations of a large scale frequently allow running years-old computer programs to continue their business operations uninterrupted, and consequently, that program becomes legacy software systems. Complications of legacy software systems keep aggravating due to the presence of rigid hierarchies in their code base, making testing and inspecting troublesome. As a matter of fact, in the typical approach of the business enterprise to resist change in their operational activities, unhandled architectural dependencies may arise, threatening the functional failure of software systems. Legacy systems often become non-productive due to their lagging adaptability with modern technological changes, hardware, software upgradation, and the continuous tracking of the software-evolution process.

Nevertheless, legacy software systems cannot be discarded because of their economic significance. Despite valuable existing research work for assessing software quality with packages, there is still an ongoing effort to explore further aspects of package modularity. In particular, legacy object-oriented systems that are not developed on modern design pattern paradigms, like explicit declaration of application programming interfaces (APIs), require illustration of architectural quantification. There are legacy software systems coded using languages like, Java, small-talk, and Ada with classes at a lower level of source code granularity contained in packages as subsystems or modules.

The architectural quality of software systems is extensively dependent on design and source code artifacts of different granularity levels (packages, classes, methods, etc.). Structured and organized programming components are helpful for easier maintenance, evolution, reliability, scalability, and portability of software systems. However, software that is not updated, modified, and re-factored according to modern technological needs is vulnerable to becoming legacy software systems due to architectural erosion and gradual design deterioration. In addition, they lack declared API for reusability and re-engineering, making source code complex and liable for the organizations running that software. Importantly, the structured organization of packages with identifiable components and effective collaboration with other source code units of software design ensures easier maintenance, timely evolution, and robust testing. In modern software systems, the package-based design assumes an explicit declaration of API-adhering object-oriented programming norms, i.e., inheritance, encapsulation, and polymorphism.On the contrary, legacy systems that are often found with the absence of declared APIs are essentially quantified (defined with metrics) using additional interpretations and heuristics.

In this context, H. Abdeen et al. proposed a complementary set of package-level metrics based on fundamental object-oriented principles: information-hiding, changeability, and reusability [27]. The application of these metrics has a main focus on legacy object-oriented systems with packages being units of software modularization. Proposed metrics emphasize the role of a package as a provider of well-identified services to the rest of the software components which can be accomplished through method call and interacting classes at the lowest level of granularity. It was also reported how source code quality in terms of goal focus and service cohesion can be achieved with proposed package-modularization metrics. However, further investigation of this modularization for fault-proneness, especially in effort-aware modeling, may validate their usefulness.

This paper explores the utility of H. Abdeen et al.’s package-modularization metrics for predicting the fault-proneness of packages in object-oriented software systems. We specifically examine effort-aware prediction models developed with these package-modularization metrics in terms of classification and ranking scenarios. An experimental study is conducted over open-source software systems. This paper provides empirical evidence-based insights into the application of package modularization for identifying fault-prone packages in legacy object-oriented systems. This approach can be helpful for the efficient and effective allocation of resources to ensure quality software architecture.

The following are key contributions of our research:

Theoretical framework: We present the theoretical illustration of package-level modularization metrics comprehensively. These metrics objectively explain and formulate non-API-based object-oriented principles: commonality-of-goal, changeability, maintainability, and analyzability. This framework forms a novel perspective to study and evaluate the object-oriented systems.
Empirical evaluation: Our research has the main focus of empirically assessing the relationship between package-level metrics and fault-proneness. We have employed robust statistical techniques of correlation and logistic regression.
Effort-aware mechanism: Our experimental work is rigorously carried out with the application of effort-aware classification and effort-aware ranking, ensuring not only statistical impact, but practical viability in real-world scenarios of software development, testing, and debugging.
Guidance for package-based design: Results and findings of our study can provide guidance for reliable and maintainable package design of software systems. Moreover, results can be helpful in re-engineering legacy systems, setting a benchmark for fault resilience and refactored package design.

The rest of this paper is structured as follows: Section 2 explains the most related work; a detailed description of these metrics is summarized in Section 3; the empirical study is illustrated with a brief analysis of sub-sections in Section 4; threats to validity are mentioned in Section 5 followed by a discussion as Section 6 and a conclusion of the study in Section 7.

2. Related Work

There have been notable research efforts devoted to defining the approaches of fault prediction using packages [28,29]. Zimmermann et al. utilized CK metrics to identify fault-prone packages which is considered the earliest effort in this domain [30]. Kem et al. proposed an effort-aware model to enhance the reliability of fault-prediction approaches [31]. Chao NI et al. discovered that package-level prediction models outperform traditional models using effort-aware ranking [32]. In their study, they used random forest algorithm to build prediction models with process-level metrics. We list some of the research gaps in the above-mentioned studies that have not addressed fault prediction at different levels of artifacts:

1.: Scope of CK Metrics in Fault Proneness Prediction
Zimmermann et al. utilized CK metrics to identify fault-prone packages, still considered a pioneering attempt. However, it centered evaluation primarily on class-level metrics aggregated to the package level [30]. A. Dalal et al. also informed that despite the discriminatory power of class-level metrics, it does not provide architectural views and fragility at a higher level of granularity [33]. A recent study by Zheng et al. also reported that the baseline used in class-level prediction is not sufficient enough to draw a conclusion with a broader scope [34]. Moreover, most studies related to class-level fault prediction only use CK metrics remaining under critical focus and diminishing validity [35]. This approach may miss specific package-level structural issues, such as inter-package dependencies and architectural smells, that our study aims to address with more targeted package-modularization metrics.
2.: Effort-Aware Models lacking utility
Kem et al. proposed an effort-aware model to enhance fault-prediction reliability with limited contextual application. Their approach did not extensively explore the application of package-level metrics [31]. Our research fills this gap by integrating package-level metrics into effort-aware models, demonstrating their utility in a practical context. In an expansion of their work, we introduced effort-aware ranking and classification as two types of modeling techniques.
3.: Insufficient Exploration of Non-API-Based Metrics
Chao NI et al. and others have shown that package-level prediction models can be productive against class-level metrics [32]. Our study goes further by focusing on non-API-based package-modularization metrics, providing a new perspective on fault-proneness prediction. Moreover, metrics quantifying the packages differ in definition and mathematical model than metrics quantifying classes. These dimensions of categorized package level seem to be ignored or missing in recent studies. Despite the empirical significance and statistical evidence, past studies have either focused on theoretical explanations or did not discuss their application to the architectural view of software systems [36,37].
4.: Lack of Empirical Validation
Recent studies by Zhao et al. and Yang et al. have laid the groundwork for exploring software maintenance with package-modularization metrics [26,29]. As a matter of fact, recent studies related effort-aware prediction modeling to elaborated performance indicators and holistic theoretical framework [38,39]. It is quite worth noticing that validating the non-API metrics was never part of their studies. Their studies were limited to empirical validation of conventional metric types, i.e., package cohesion and package coupling. However, there is a need for the empirical validation of API-based or non-API-based metrics in real-world scenarios. Our research aims to provide this validation, offering statistical evidence for the practical application of these metrics.

H. Abdeen et al. introduced a composite metric suite for non-API-based legacy software systems, focusing on package quality and modularization. However, their studies did not explicitly address fault-proneness prediction using these metrics. Our work extends their research by applying these metrics to fault-proneness prediction, highlighting their practical utility.

Nevertheless, there have been certain crucial findings in prior studies that led to the synthesis of key insights matching with our research theme. Following are salient features, which we can be utilized to elaborate our research:

Lifting up traditional mechanism: Zimerman et al. explored that the conventional approach to fault prediction using class-level metrics may not be convincing. In their research, lifting classes into packages and then aggregating the faults for package-level fault prediction was an initial approach

Introduction of Effort-Aware Model: Advancement in the mechanism of evaluation models and incorporating effort-aware techniques made a significant contribution towards the novelty of the fault-prediction domain. Admitting this fact, Kem et al. conducted the study to incorporate effort-based prediction in experimental analysis and paved the way for package-level metrics to build on this concept [31].

In summary, our research application and empirical validation of non-API-based package-modularization metrics addresses the identified gaps as listed above. We believe the application of our research theme in real-world scenarios covers the shortcomings of previous studies efficiently. Furthermore, exploring key areas, like, diverse dimensions of reliability, maintainability, software design, and software re-engineering is focus of our research.

However, recent research studies by Zhao et al. and Yang et al. on the fault-proneness prediction of packages is a state-of-the-art towards exploring software maintenance with package-modularization metrics practically [26,29]. More importantly, their studies form the basis of motivation for us to further align our research in the same direction. Recently, H. Abdeen et al. and Stephen Ducas et al. carried out substantial research work towards different areas of source code maintenance (http://stephane.ducasse.free.fr/, accessed on 10 Januray 2024): re-modularization, metrics quality, and cycle and layer identification. Interestingly, the package component of the OO programming design aspect has been their major focus. Peculiarly, the research studies of H. Abdeen et al. revolved around subjects like enhancing package coupling and cycle minimization through package quality; modularization metrics for legacy systems; visual comprehension of package relationship in source code; and improving package structure without affecting design parameters adversely [5,16,40,41]. Mohsin et al., taking the same direction, conducted a comprehensive empirical study for multiple maintenance tasks using package quality metrics proposed by H. Abdeen et al. [42]. It is quite pertinent to describe that later the study of H. Abdeen et al. mainly described a composite metric suite for a non-API-based legacy software system [27]. Taking the same notion of determining software quality through fault-proneness prediction, it is quite desirable to find statistical evidence for the utility of these metrics, taking into account effort as an evaluation measure.

3. Methodology

3.1. Description of Package-Level Metrics under Investigation

Table 1 provides a summary of package modularization proposed by H. Abdeen et al. All these metrics were designed considering the role of a package as a client and provider in modularization design. These metrics basically determine the modular quality of structural properties in legacy object-oriented software systems, where application programming interface (API) functionality is implemented through the interaction of classes among the packages. Therefore, due to the commercial importance of legacy object-oriented systems, it is necessary to maintain their design according to modern object-oriented principles. Mainly, these principles have been devised with goals of conformance with minimum communication and dependency among packages, appropriate size (methods/classes) of packages, packages being service provider entities, and packages designed with a consistent goal.

IPCI (Package Changing Impact Index) It measures the extent to which the impact of change on a particular package is realized due to inter-package dependencies.
IIPUD (Inter-Package Usage Diversion Index): It measures the extent to which communication diversion or association among the packages is produced without inheritance relationship.
IIPED (Inter-Package Extending Diversion Index): It measures the extent to which usage communication or inheritance among the packages is produced.
PF (Package Goal Focus Index): It measures the extent and frequency to which the services of packages are required by other client packages.
IPSC (Package Service Cohesion): It measures the extent of similarity of the purpose of the service class or cohesiveness of composite service for the package to be fulfilled.

3.2. An Illustrated Example of Package Design

We provide a comprehensive illustration of non-API-based package-level metrics proposed by H. Abdeen et al. through an example shown in Figure 1. As can be seen, there are basically six (6) packages, whereas our focus package is P. In order to present the example for the simplicity, convenience, and clarity of readers, we have shown Figure 1 as abstract as possible. In this simple package, design classes are represented with class diagrams denoted with C1, C2, …, Cn and dotted arrows represent use dependencies across and within the packages while solid arrows indicate existence of inheritance or extend relationships across or within the packages.

IPCI for package P would account for the outgoing dependencies of classes C2, C6, and C7 towards packages P1, P2, and P5. More outgoing arrows indicate higher-value IPCI. IIPUD refers to the communication of package P within other packages through use relationships. Each dotted line from P1, P2, P3, P4, to P5 shall calculate the factor of IIPUD. Similarly, IIPED accounts for the extent to which another package, P1, P2, P3, or P4, utilizes package P through inheritance or the extend relationship. It is quite visible from Figure 1 that P1 and P2 are connected to P using the extend relationship through classes C11, C21, and C22. The PF metric would assess the services provided by P or used together with other packages. In the context of the figure, classes C2 and C3 are serving the extend relationship to C11 of package P1 and C21 of package P2. IPSC determines the similarity of purpose services between and among the packages. For the package P class, C1 and C2 are examples of leveraging common services of inheritance. This relationship is built in the form of composition from packages P1 and P2.

Generally, the block diagram illustrates the relationships between classes in different packages. By understanding these relationships, we can reason about the factors that would influence the modularization metrics proposed by Abdeen. However, calculating the specific metric values would require static analysis and the parsing of source code and script, as defined in the following sections.

3.3. Description of Robert Martin’s Metrics

Table 2 lists the description of notations used in Table 1 for the better comprehension of metrics. The values of these metrics range between 0 and 1, implying that the larger the value of the metric, the better the modularization quality of the package would be. Table 3 provides definitions of well-known R.C Martin metrics, which can help to measure package quality at an early stage of development. Moreover, R.C Martin metrics are often set as baseline in fault prediction-related research studies involving packages [26,43].

We can observe from the definitions in Table 3 that object-oriented properties, like abstractness, independence, responsibility, and extensibility of package components are quantified by metrics A, Ce, Ca, and N, respectively, whereas I indicates the extent to which the package module is resilient against structural change, while D represents balance produced by the package between stability and abstractness. It can be extracted that the package becomes more extendable with higher values of N and Ca, and it becomes more independent with a lower value Ce. Similarly, the lowest value (0) of I indicates complete instability of the package and the highest value (1) of I represents the complete stability of the package.

The Empirical Study

In this section, we aim to investigate the ability of H. Abdeen’s package-level metrics to predict the faults in a legacy object-oriented system compared with another recognized traditional metrics suite. In the larger picture, our research study explores the influence of package-level metrics over the quality and maintainability of the source code. In order to carry out an empirical evaluation of these metrics, the Martin metrics suite described in Table 3 is set as baseline in our experimental context. As a matter of fact, package-level metrics proposed by Robert Martin have been considered the standard benchmark in the prevalent research literature, and all other metrics are either studied against Robert Martin metrics or evaluated in combination [26,30,44]. This fact is in line with many studies related to fault prediction using class-level metrics where CK metrics are set as the standard quantification mechanism for class cohesion, coupling, etc. [35]. In the broader picture, H. Abdeen metrics or other proposed metrics assume specific design features and aspects (e.g., API-based, non-API-based, multilevel packages, etc.); therefore, research practice of combining newly formulated package-level metrics and traditional Robert Martin metrics is frequent to obtain better insight. Specifically, our study explores the practical value of H. Abdeen’s modularization metrics from the perspective of the following research questions.

RQ1: Do these metrics present unique and useful aspects of package-modularization metrics in relation to the traditional Martin metrics suite?
RQ2: Do these metrics, together with traditional metrics, improve the fault-proneness prediction of packages in effort-aware modeling scenarios?

These research questions are devised to help software practitioners understand the importance of incorporating package modularity in an object-oriented software design. The idea behind RQ1 is to determine whether there is redundancy between traditional metrics and the H. Abdeen modularization metrics suite. More importantly, RQ1 tends to explore the H. Abdeen metrics in terms of statistical and quantitative application. The objective of RQ2 is to determine the application of H. Abdeen’s modularization metrics towards fault-proneness prediction, taking into account the evaluation measure of the effort invested. To be more precise, effort-aware modeling techniques are sound and sufficient enough for evaluating the cost-effectiveness of metrics for predicting faulty modules.

3.4. Data Processing

Figure 2 shows the information-gathering mechanism from CVS or subversion repositories and defect-tracking systems, i.e., Eclipse (https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/, accessed on 13 Januray 2024), JIRA (https://issues.apache.org/jira/secure/Dashboard.jspa, accessed on 13 Januray 2024), and Promise (http://openscience.us/repo/issues/bugfiles.html, accessed on 13 Januray 2024). A commercial tool Understand (https://scitools.com/, accessed on 13 Januray 2024) is utilized to parse source code of open-source software systems. We developed our perl-API (https://support.scitools.com/support/solutions/articles/70000582858-getting-started-with-the-perl-api, accessed on 13 Januray 2024) script to query the Understand database to compute H. Abdeen’s metrics. Data analysis and modeling is performed using statistical tool R (https://www.r-project.org/, accessed on 13 Januray 2024). For replication purposes, we have made data and code related to this study available on public site (https://github.com/Analyzer2210cau/Effort-Aware-Package-Metrics, accessed on 13 Januray 2024).

3.5. Evaluation Mechanism

In order to obtain reliability and effectiveness in our experimental investigation, we have utilized a well-known methodology of cross-validation evaluation. In cross-validation, the Dataset is essentially partitioned in random n-folds, then one partition is set as a testing set and all other partitions are used as a training set. The testing set with each partition shall have a predicted label (dependent variable), whereas the training set shall eventually be carrying independent variables. Furthermore, training and testing partitions are processed under the settings of n-folds, and n-times repeatedly to obtain predictive insights rigorously. In our case, we have employed a 10-fold, 10-time cross-validation mechanism over each Dataset (dividing the Dataset into 90% training and 10% testing then changing the test and train Dataset at each iteration) to acquire unbiased evaluation results. This is common practice in research pertaining to fault prediction of software systems [9,30]. This process is expected to provide realistic estimation and reduce bias and hence reach effective prediction models generated using multivariate logistic regression.

Figure 3 provides a visual representation of the entire evaluation methodology. Basically, our methodology is completed in four phases, i.e., data-gathering, static analysis, Dataset formation, and performance analysis. As explained earlier, rigorous, incremental, and continuous testing of code and Dataset is required to obtain precision in results.

3.6. Modeling Technique

In this section, we illustrate the theoretical background of techniques used to obtain results in our experiments. Additionally, these techniques are further described from the perspective of fault-proneness prediction using H. Abdeen’s metrics.

3.6.1. Correlation Analysis

Correlation analysis is a method of statistical evaluation used to study the strength of a relationship between two variables. Spearman’s rank correlation bears wide usage in research studies for determining the association between independent variables (software metrics) and actual faults. In this study, correlation significance is computed at confidence interval levels 95% and 90%, i.e., p-value < 0.05, p-value < 0.1, and p-value < 0.2. This analysis mainly helps to investigate the RQ1 in determining the extent and significance of the relationship between studied metrics and post-release faults.

3.6.2. Multivariate Logistic Regression

Statistic logistic regression is a technique for modeling the probability of prediction. The independent variables, when modeled, can output values between 0 and 1 after being fitted to a logit function. A multivariate logistic regression model can simply be described using the following equation:

P r (Y = 1 | X_{1}, X_{2}, \dots, X_{n}) = \frac{e^{α + β_{1} X_{1} + β_{2} X_{2} + \dots β_{n} X_{n}}}{1 + e^{α + β_{1} X_{1} + β_{2} X_{2} + \dots β_{n} X_{n}}}

In the context of our experimental setup,

X_{1}, X_{2}, \dots, X_{n}

are independent variables, i.e., the metric value of H. Abdeen’s package-level metrics or R.C Martin package-level metrics,

P r (Y = 1 | X_{1}, X_{2}, \dots, X_{n})

represents the probability that the outcome of a dependent variable becomes

Y = 1

, i.e., the package being predicted as faulty using characteristic H. Abdeen’s modularization metrics. We evaluate the performance of a fault-prone prediction model with several strategies considering usage scenario and maximum reliability of prediction. Multivariate logistic regression is experimentally important to find an answer for RQ2. We deploy this technique to build two types of models: (1) the “T” model (using Martin’s metric suite) and (2) the “H.abdeen + T” model (using H. Abdeen modularization in combination with traditional Martin’s metric suite). Before building these models, multi-collinearity among the independent variables with a variance inflation factor (VIF) less than 10 is detected. Thus, variables having VIF ≥ 10 are excluded for modeling the prediction in order to avoid redundancy and ensure experimental worth. Afterward, 10-time 10-fold cross-validation logistic regression is applied to evaluate comparative effectiveness between models: “T” and “H. Abdeen + T”.

In order to investigate RQ2, two typical application scenarios are employed: classification and ranking. In the ranking evaluation method, packages are ranked in descending order according to the degree of their predictive risk. Thereupon, with this rank of packages in hand, software managers presumably can allocate their testing and inspection resources to “priority identified” packages. In classification scenarios, the division of package with relative predictive risk into “high-risk” and “low-risk” takes place. As a desirable practice, developers would aim to fix the “high-risk”-predicted packages with the least cost, thereby making prediction effort-aware. Previously, many studies were recommended to assess the cost-effectiveness of prediction models [36,45]. Recently, there has been considerable emphasis and focus on the application of effort-aware performance measures to evaluate prediction models effectively and rigorously [46,47]. In the following, we further elaborate on effort-aware performance indicators used in our study.

3.6.3. Effort-Aware Ranking

In the ranking, the cost-effectiveness of a fault-prone module is computed using source lines of code (SLOC) as a proxy. Effort-aware ranking is basically based upon the “SLOC-Alberg diagram” as shown in Figure 4. In Figure 4, the x-axis represents the cumulative percentage of packages selected from the ranking list and the y-axis represents the cumulative percentage of faults found in these selected packages. Additionally, the curve for the optimal and prediction models is also shown in this diagram.

3.6.4. Effort-Aware Classification

In classification, effort reduction is the most popular technique for performance indication. By definition, ER denotes the ratio of the reduced source lines of code to inspect or test by using a classification model compared with random selection to achieve the same recall of faults. In order to obtain computational reliance, the study by Zhao et al. proposed to set the prediction threshold as a prerequisite in an effort-aware classification scenario [26]. Zhao et al. mainly presented certain performance benchmarks as parameters for evaluating effort-aware prediction, i.e., Balanced-Pf-Pd-Metric (BPP) and Maximum-F-measure (MFM). The BPP method leverages a Receiver Operating Curve (ROC) for prediction by setting “balance” as a classification threshold, whereas the MFM method chooses the F-measure (harmonic mean of precision and recall) as threshold while training the Dataset. BPP particularly utilizes ROC in the context of binary classification. The ROC curve is a graphical plot that illustrates binary classification system diagnostic, i.e., the prediction of the package being faulty or non-faulty. Balanced-Pf-Pd is more accurately defined as a balance between Positive Fraction and Positive Difference. In this context, ROC sets the threshold-balancing sensitivity and specificity in an effective manner.

The following are fomulas for BPP and MFM:

$B P P = 1 - \sqrt{{(0 - p f)}^{2} + {(1 - p d)}^{2}} / \sqrt{2}$ whereas pf denotes the probability of false positives and pd denotes the probability of faults detected.
$M F M = 2 \times R e c a l l \times P r e c i s s i o n / (R e c a l l + P r e c i s s i o n)$ . Recall is calculated as the ratio of packages correctly classified as the faulty to total number of packages in a Dataset. Precision is calculated as a ratio of packages correctly classified as the faulty to total number of packages classified as faulty in a Dataset.

ER-MFM and ER-BPP indicate effort reduction under BPP and MFM metrics described above. ER-based models are generally cost-efficient classification mechanisms. We tend to adopt the same strategy in order to ensure that H. Abdeen’s modularization metrics may not conflict with existing experimental settings of traditional metrics.

3.7. Datasets

For our study, we have utilized data collected from different types of open-source Java software systems as shown in Table 4. Specifically, in selecting these Datasets as our subject systems, we have considered different factors, i.e., open-source software of varying nature and diverse domain from application perspective, source code having large and small numbers of packages and recognition in the research literature of fault-proneness prediction. These Datasets include:

Eclipse (https://eclipse.org/, accessed on 14 Januray 2024): an Integrated Development Environment (IDE) for software development widely used in collaborative and corporate industries.
POI (https://poi.apache.org/, accessed on 14 Januray 2024): a tool aimed at creating and maintaining Java APIs for manipulating various file formats.
Lucene (https://lucene.apache.org/, accessed on 14 Januray 2024): a high-performance text search engine enriched with many features and coded in Java.
Camel (http://camel.apache.org, accessed on 14 Januray 2024): a Java API for defining routing and mediating rules of domain-specific languages.
JDTCore (https://eclipse.org/jdt/core/, accessed on 14 Januray 2024): an incremental Java compiler for Eclipse IDE.

Additionally, we initially considered the following Datasets:

Ant (https://ant.apache.org/, accessed on 14 Januray 2024): a Java library and command-line tool whose mission is to drive processes described in build files as targets and extension points dependent upon each other.
JEdit (https://www.jedit.org/, accessed on 14 Januray 2024): a mature programmer’s text editor with extensive development efforts behind it.

However, JDTCore-4.2, Ant-1.6, and JEdit-4.2, JEdit-4.3 were discarded during the experimental analysis phase. The primary reason for excluding these Datasets was that they contained a relatively small number of faulty packages, which made them unsuitable for inclusion in the research control group. The limited presence of faulty packages in these Datasets could lead to statistical biases and unreliable results, and hence their exclusion was necessary to maintain the integrity and reliability of our experimental analysis.

By focusing on Datasets with a sufficient number of faulty packages, we ensure that our fault-prediction models are robust and our findings are statistically significant. This selection process allows us to draw more accurate and meaningful conclusions about the effectiveness of package-modularization metrics in fault prediction.

Datasets are formed with a matrix format that consists of six metrics by R.C Marin [19], five package-modularization metrics formulated by H. Abdeen et al. [27], and a corresponding number of faults in the package of software systems.

Table 4 provides descriptions of all the Datasets having columns of the names of software systems, versions used, number of total packages, total number of faults, number of major faults, percentage of faults, inter-package non-API method, and source lines of code represented in 1000 s, respectively. To elaborate, from a programming perspective, non-API-based package dependencies are formed through concrete methods not having any abstract, Java standard library, or implicit references, whereas source lines of code in a 1000 s (KLOC) metric measuring size of the subject system are used to devise effort-aware ranking models. It can be extracted from detailed structural information of subject systems that non-API method calls are abundant in all Datasets. This culminates in inter-package calls being subject to systems routed extensively through non-API methods. Another advantage of using these Datasets is the availability of their fault data on public forums and sites like Eclipse Fault Data (https://www.st.cs.uni-saarland.de/softevo/bug-data/eclipse/, accessed on 14 Januray 2024) and PROMISE (https://openscience.us/repo/, accessed on 15 March 2024). Therefore, datasets form multiple aspects of architectural views to produce meaningful statistical conclusions. Our experimental study is designed to encompass comprehensive elaboration of research questions using these Datasets.

Figure 5 presents a box plot to illustrate the distribution of H. Abdeen’s modularization metrics in studied Datasets. Figure 5 specifically shows the representation of metrics as the box-plot distribution within the 25th and 75th percentile, and the median value is shown as a horizontal line. This distribution clearly depicts that the median value of IIPED and IPCI is quite high in all the Datasets ranging from 0.7 to 1, approximately. It employs packages in all Datasets influenced by the extend relationship change to a larger extent, and similarly, there can be an extensive use relationship in inheritance-based structures of packages in all the Datasets. However, the JDTCore-3.4 covers the maximum distribution of IPCI among all the Datasets. Furthermore, the median value distribution of IIPUD ranges between 0.1 and 0.5 approximately, whereas the metric value distribution is quite diverse, unlike other findings.

This shows that in most of Datasets there is relatively lesser association of packages based on IIPUD, however, it persists to be increasing in POI-2.5 and POI-3.0. Distribution of IPCI and PF metric values are seen to depict uniformity in all Datasets, i.e., the highest median value of 0.5 in POI-3.0 while lowest in Eclipse-2.1, 3.0 Datasets. Therefore, it can be deduced that the service provider role of packages in Eclipse Datasets is minimum.

4. Results

In this section, we explain our experimental evaluation and data analysis in the context of research questions drawn up for this study. Additionally, results obtained from this experiment are elaborated to find answers for RQs.

4.1. Magnitude of Association with Post-Release Faults

Table 5 shows linear correlation analysis carried out between each of H. Abdeen’s package modularizations with a number of post-release faults in Datasets. Notably, one metric suit may not be appropriate to assert or generalize the thesis of empirical research. There are empirical studies which depict addressing multi-dimensional aspects of package-level metrics. In the prevalent research literature, some of the prominent and highly cited research studies to an extent deal with or consider non-API-based package modularization metrics for theoretical formulation [1,16,41]. In spite of a notable effort by the authors in their work, the emphasis is on improving the package structure using dependency-based metrics and none of the studies mentioned formulated the composite suite of package-level metrics. Moreover, each metric suite is often proposed with specific context and design framework, e.g., API-based, remodularization-based, etc. To further illustrate, cohesion and coupling metrics proposed in [1] are relevant to the re-modularization of the package while maintaining design decisions of software systems. Similarly, Chabra et al devised metrics suite that define their theme as the structural and organizational aspects of packages rather than the non-API-based service provider role of packages being a component of source code [16,41]. Therefore, we incorporated the earlier work of H. Abdeen et al. in our experimental study [5]. In this study, H. Abdeen’s major focus is to determine how coupling and cohesion metrics created in a package as a result of cyclic dependencies help in optimization and source code modularization. In discrete understanding, these dependencies are analyzed as well-identified services (non-API-based) of packages to other components of services. Eventually, such a modularization practice will not affect prevalent software design in an adverse manner. The rationale behind this addition is to frame the context as both metric suites which deal with non-API dependencies of software module or packages. Following are four metrics proposed in the mentioned study:

CohesionQ(p): calculates the international dependencies ratio to overall dependencies among and within a package of a software system.
CouplingQ(p): calculates a number of package provider and client ratios to overall dependencies among and within a package of a software system.
CylicDQ(p): calculates class cyclic dependencies ratio to package dependencies on all other packages of a software system.
CylicCQ(p): calculates the package cyclic connections ratio to package dependencies on all other packages of a software system.

It is pertinent to find the magnitude of association with a number of faults for RQ1. This will help to develop the perspective of to what extent H. Abdeen modularity metrics capture unique dimensions of fault-proneness against traditional metrics suites.

In statistical terms, values listed in Table 5 manifest a correlation analysis of modularization metrics with post-release faults evaluated at significant levels: p-value ≤ 0.05, p-value ≤ 0.1, p-value ≤ 0.2 denoted as (***, **, *), respectively. As a matter of general concept, an analysis is made on the basis of the following theoretical rationale:

A statistically significant correlation of any metric (IPCI, IIPUD, IIPED, PF, and IPSC) with faults is evidence that H. Abdeen’s metrics bear an adverse or adequate effect on the occurrence of software faults.
A statistically significant negative correlation of any metric indicates that packages are less faulty with strong cohesive values or become more faulty with weak cohesive values.
A statistically significant positive correlation with any fault reflects that software system requires code re-factoring according to the design theme of correlating metric.
The absence of statistical significance while correlating metrics with faults should be further investigated using predictive models to conclude if they depict variable quality perspectives of source code.
In addition to a description of H. Abdeen’s package-modularization metrics, an assessment of Martin metrics (N, A, Ca, Ce, I, D) and cyclic metrics is carried out in a similar manner.

Following are prominent inferences from Table 5: First, none of CohesionQ(p), CohesionQ(p), CohesionQ(p), or CohesionQ(p) are significant enough to show an association with post-release faults of Datasets. Therefore, metrics-related package cyclic connections cannot be influential independent variables in this study. Amicably, their inclusion may rather increase a confounding effect during predictive analysis. Secondly, IIPUD and IPCI show a significant negative correlation with post-release faults in almost all the Datasets. Hence, it can be presumed that a weak impact of change in inter-package dependencies and brittle package association without inheritance relationships in software systems could lead to the occurrence of faults, thereby affecting the quality of software. Third, the frequency of statistical significance for H. Abdeen and Martin metrics were found to be considerably high in Datasets having a large number of packages, e.g., Eclipse-2.1 and Eclipse-3.0 analyzed at a p-Value of 0.05 and 0.1, respectively. On the other hand statistical significance was found to be high in Datasets having relatively small numbers of packages, e.g., Lucene-2.4 and JDTCore-3.4 at a p-Value of 0.2.

This leads to the inference that regardless of the size of packages, their design not complying to the minimized dependencies among packages, incompatible sizes in packages, and packages not playing the role of consolidated service providers can be vulnerable to faults, making maintenance work quite tough. Fourth, despite a higher percentage of faulty packages, Camel-1.6 has not exhibited a statistically significant association with most of the modularization metrics. The possible reason for such an exception in the case of Camel-1.6 could be a structured source code architecture following a design mechanism based on IPCI, IIPUD, IIPED, PF, and IPSC modularization metrics. Fifth, IIPED and IPSC are witnessed to possess a negative and significant correlation of faults in almost all Datasets excluding POI-2.5 and POI-3.0. On the contrary, traditional metrics (N, Ca, Ce) have shown a rather positive correlation with faults in most cases.

All these observations are reflections of the fact that H. Abdeen’s proposed non-API package-level metrics are considerably correlated with post-release faults. Interestingly, it is perceptible in all these findings that package-modularization metrics are capable of capturing different aspects of fault association with certain constraints, i.e., the sizes of Datasets, number of faulty packages, and source code structures. Accordingly, their application and utilization in a large legacy object-oriented system may exhibit possible re-factoring opportunities for quality assurance.

4.2. Effort-Aware Classification Performance Indicators

Table 6 summarizes prediction accuracy values using described ER metrics from an experimental analysis of 10-time 10-fold cross-validation. It mainly depicts comparative analysis among T, T + H. Abdeen, and H. Abdeen + T prediction models using mean and standard deviation of ER-BPP and ER-MFM values. It is evident from Table 6 that T + H. Abdeen outperforms T model in five Datasets with a maximum accuracy mean of 0.9 in Camel-1.6. Generally, win score in ER-MFM becomes 5 while win score of 4 (shown with ✓) and loss (shown with ×) score of 3 is witnessed using ER-BPP metric. Following are specific implications of classification outcomes: First, in terms of ER-BPP, T + H. Abdeen seems to be outperforming the T model in a few larger Datasets, e.g., Eclipse-2.1 (improving by 3.9%), Eclipse-3.0 (improving by 1.5%), and POI-3.0 (presenting outstanding improvement of 6.25%). On the other hand, accuracy seems to be lagging in a few Datasets with a small number of packages comparatively, e.g., JDTCore-3.4 (declining by −8.7%) and Lucene-2.4 (declining by −3.1%). Second, prediction modeling with T + H. Abdeen using the ER-MFM method has produced better results than T in most Datasets. However, the T model has remained quite competitive in case of POI-2.5 (deteriorating by −1.6%). Third, Table 6 exhibits the T + H. Abdeen model not overcoming T in case of Lucene-2.4, reflecting that Datasets can be subject to further investigation in the future under new ER-based metrics. This allows us to assess the following key aspects of effort-aware classification:

It can be elicited from the results of Table 6 that effort reduction is considerably better while using T + H.Abdeen model over the T model when evaluation is carried out under ER-BPP and ER-MFM. From a software developer’s perspective, code inspection shall incur minimum effort using H. Abdeen’s metrics.
Prediction threshold, preferably using ER-BPP and ER-MFM, is set objectively to maximize accuracy; however, its success may not be guaranteed for all cases. This could be due to the varying architectural properties of software systems. Nonetheless, it can be well reckoned that fault-proneness prediction is substantially improved using a combination of traditional metrics and H. Abdeen’s package-modularization metrics.
H. Abdeen’s package-level metrics in combination with traditional Robert Martin’s package-level metrics provide more explanatory variations in fault data than using Martin’s package-level metrics only as a baseline. This affirms the utility, effectiveness, and application of H. Abdeen’s package-level metrics during a coding phase toward improving the quality of software systems.

The scope of our analysis and empirical evaluation is exclusively based on the comparison between the T model and the T + H. Abdeen model. However, in order to explore more research directions, we added the H. Abdeen model to determine their performance as a standalone metrics suite. We can figure out certain unique aspects of this analysis: (1) H. Abdeen as a standalone model showed marginal differences in prediction results in a few cases, i.e., Eclipse-2.1 and Eclipse-3.0, while still straggling behind the T + H. Abdeen model. (2) Interestingly, H. Abdeen is witnessed to have performed worse in the case of Camel-1.6 than T + H. Abdeen model and T model, evaluated using ER-BPP and ER-MFM metrics while JDTCore-3.4 followed the same trend under ER-MFM.

(3) It is worth noticing that H. Abdeen’s performance is seen to have a marginal difference against T and T + H. Abdeen, adding to the assertion that H. Abdeen is potentially capable of capturing unique software-quality dimensions when combined with other metrics. Having described this, Robert Martin’s metrics (N, Ca, Ce, D, A, I) persist to be basic quality indicators when evaluating predictive models at the granularity level of packages. This is because of the fact that Martin’s metrics extract the basic quantitative information of software architecture, e.g., number of classes, abstractness, instability, coupling, and cohesion. Hence, their exclusion in any of the empirical studies related to fault prediction may cause compatibility concerns. Therefore, fault-proneness prediction using T + H. Abdeen metrics is a feasible approach of combining the Robert metrics suite with additional or complementary quality indicators for the quantification mechanism of legacy software systems.

4.3. Violin Plot Visualization

Figure 6 shows a violin plot depicting the distribution of prediction values for three different models (T, H. Abdeen, and T + H. Abdeen ) applied to six software systems (Camel-1.6, Eclipse-2.1, Eclipse-3.0, JDTCore-3.4, POI-2.5, and POI-3.0). The X-axis represents software systems and Y-Axis shows effort-aware classification prediction on the basis of ER-MFM. T + H. Abdeen has the highest median for Eclipse-2.1, Camel-1.6, and JDT-Core-3.4, and the lowest distribution suggesting more consistent prediction. However, POI-2.5 has shown wider distribution and lower median values indicating under-performing against the T and H. Abdeen models.

Figure 7 shows a violin plot depicting the distribution of prediction values for three different models (T, H. Abdeen, and T + H. Abdeen) applied to six software systems (Camel-1.6, Eclipse-2.1, Eclipse-3.0, JDTCore-3.4, POI-2.5, and POI-3.0). The X-axis represents software systems and the Y-axis shows effort-aware classification prediction values on the basis of ER-BPP. T + H. Abdeen has the highest median for Eclipse-2.1, Eclipse-3.0, and POI-3.0, and the lowest distribution suggesting more consistent prediction across all the systems.

In summary, there have been fewer inconsistencies of violin plot distribution observed in ER-BPP classification, whereas ER-MFM has the exception of wider distribution in the case of POI-3.0 and JDTCore-3.4. Nevertheless, these systems were still outperforming the T and H.Abeen models. As far as comparative analysis is concerned, our composed Datasets are far better than the Eclipse and JDTCore Datasets used in some past studies. A few such examples are bug prediction-related studies utilizing the same Datasets by Rathor et al. and Babch et al. Their study conducted conventional experiment that produced prediction accuracy of less than 70% without the application of effort-aware ranking and classification [48,49]. This could be the factor that package-level and class-level metrics are exclusive quantifiers and the comparison could be biased statistically. Having said that, recent advancements foster the urge to test software systems at an architectural level, thereby considering package-level metrics as suitable indicators.

4.4. Effort-Aware Ranking Performance Indicators

Graphically, Figure 8 reveals preliminary comparison among different models that select the percentage of faulty packages in increasing and decreasing the order of corresponding LOC, respectively. This analysis is required (i) to understand the capability of different package-level prediction models to achieve cost-effectiveness; (ii) to analyze the optimum level cost effectiveness of H. Abdeen’s package-modularization metrics in ranking the packages. As can be seen from Figure 8, in all the Datasets, the effort-aware fault prediction of an ideal model is a benchmark and has the highest cost-effective ranking measure.

Noticeably, the T + H. Abdeen model has reached relatively close to the ideal model in a few cases, e.g., Eclipse-2.1, Eclipse-3.0, and JDTCore-3.4, validating the efficacy of package-level modularization metrics.

As experimental evidence, the graphical analysis reveals that T + H. Abdeen either outperformed or obtained a marginal edge over the T (baseline) model and random model, which is depicted in cases like POI-3.0, Camel-1.6, Lucene-2.4, and Eclipse-2.1. Graphical analysis also suggests that effort-aware ranking solution provided by T + H. Abdeen significantly dominates traditional models. However, T + H. Abdeen has not been able to overcome the T model in a few cases like Eclipse-2.1 and POI-3.0, as revealed by a slight decline of the T + H. Abdeen model; employing that ranking possibly underperformed in Datasets with fewer packages or fragile design of faulty packages. These findings are also compliant with results obtained and analyzed in Table 5 and Table 6, which substantiated that the H. Abdeen package-modularization metrics have complementary traits against conventional approaches and are substantially cost-effective.

4.5. Implications of Results and Findings

In this section, we outline the implications of results acquired from effort-aware ranking and effort-aware classification prediction models on software design and maintenance prediction. Table 6 and Figure 8 evidently depict that the combination of H. Abdeen et al. and the Robert Martin package-level metrics produce enhanced fault-prediction results. These results entail several significant implications for software design and maintenance practices:

Efficient Resource Allocation: Software development teams can make efficient allocation of resources with the improved accuracy of packages identified as faulty. Efforts become targeted towards more vulnerable areas of the code base making maintenance and inspection activities prioritized.

Code Quality Check: Applying package modularization techniques suggested by H. Abdeen provides holistic views that help in tackling architectural weaknesses. As a result, developers can continuously check on quality, design deterioration, and structural anomalies of software systems. Developers can leverage these results to execute proactive design improvements.

Tailored Maintenance Activities: Experimental methodology of effort-aware ranking and effort-aware classification allows setting the prediction threshold to maximize the accuracy. However, these results do not generalize our research, and therefore, variations allow the context-sensitive application of metrics. Software developers can be peculiar enough to customize their maintenance activities to be better aligned with prediction results.

Finding of this research can be extended beyond the mere fault-prediction domain, as follows:

To guide design phase with the design of modular packages with lesser vulnerability of faults.
Facilitating early detection of design flaws in code base during implementation and testing phase.
Help avoiding degradation of overall quality and offer convenient re-factoring and re-engineering during maintenance phase.

5. Threats to Validity

Threats to construct validity mainly arise due to the unconsolidated relationship between theory and experiment. In particular, the performance measurement methodology (e.g., ER-BPP, ER-MFM, LOC) is unconventional, however statistically reliable and recognized in the recent relevant research literature. Another threat to construct validity could be the use of metrics and the defect Dataset, which is addressed by using heterogeneous repositories, and metrics were computed and incrementally tested by co-authors.

Threats to internal validity are expected out of undue factors that influence our results and graphical analysis. To mitigate such ambiguity, a 10-time 10-fold cross-validation logistic regression (LR) algorithm is used to build a model for achieving unbiased prediction results. Sometimes, repeating the LR approach up to 30 times is also recommended.

Threats to external validity become of concern when findings are generalized with limited experimental settings. We have tried to experiment with open-source software systems with diverse functional and operational features encompassing reasonable size, extensive package-view architecture, and usage in earlier fault-prediction studies. However, experimenting with seven Datasets may not be sufficient, but still worthwhile to build a rationale for our study.

6. Discussion

Implications of results and their impact on existing research:

It is evident that the integration of H. Abdeen’s modularization metrics with traditional metrics marks a significant improvement in fault-proneness prediction of legacy object-oriented software systems. Specifically, results indicate that combining these metrics promises more accuracy and effort-efficient fault-prediction models, as demonstrated in experimental results of Eclipse-2.1, Camel-1.6, and POI-2.5.

These findings lead to important implications for prevalent research work on fault prediction. Results highlight the limitations of traditional class-level metrics and underscore the need to upgrade the code artifacts at the package level to capture structural and architectural dimensions required for fault prediction. This approach allows researchers and practitioners to bridge the gap between method level, class level, and package level for comprehending a holistic picture of fault prediction.

However, we need to understand potential biases, our study poses the following: 1. The Datasets used are limited to open-source Java software systems that may not help us assess and generalize the conclusion as other systems built in other programming language is not domain of experimental work. 2. Modern software architecture is arguably API-centric, and whereas our study is inclined towards non-API-based software systems, the study domain may not have covered these differential dynamics completely. Future research shall consider the comparative analysis of experiments conducted on API and non-API-based software systems, adding validity and extension.

Practical Applications and Recommendations:

Software developers and practitioners can leverage theoretical framework and the extensive experimental work of this study to identify the fault-prone packages and prioritize their resource allocation. Especially, software-development teams can carry out code inspection and maintenance efforts efficiently to enhance modularization and overall software code quality. Following are key recommendations for developers:

Integration of automated plug-ins for package-level modularization metrics with static analysis and fault predictive indication into Integrated Development Environment (IDEs).
Set up maintenance work for high-risk faulty packages identified on a priority basis making the code-review process productive.
Devise the guidelines for package modularization for improved software design and promising architectural decisions.

Unique consideration of study:

As discussed earlier, potential improvements of maintainability and reliability add practical value to this research work. Our study offers a novel approach to fault process prediction that serves as future directions for enhancing software quality practices:

Assessment of H. Abdeen’s package-modularization metrics for systems developed in different programming languages and software paradigms to determine generalizability.
Examining the package-level metrics with a diverse set of software quality attributes that include testability, technical debt, and effort-based cost.
Longitudinal experimental effort to empirically evaluate the impact of these metrics on the process of software evolution.

In addition, the uniqueness and innovation of our study stem from its validation of package-level modularization metrics to improve the fault-prediction models. This asks for the future exploration of this research line and contributes towards software engineering.

7. Conclusions

In this paper, we empirically presented the formulation of fault prediction using non-API-based package-modularization metrics. Our theoretical framework, experimental work, and results depict significant aspects for software engineers to practically renovate areas like fault identification, resource allocation, and modular software design. Following is a breakdown of our key findings.

1. Improved fault-prediction capability: The results obtained exhibit that package-modularization metrics are better predictors of faults. In particular, effort-aware ranking and classification scenarios were estimated to conclude that package-modularization metrics turned out to be better predictors of faults when used in combination with traditional package-level metrics. As realized from Table 6, Eclipse-2.1, Camel-1.6, and POI-2.5 obtained improved fault-prediction results by 3.3%, 4.65%, and 6.25%, respectively. It can be convincingly conjectured that H. Abdeen’s proposed package-modularization metrics can portray unique and complementary view fault prediction, thereby ensuring error-free source code.

2. Effort-Aware Classification and Ranking: Effort-aware prediction models provide a comprehensive assessment with an improved performance in different kinds of software systems with different threshold metrics. It was evident in the case of ER-BPP that Eclipse-2.1 achieved 3.9% of the prediction improvement and 3.3% in the case of ER-MFM, JDTCore-3.4 had a prediction success of 4%.

3. Software Design Dependency Management:

The dependency management of package-based OO design is inevitable, precisely in the case of non-API-based legacy software systems.

In evaluation, we have shown that prediction accuracy in ranking and classification remains reasonably better than traditional models; Performance in the effort-aware ranking of H.Abdeen-proposed metrics to predict fault is seen to improved shown as green curve, almost matching the ideal curve (red) for the POI-2.5, Lucene-2.4, and JDTCore-3.4 software systems. Note that the percentage of faults was plotted against 1000 lines of source code while ranking packages. The mechanism of effort-aware prediction allows developers and software practitioners to prioritize maintenance process activities with a key focus on high-risk identified code bases. In this way, software project managers can optimize their resource allocation to resolve critical issues promptly, achieving better client/customer satisfaction.

It is important to underline that our results are well aligned with a theoretical foundation and experimental framework that is aimed at improving dependency management during the design phase. In addition to experimental significance, the proposed package-modularization metrics are witnessed as effective design concepts for re-modularization. This also consolidates their theoretical standpoint to guide the re-factoring process without affecting prevalent design decisions. Notably, maintenance, technical debt, reliability, and re-engineering can be significantly enhanced by the effective dependency management of package modularization.

In the future, we aim to extend our current study in comparison to other existing package-modularization metrics and present empirical evaluation over other software quality attributes, like maintainability index (MI), testability (TLOC), FindBugs warnings, and PMD source code violation rule warnings. We look forward to exploring the extension of the current study and including comparative analysis with other composite package metrics suites. This can pave the way for an in-depth understanding of non-API package-modularization metrics and effectiveness. This study would help to convey that package modularization bears an important influence on the sustainability and life cycles of software projects.

Our emphasis on practical implications and outlining clear future directions is expected to add value to best practices of software development and create roadmaps for code metric oriented research.

Author Contributions

Conceptualization, M.S. and I.T.; formal analysis, M.S. and I.T.; funding acquisition, J.K. and Y.J.; investigation, M.S., I.T., M.S. and I.T.; methodology, M.S. and I.T.; project administration, M.S. and I.T.; resources, M.S. and I.T.; software, M.S. and I.T.; validation, M.S. and I.T.; visualization, M.S. and I.T.; writing—original draft, M.S., I.T. and J.K.; writing—review and editing, M.S., I.T., J.K. and Y.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Korea Health Technology Research and Development Project through the Korea Health Industry Development Institute under Grant No. HI22C1651 and the Culture, Sports, and Tourism Research and Development Program through the Korea Creative Content Agency Grant funded by the Ministry of Culture, Sports and Tourism of Korea under Grant No. RS-2023-00227648.

Data Availability Statement

We have already made our experimental Datasets and scripts available at GitHub repository (https://github.com/Analyzer2210cau/Effort-Aware-Package-Metrics, accessed on 20 March 2024) for replication purpose.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Candela, I.; Bavota, G.; Russo, B.; Oliveto, R. Using cohesion and coupling for software remodularization: Is it enough? ACM Trans. Softw. Eng. Methodol. (TOSEM) 2016, 25, 24. [Google Scholar] [CrossRef]
Shaikh, M.; Lee, C.G. Aspect Oriented Re-engineering of Legacy Software Using Cross-Cutting Concern Characterization and Significant Code Smells Detection. Int. J. Softw. Eng. Knowl. Eng. 2016, 26, 513–536. [Google Scholar] [CrossRef]
Alsolai, H.; Roper, M. A systematic literature review of machine learning techniques for software maintainability prediction. Inf. Softw. Technol. 2020, 119, 106214. [Google Scholar] [CrossRef]
Sarkar, S.; Rama, G.M.; Kak, A.C. API-based and information-theoretic metrics for measuring the quality of software modularization. Softw. Eng. IEEE Trans. 2007, 33, 14–32. [Google Scholar] [CrossRef]
Abdeen, H.; Ducasse, S.; Sahraoui, H.; Alloui, I. Automatic package coupling and cycle minimization. In Proceedings of the 2009 16th Working Conference on Reverse Engineering, Lille, France, 13–16 October 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 103–112. [Google Scholar]
Pandey, S.K.; Mishra, R.B.; Tripathi, A.K. Machine learning based methods for software fault prediction: A survey. Expert Syst. Appl. 2021, 172, 114595. [Google Scholar] [CrossRef]
Shaikh, M.; Ibarhimov, D.; Zardari, B. Assessing Architectural Sustainability during Software Evolution using Package-Modularization Metrics. Int. J. Adv. Comput. Sci. Appl. 2019, 10. [Google Scholar] [CrossRef]
Şerban, C.; Shaikh, M. Software reliability prediction using package level modularization metrics. Procedia Comput. Sci. 2020, 176, 908–917. [Google Scholar] [CrossRef]
Yang, C.; Xu, Z.; Chen, H.; Liu, Y.; Gong, X.; Liu, B. ModX: Binary level partially imported third-party library detection via program modularization and semantic matching. In Proceedings of the 44th International Conference on Software Engineering, New York, NY, USA, 21–29 May 2022; pp. 1393–1405. [Google Scholar]
Sinha, K.; Han, S.Y.; Suh, E.S. Design structure matrix-based modularization approach for complex systems with multiple design constraints. Syst. Eng. 2020, 23, 211–220. [Google Scholar] [CrossRef]
Prajapati, A. Software package restructuring with improved search-based optimization and objective functions. Arab. J. Sci. Eng. 2021, 46, 9023–9043. [Google Scholar] [CrossRef]
Sharma, D.; Sharma, G. Systematic Literature Review of Search-Based Software Engineering Techniques for Code Modularization/Remodularization. In Computational Intelligence Applications for Software Engineering Problems; Apple Academic Press: Williston, VT, USA, 2023; pp. 241–266. [Google Scholar]
Agnihotri, M.; Chug, A. A systematic literature survey of software metrics, code smells and refactoring techniques. J. Inf. Process. Syst. 2020, 16, 915–934. [Google Scholar]
Baqais, A.A.B.; Alshayeb, M. Automatic software refactoring: A systematic literature review. Softw. Qual. J. 2020, 28, 459–502. [Google Scholar] [CrossRef]
Rathee, A.; Chhabra, J.K. Clustering for Software Remodularization by Using Structural, Conceptual and Evolutionary Features. J. Univers. Comput. Sci. 2018, 24, 1731–1757. [Google Scholar]
Chhabra, J.K. Improving modular structure of software system using structural and lexical dependency. Inf. Softw. Technol. 2017, 82, 96–120. [Google Scholar]
Paixao, M.; Harman, M.; Zhang, Y.; Yu, Y. An empirical study of cohesion and coupling: Balancing optimization and disruption. IEEE Trans. Evol. Comput. 2017, 22, 394–414. [Google Scholar] [CrossRef]
Tunyasuvunakool, S.; Muldal, A.; Doron, Y.; Liu, S.; Bohez, S.; Merel, J.; Erez, T.; Lillicrap, T.; Heess, N.; Tassa, Y. dm_control: Software and tasks for continuous control. Softw. Impacts 2020, 6, 100022. [Google Scholar] [CrossRef]
Martin, R.C. Design principles and design patterns. Object Mentor 2000, 1, 1–34. [Google Scholar]
Melo, W. Evaluating the impact of object-oriented design on software quality. In Proceedings of the 3rd International Software Metrics Symposium, Berlin, Germany, 25–26 March 1996; IEEE: Piscataway, NJ, USA, 1996; pp. 90–99. [Google Scholar]
Harrison, R.; Counsell, S.J.; Nithi, R.V. An evaluation of the MOOD set of object-oriented software metrics. Softw. Eng. IEEE Trans. 1998, 24, 491–496. [Google Scholar] [CrossRef]
Sievi-Korte, O.; Beecham, S.; Richardson, I. Challenges and recommended practices for software architecting in global software development. Inf. Softw. Technol. 2019, 106, 234–253. [Google Scholar] [CrossRef]
Papamichail, M.D.; Diamantopoulos, T.; Symeonidis, A.L. Measuring the reusability of software components using static analysis metrics and reuse rate information. J. Syst. Softw. 2019, 158, 110423. [Google Scholar] [CrossRef]
Elish, M.O.; Al-Yafei, A.H.; Al-Mulhem, M. Empirical comparison of three metrics suites for fault prediction in packages of object-oriented systems: A case study of Eclipse. Adv. Eng. Softw. 2011, 42, 852–859. [Google Scholar] [CrossRef]
Sarkar, S.; Kak, A.C.; Rama, G.M. Metrics for measuring the quality of modularization of large-scale object-oriented software. Softw. Eng. IEEE Trans. 2008, 34, 700–720. [Google Scholar] [CrossRef]
Zhao, Y.; Yang, Y.; Lu, H.; Zhou, Y.; Song, Q.; Xu, B. An empirical analysis of package-modularization metrics: Implications for software fault-proneness. Inf. Softw. Technol. 2015, 57, 186–203. [Google Scholar] [CrossRef]
Abdeen, H.; Ducasse, S.; Sahraoui, H. Modularization metrics: Assessing package organization in legacy large object-oriented software. In Proceedings of the 2011 18th Working Conference on Reverse Engineering, Limerick, Ireland, 17–20 October 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 394–398. [Google Scholar]
Shaikh, M.; Lee, K.S.; Lee, C.G. Assessing the Bug-Prediction with Re-Usability Based Package Organization for Object Oriented Software Systems. IEICE Trans. Inf. Syst. 2017, 100, 107–117. [Google Scholar] [CrossRef]
Zhao, Y.; Yang, Y.; Lu, H.; Liu, J.; Leung, H.; Wu, Y.; Zhou, Y.; Xu, B. Understanding the value of considering client usage context in package cohesion for fault-proneness prediction. Autom. Softw. Eng. 2017, 24, 393–453. [Google Scholar] [CrossRef]
Zimmermann, T.; Nagappan, N.; Gall, H.; Giger, E.; Murphy, B. Cross-project defect prediction: A large scale experiment on data vs. In domain vs. process. In Proceedings of the the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering, Amsterdam, The Netherlands, 24–28 August 2009; pp. 91–100. [Google Scholar]
Huo, X.; Li, M. On cost-effective software defect prediction: Classification or ranking? Neurocomputing 2019, 363, 339–350. [Google Scholar] [CrossRef]
Ni, C.; Xia, X.; Lo, D.; Yang, X.; Hassan, A.E. Just-in-time defect prediction on javascript projects: A replication study. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2022, 31, 1–38. [Google Scholar] [CrossRef]
Al Dallal, J. Fault prediction and the discriminative powers of connectivity-based object-oriented class cohesion metrics. Inf. Softw. Technol. 2012, 54, 396–416. [Google Scholar] [CrossRef]
Zheng, W.; Xun, Y.; Wu, X.; Deng, Z.; Chen, X.; Sui, Y. A comparative study of class rebalancing methods for security bug report classification. IEEE Trans. Reliab. 2021, 70, 1658–1670. [Google Scholar] [CrossRef]
Chidamber, S.R.; Kemerer, C.F. A metrics suite for object oriented design. Softw. Eng. IEEE Trans. 1994, 20, 476–493. [Google Scholar] [CrossRef]
Mende, T.; Koschke, R. Effort-aware defect prediction models. In Proceedings of the 2010 14th European Conference on Software Maintenance and Reengineering, Madrid, Spain, 15–18 March 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 107–116. [Google Scholar]
Bennin, K.E.; Toda, K.; Kamei, Y.; Keung, J.; Monden, A.; Ubayashi, N. Empirical evaluation of cross-release effort-aware defect prediction models. In Proceedings of the 2016 IEEE International Conference on Software Quality, Reliability and Security (QRS), Vienna, Austria, 1–3 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 214–221. [Google Scholar]
Yang, Y.; Zhou, Y.; Liu, J.; Zhao, Y.; Lu, H.; Xu, L.; Xu, B.; Leung, H. Effort-aware just-in-time defect prediction: Simple unsupervised models could be better than supervised models. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, Seattle, WA, USA, 13–18 November 2016; pp. 157–168. [Google Scholar]
Huang, Q.; Xia, X.; Lo, D. Supervised vs. unsupervised models: A holistic look at effort-aware just-in-time defect prediction. In Proceedings of the 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Shanghai, China, 17–22 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 159–170. [Google Scholar]
Varghese, B.G.; Raimond, K.; Lovesum, J. A Novel Approach for Automatic Remodularization of Software Systems using Extended Ant Colony Optimization Algorithm. Inf. Softw. Technol. 2019, 114, 107–120. [Google Scholar] [CrossRef]
Abdeen, H.; Sahraoui, H.; Shata, O.; Anquetil, N.; Ducasse, S. Towards automatically improving package structure while respecting original design decisions. In Proceedings of the 2013 20th Working Conference on Reverse Engineering (WCRE), Koblenz, Germany, 14–17 October 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 212–221. [Google Scholar]
Shaikh, M.; Jalbani, A.H.; Ansari, A.; Ali, A.; Memon, K. Evaluating Dependency based Package-level Metrics for Multi-objective Maintenance Tasks. Int. J. Adv. Comput. Sci. Appl. 2017, 8, 345–354. [Google Scholar] [CrossRef]
Kamei, Y.; Matsumoto, S.; Monden, A.; Matsumoto, K.i.; Adams, B.; Hassan, A.E. Revisiting common bug prediction findings using effort-aware models. In Proceedings of the 2010 IEEE International Conference on Software Maintenance, Timisoara, Romania, 12–18 September 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–10. [Google Scholar]
Elish, M.O. Exploring the relationships between design metrics and package understandability: A case study. In Proceedings of the 2010 IEEE 18th International Conference on Program Comprehension, Braga, Portugal, 30 June–2 July 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 144–147. [Google Scholar]
Kamei, Y.; Shihab, E.; Adams, B.; Hassan, A.E.; Mockus, A.; Sinha, A.; Ubayashi, N. A large-scale empirical study of just-in-time quality assurance. IEEE Trans. Softw. Eng. 2013, 39, 757–773. [Google Scholar] [CrossRef]
Zhou, Y.; Xu, B.; Leung, H.; Chen, L. An in-depth study of the potentially confounding effect of class size in fault prediction. ACM Trans. Softw. Eng. Methodol. (TOSEM) 2014, 23, 10. [Google Scholar] [CrossRef]
Arisholm, E.; Briand, L.C.; Johannessen, E.B. A systematic and comprehensive investigation of methods to build and evaluate fault prediction models. J. Syst. Softw. 2010, 83, 2–17. [Google Scholar] [CrossRef]
Rathore, S.S.; Kumar, S. Software fault prediction based on the dynamic selection of learning technique: Findings from the eclipse project study. Appl. Intell. 2021, 51, 8945–8960. [Google Scholar] [CrossRef]
Babich, D.; Clarke, P.J.; Power, J.F.; Kibria, B.G. Using a class abstraction technique to predict faults in OO classes: A case study through six releases of the eclipse JDT. In Proceedings of the 2011 ACM Symposium on Applied Computing, TaiChung, Taiwan, 21–24 March 2011; pp. 1419–1424. [Google Scholar]

Figure 1. Illustrating package design.

Figure 2. Data processing methodology.

Figure 3. Evaluation mechanism.

Figure 4. SLOC–Alberg diagram.

Figure 5. Box-plot representation of metrics for all Datasets.

Figure 6. Effort-aware violin plots at ER-MFM criteria.

Figure 7. Effort-aware violin plots at ER-BPP criteria.

Figure 8. Comparison with trivial models when optimizing inspection costs and number of defects.

Table 1. Description of investigated package-modularization metrics.

Metric	Definition
Index of Package Changing Impact	$I P C I (p) = 1 - \frac{C l i e n t s (p)}{1 - \|P\|}$
Index of Package Communication Diversion	$P U F (p) = \frac{1}{Uses (p)}$
	$I I P U D (p) = \{\begin{matrix} P U F (p) \times (1 - \frac{1 - \|U s e s (p)\|}{\|U s e s_{c} (p)\|}) \\ 1 : \|U s e s_{c} (p)\| = 0 \end{matrix}$
Index of Inter-Package Extend Diversion	$P E F (p) = \frac{1}{Ext (p)}$
	$I I P E D (p) = \{\begin{matrix} P E F (p) \times (1 - \frac{1 - \|E x t (p)\|}{\|E x t_{c} (p)\|}) \\ 1 : \|E x t_{c} (p)\| = 0 \end{matrix}$
Index of Package-Goal Focus	$I n I n t (p, q) = \{c \in I n I n t (p) \| \exists x \in q : R e q (c, x)\}$
	$R o l e (p, q) = \frac{InInt (p, q)}{I n I n t (p)}$
	$P F (p) = \frac{\sum_{p i \in C l i e n t s (p)} R o l e (p, p i)}{C l i e n t s (p)}$
Index of Package-Service Cohesion	$C S (p, q) = \{I n t \in I n I n t (p) \| i n t \in p r o v i d e r_{c} (q)\}$
	$S P_{k} (p, q) = \{\begin{matrix} \frac{\|λ_{q, k}\|}{C S (p, q)} \\ 1 E l s e \end{matrix}$
	$C S_{c o h e s i o n} (p, q) = \frac{\sum_{K i \in C l i e n t s_{p} (p)} S P_{k} (p, q)}{\|C l i e n t s_{p} (p)\|}$
	$I P S C (p) = \frac{\sum_{q i \in C l i e n t s_{p} (p)} C S_{c o h e s i o n} (p, q)}{\|C l i e n t s_{p} (p)\|}$

Table 2. Notations used in Table 1.

Notations	Definition
p and q	Two different packages in a set of packages $P$ .
c and x	Two different classes.
$C l i e n t s_{p} (p)$	Set of all packages that depend on p.
$U s e s (p)$ and $U s e s_{c} (p)$	The number of packages and classes used (without inheritance relationships) by package p, respectively.
$E x t (p)$ and $E x t_{c} (p)$	The number of packages and classes extended (with inheritance relationships) by package p, respectively.
$I n I n t (p, q)$	The incoming use and extend dependencies from package q to package p.
$I n I n t (p)$	The incoming use and extend dependencies towards package p.
$R e q (x, c)$ and $R o l e (p, q)$	Return true when x uses or extends c and determines role p in q.
$P r o v i d e r_{c} (p)$	Set of all classes outside p that c depends on.
$λ_{q, k}$	Denotes set of classes resulting from intersection of composite services: $C S (p, q)$ and $C S (q, k)$ .

Table 3. Description of Martin suite.

Metric	Definition
N	Number of concrete, abstract classes, and interfaces included in the package.
Ca	Afferent Coupling: number of other package entities that depend upon classes within a package.
Ce	Efferent Coupling: number of package entities that depend upon other package classes.
A	Abstractness: ratio of abstract classes in package to total number of classes in a package.
I	Instability: ratio of Efferent Coupling to total coupling (Ce + Ca).
D	Distance: distance from the main sequence: $D = A + I - 1$ .

Table 4. Structural information of subject systems.

System	Versions	Number of Packages	Faults	Major Faulty Packages	Percentage of Faulty Packages	Non-API Methods	KSLOC
Eclipse	2.1	428	657	193	45%	84,834	2780
Eclipse	3.0	642	1523	307	47%	113,798	1296
POI	2.5	74	107	15	20%	8309	93
POI	3.0	80	316	30	38%	12,283	140
Camel	1.6	210	144	23	62%	12,275	98
JDTCore	3.4	46	47	32	47%	6123	278
JDTCore	2.1	38	19	2	10%	4123	231
Lucene	2.4	42	318	34	80%	9233	125
Ant	1.6	62	31	4	9%	3133	96
JEdit	4.2	28	12	1	10%	2122	84
JEdit	4.3	31	11	2	10%	2345	91

Table 5. Magnitude of association with post-release faults in larger Datasets.

Metric	Eclipse-2.1	Eclipse-3.0	POI-2.5	POI-3.0	Camel-1.6	JDTCore-3.4	Lucene-2.4
IIPUD	−0.006	−0.23 ***	−0.25	−0.19	0.12	−0.34 *	−0.34 *
IIPED	−0.35 ***	−0.17 ***	0.0002	0.0003	0.019	−0.30 *	−0.30 *
IPCI	0.002	−0.28 ***	−0.31 *	−0.27 *	0.034	−0.34 *	−0.34 *
PF	0.14 **	0.05	−0.10	−0.13	0.16	−0.26	−0.26
IPSC	−0.20 ***	−0.22 ***	0.21	0.18	0.23 *	0.19	0.12
CohesionQ(p)	0.002	0.007	0.21	0.18	0.0023	0.19	0.12
CouplingQ(p)	0.0039	0.05 *	0.29	0.16	0.0031	0.29	0.18
CyclicDQ(p)	0.039	0.027	0.31	0.38	0.0073	0.19	0.32
CyclicCQ(p)	0.002	0.008	0.61	0.21	0.0033	0.003	0.22
N	0.62 ***	0.54 ***	0.95 ***	0.55 ***	0.72 ***	0.86 ***	0.12
Ca	0.26 **	0.27 ***	0.34 *	0.30 *	0.70 ***	0.58 ***	−0.34 *
Ce	0.15 **	0.24 ***	0.28	0.29 *	0.45 ***	0.32 **	−0.30 *
I	0.0024	0.03	0.11	0.057	0.08	−0.0029	−0.34 *
D	0.02	−0.02	−0.13	−0.11	0.11	−0.042	−0.26
A	−0.005	0.002	0.002	−0.047	0.26 ***	0.087	−0.26

Table 6. Intra-release ER models: comparison.

System	T	H. Abdeen	T + H. Abdeen	% Improvement
(a) ER-MFM
Eclipse-2.1	0.60 ± 0.08	0.619 ± 0.08	0.62 ± 0.07	3.3% ✓
Eclipse-3.0	0.63 ± 0.05	0.60 ± 0.08	0.65 ± 0.047	3.1% ✓
POI-2.5	0.6 ± 0.25	0.60 ± 0.3	0.59 ± 0.13	−1.6% ×
POI-3.0	0.8 ± 0.12	0.60 ± 0.08	0.85 ± 0.14	6.25% ✓
Camel-1.6	0.86 ± 0.05	0.85 ± 0.056	0.9 ± 0.06	4.65% ✓
JDTCore-3.4	0.75 ± 0.17	0.72 ± 0.03	0.78 ± 0.15	4% ✓
Lucene-2.4	0.51 ± 0.11	0.50 ± 0.01	0.51 ± 0.12	–
Win\Tie\Loss	-	-	-	5\1\1
(b) ER-BPP
Eclipse-2.1	0.59 ± 0.1	0.613 ± 0.07	0.61 ± 0.09	3.9% ✓
Eclipse-3.0	0.64 ± 0.037	0.60 ± 0.08	0.65 ± 0.06	1.5% ✓
POI-2.5	0.57 ± 0.24	0.60 ± 0.08	0.58 ± 0.20	1.7% ✓
POI-3.0	0.77 ± 0.1	0.58 ± 0.3	0.81 ± 0.09	5.1% ✓
Camel-1.6	0.81 ± 0.13	0.81 ± 0.041	0.79 ± 0.3	−2.5% ×
JDTCore-3.4	0.57 ± 0.16	0.57 ± 0.067	0.52 ± 0.17	−8.7% ×
Lucene-2.4	0.63 ± 0.11	0.57 ± 0.09	0.61 ± 0.12	−3.1% ×
Win\Tie\Loss	-	-	-	4\0\3

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Shaikh, M.; Tunio, I.; Khan, J.; Jung, Y. Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics. Mathematics 2024, 12, 2201. https://doi.org/10.3390/math12142201

AMA Style

Shaikh M, Tunio I, Khan J, Jung Y. Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics. Mathematics. 2024; 12(14):2201. https://doi.org/10.3390/math12142201

Chicago/Turabian Style

Shaikh, Mohsin, Irfan Tunio, Jawad Khan, and Younhyun Jung. 2024. "Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics" Mathematics 12, no. 14: 2201. https://doi.org/10.3390/math12142201

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Effort-Aware Fault-Proneness Prediction Using Non-API-Based Package-Modularization Metrics

Abstract

1. Introduction

2. Related Work

3. Methodology

3.1. Description of Package-Level Metrics under Investigation

3.2. An Illustrated Example of Package Design

3.3. Description of Robert Martin’s Metrics

The Empirical Study

3.4. Data Processing

3.5. Evaluation Mechanism

3.6. Modeling Technique

3.6.1. Correlation Analysis

3.6.2. Multivariate Logistic Regression

3.6.3. Effort-Aware Ranking

3.6.4. Effort-Aware Classification

3.7. Datasets

4. Results

4.1. Magnitude of Association with Post-Release Faults

4.2. Effort-Aware Classification Performance Indicators

4.3. Violin Plot Visualization

4.4. Effort-Aware Ranking Performance Indicators

4.5. Implications of Results and Findings

5. Threats to Validity

6. Discussion

7. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI