This section presents the primary studies and the findings obtained in response to the research questions. We first provide an overview of the studies and their citations, and then answer each research question along with the relevant discussion and interpretation.
5.1. Description of Primary Studies (PS)
To the best of our knowledge, this is the first SLR inquiry in the realm of SRPM. The 16 PS were evaluated against a variety of selection criteria. Each study has its own identification and reference number, which are listed in Table 3.
Partially related studies were discarded in favor of the 16 articles devoted exclusively to SRPM research. The detection models of the primary studies are described in the following paragraphs, including their methods, findings, and effectiveness:
Kumar and Yadav [18] developed a Bayesian Belief Network (BBN)-based probabilistic software risk estimation model, focused on the most significant software risk indicators, for risk assessment in software development projects. An empirical experiment was carried out to evaluate the model using data obtained from industrial software development projects.
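The network structure and probability tables used by Kumar and Yadav are not reproduced here; the following is a minimal illustrative sketch of BBN-based risk estimation, assuming two hypothetical risk indicators (ScheduleRisk and RequirementRisk) influencing a ProjectRisk node, with illustrative probabilities and posterior inference by direct enumeration in NumPy.

```python
import numpy as np

# Hypothetical two-indicator Bayesian Belief Network:
#   ScheduleRisk -> ProjectRisk <- RequirementRisk
# All probabilities below are illustrative, not taken from the primary study.
p_schedule = np.array([0.7, 0.3])      # P(ScheduleRisk = low, high)
p_requirement = np.array([0.6, 0.4])   # P(RequirementRisk = low, high)

# P(ProjectRisk = high | ScheduleRisk, RequirementRisk), indexed [schedule, requirement]
p_project_high = np.array([[0.05, 0.40],
                           [0.50, 0.90]])

def posterior_schedule_given_project_high():
    """P(ScheduleRisk | ProjectRisk = high) by enumerating the joint distribution."""
    joint = np.zeros(2)
    for s in (0, 1):
        for r in (0, 1):
            joint[s] += p_schedule[s] * p_requirement[r] * p_project_high[s, r]
    return joint / joint.sum()

if __name__ == "__main__":
    # Prior probability that the project is high risk
    prior_high = sum(p_schedule[s] * p_requirement[r] * p_project_high[s, r]
                     for s in (0, 1) for r in (0, 1))
    print("P(ProjectRisk = high) =", round(prior_high, 3))
    print("P(ScheduleRisk | ProjectRisk = high) =",
          np.round(posterior_schedule_given_project_high(), 3))
```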
Hu et al. [39] reviewed related work from the previous two decades and found that all available prediction models assume equal misclassification costs, neglecting the realities of the software project management industry: failing to recognize a project that will fail is far more costly than incorrectly labeling a project with a high possibility of success as a failure, which is the more common mistake. Furthermore, ensemble learning, a well-established technique for improving prediction performance in other areas, had not been substantially examined in the context of software project risk prediction. Their research aimed to fill these gaps by investigating both cost-sensitive analysis and classifier ensemble approaches. Using 327 outsourced software project examples, a t-test comparison of 60 alternative risk prediction models revealed that the optimal model is a homogeneous ensemble of decision trees (DT) using bagging. The findings show that DT not only beat the Support Vector Machine (SVM) in terms of accuracy (i.e., assuming equal misclassification costs) but also outperformed it under cost-sensitive analysis. In brief, this paper proposes the first cost-sensitive and ensemble-based hybrid modeling approach for predicting the risk associated with software development projects, together with a cost-of-misclassification evaluation criterion for assessing software risk prediction models.
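Hu et al.'s exact experimental setup is not available here; the sketch below only illustrates the two ideas the study combines, a homogeneous bagged ensemble of decision trees and a cost-of-misclassification evaluation criterion, using scikit-learn, synthetic stand-in data, and a hypothetical cost matrix.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a software project risk dataset (label 1 = project failure).
X, y = make_classification(n_samples=327, n_features=15, weights=[0.7, 0.3],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=42)

# Homogeneous ensemble of decision trees via bagging.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                          random_state=42).fit(X_train, y_train)

# Hypothetical cost matrix: missing a failing project (FN) is assumed to be
# five times as costly as flagging a healthy project (FP).
COST_FN, COST_FP = 5.0, 1.0
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
print(f"Accuracy: {(tp + tn) / (tp + tn + fp + fn):.3f}")
print(f"Total misclassification cost: {COST_FN * fn + COST_FP * fp:.1f}")
```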
According to Hu et al. [44], there were no empirical models for software project risk assessment and planning. The researchers therefore developed an integrated framework for intelligent software project risk planning (IF-ISPRP) to help reduce project risks and increase predictability. IF-ISPRP consists of two fundamental elements: a risk analysis module and a risk planning module. The risk analysis module predicts project success, and the planning module creates a cost-effective set of risk control activities from the risk analysis results. They suggested a novel MMAKD approach for complex risk planning and applied the framework to decrease project risk in Guangzhou Wireless City, a social media platform. Other social software projects might benefit from the risk-management practices discussed there. They believed that integrating risk analysis and planning would help project stakeholders manage project risks.
As part of their effort to describe and explain the present state of knowledge on this topic, Masso et al. [41] conducted a comprehensive review of the literature on software risk to identify gaps and areas that may need future investigation. The findings of their SLR revealed that the scientific community's emphasis has migrated away from research addressing an integrated risk management process and toward work concentrating on specific activities within this process. Their analysis also showed an evident lack of scientific rigor in the validation procedures of the various studies, as well as a weakness in the use of standards or de facto models to characterize the results of these studies.
In the paper by Hu et al. [42], a new model for risk analysis of software development projects based on Bayesian networks with causality constraints (BNCC) was proposed. The authors showed that, when unrestricted automatic causality learning was applied to 302 collected software project records, the proposed model not only discovered causal relationships consistent with expert knowledge but also outperformed other algorithms such as logistic regression, Naive Bayes, and general BNs in prediction. Their study uses BNCC to establish the first causal discovery framework for assessing the risk causality of software projects, as well as a model for managing software project risk based on this framework.
BenIdris et al. [47] proposed an alternative model for software development project risk analysis based on BNs with causality constraints (BNCC). They showed that, when combined with expert information, the suggested model is not only capable of detecting causal relationships congruent with expert knowledge but also outperforms other algorithms such as logistic regression, Naive Bayes, and generic BNs in terms of prediction performance. As a result of their research, they established the first framework for studying the risk causality of software projects as well as a model for risk management in software projects based on BNCC theory.
Hanci [52] employed machine learning techniques to forecast which group of software projects would be at risk. Using the criteria “development source as count”, “software development life cycle model”, and “project size”, they applied the ID3 and Naive Bayes algorithms to forecast which group would be in danger. They obtained a variety of accuracy ratios by applying the holdout method.
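Hanci's dataset and exact configuration are not reproduced here; the sketch below only illustrates holdout evaluation of the two algorithm families mentioned, using scikit-learn with synthetic stand-in data. Since scikit-learn implements CART rather than ID3, an entropy-based decision tree is used as an ID3-style stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features such as development source, SDLC model, project size.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

# Holdout evaluation: a single train/test split rather than cross-validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    # Entropy criterion as an ID3-style stand-in.
    "ID3-style decision tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "Naive Bayes": GaussianNB(),
}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    print(name, "holdout accuracy:", round(accuracy_score(y_test, clf.predict(X_test)), 3))
```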
Mahfoodh and Obediat [40] designed a new risk estimation technique to assist internal stakeholders in software development in analyzing current software risks by anticipating a quantitative software risk value. To establish the significance of the risk, it was estimated using historical software bug reports and compared against current and forthcoming bug-fix times, duplicated bug records, and software component priority levels. Machine learning was used to determine the risk value on a Mozilla Core dataset (Networking: HTTP software component), and a risk level value for specific software faults was forecasted using the TensorFlow tool. The overall risk was calculated with this approach to be between 27.4% and 84%, with a maximum prediction accuracy of 35%. The researchers observed a strong association between risks derived from bug-fix time estimates and risks derived from duplicated bug reports.
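The Mozilla bug-report features and the exact TensorFlow model used by Mahfoodh and Obediat are not reproduced here; the following is a minimal Keras regression sketch under the assumption that a continuous risk value in [0, 1] is predicted from a few hypothetical numeric bug-report features (normalized bug-fix time, duplicate-report count, component priority).

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
# Hypothetical features per bug report: normalized bug-fix time,
# duplicate-report count, and component priority level.
X = rng.random((500, 3)).astype("float32")
# Hypothetical target: a risk value in [0, 1] loosely driven by the features.
y = np.clip(0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] +
            rng.normal(0, 0.05, 500), 0, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(3,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # risk value bounded to [0, 1]
])
model.compile(optimizer="adam", loss="mae")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

print("Predicted risk value:", float(model.predict(X[:1], verbose=0)[0, 0]))
```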
Cingiz et al. [46] specifically aimed to estimate the effects of project problems that could result in losses in software projects, in terms of their risk factor values, and to rank the risk factors to determine whether they could provide specific information about the effects of individual project problems. To achieve these objectives, five classification algorithms were used to forecast the impact of problems, and two filter-based feature selection methods were used to rank the importance of the risk variables.
Mahdi et al. [50] reviewed the literature on creating and using machine learning algorithms for risk assessment in software development projects. According to the findings of the review, major developments in machine learning methodology, size measures, and study outcomes have all contributed to the growth and advancement of machine learning in project management over the past decade or more. Their research provided a more in-depth understanding of software project risk assessment, as well as a vital framework for future work in this area. Furthermore, they found that machine learning is more successful in reducing project failures than traditional risk assessment methods: it improves the prediction of, and response to, software project risk, thereby offering an additional way to reduce the probability of failure and raise the software development performance ratio.
Shaukat et al. [7] provided a risk dataset comprising the bulk of the risk prediction parameters as well as the software requirements for new software. The collection comprises the vast majority of the requirements derived from the Software Requirement Specification (SRS) documents of numerous open-source projects. The study was split into three primary phases: the collection of data with a risk-oriented focus, the validation of the datasets by IT professionals, and the filtration of the datasets.
Chen et al. [51] devised a method for detecting the hazard of a system based on the software behavior of the system's components. The behavior of untrusted software when it calls other untrusted software is intimately related to system risk: the more untrusted software is called, the greater the risk the system faces, and vice versa. Illegal computer operation is therefore a subset of system risk, and the two are inversely proportional to each other in terms of likelihood. Because the number and scope of untrusted program calls can be accurately monitored while their risk level cannot be observed directly, a quantitative analytical method based on a Hidden Markov Model (HMM) was used to assess the system's risk level; this method guarantees the objectivity and correctness of the results. The article also includes experiments to study and explain the proposed software-behavior-based risk assessment method.
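Chen et al.'s HMM parameters are not given in the summary above; the following NumPy sketch only illustrates the idea of inferring an unobservable risk level from observable untrusted-call behavior via the standard HMM forward algorithm, with illustrative (assumed) probabilities and observation categories.

```python
import numpy as np

# Hidden states: system risk level (0 = low, 1 = high); observations:
# categories of observed untrusted software calls (0 = none, 1 = few, 2 = many).
# All probabilities below are illustrative, not taken from the primary study.
start = np.array([0.8, 0.2])            # initial risk-level distribution
trans = np.array([[0.9, 0.1],           # P(next state | current state)
                  [0.3, 0.7]])
emit = np.array([[0.7, 0.25, 0.05],     # P(observation | state)
                 [0.1, 0.30, 0.60]])

def forward(observations):
    """Return P(risk level | observations so far) via the HMM forward algorithm."""
    alpha = start * emit[:, observations[0]]
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]
    return alpha / alpha.sum()

# Example: a run of increasingly frequent untrusted calls raises the inferred risk.
print("P(high risk) =", round(forward([0, 1, 2, 2])[1], 3))
```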
Xu et al. [45] devised a hybrid learning approach that employs evolutionary algorithms and decision trees to evolve optimal subsets of software metrics for risk prediction during the early phase of the software life cycle. Compared to using all metrics for decision tree risk prediction, the experimental results indicate the feasibility and improved performance of their method.
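Xu et al.'s metric set and evolutionary configuration are not reproduced here; the sketch below illustrates the general technique with a simple genetic search over binary metric-subset masks, scoring each subset with a cross-validated scikit-learn decision tree on synthetic data and using deliberately small, illustrative GA settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Synthetic stand-in: 20 candidate software metrics, a binary risk label.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           random_state=1)

def fitness(mask):
    """Cross-validated decision-tree accuracy on the selected metric subset."""
    if not mask.any():
        return 0.0
    tree = DecisionTreeClassifier(random_state=1)
    return cross_val_score(tree, X[:, mask], y, cv=5).mean()

# Simple genetic algorithm over binary feature masks.
pop = rng.random((20, X.shape[1])) < 0.5          # random initial population
for _ in range(15):                                # generations
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-10:]]        # keep the best half
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, X.shape[1])          # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        child ^= rng.random(X.shape[1]) < 0.05     # bit-flip mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("Selected metrics:", np.flatnonzero(best), "accuracy:", round(fitness(best), 3))
```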
Gouthaman and Sankaranarayanan [43] provided a novel framework for analyzing a dataset gathered through a questionnaire, in which machine learning classifiers were applied and risk assessments were generated for each of the software models identified. Software product managers can use the results to select the most appropriate software model based on the software requirements and the predicted risk probability.
Yu et al. [49] used the correlation coefficient to combine historical data based on concepts such as risk weight, expert trust, and risk consequence, allowing the assessor to measure the impact of risk factors at both the macro and micro level. According to the findings of the case study, the model was objective, scientific, and realistic, and it provided a solid framework for risk prediction, mitigation, and control activities.
In the paper by Suresh and Dillibabu [48], a new hybridized fuzzy-based risk assessment framework was developed for software projects. The proposed technique identified and prioritized project risks during decision-making. Intuitionistic fuzzy-based TOPSIS, adaptive neuro-fuzzy inference system-based multi-criteria decision-making (ANFIS MCDM), and fuzzy decision-making trial and evaluation laboratory (DEMATEL) processes were used to improve software project risk assessment. An enhanced crow search algorithm (ECSA) was used to tune the ANFIS parameters for a more accurate software risk rating. Integrating ANFIS with ECSA led to solutions that avoided getting trapped in local optima and required only minor ANFIS parameter modifications. The NASA 93 dataset, comprising 93 software projects, was used for experimental validation. The experimental results showed that the proposed fuzzy-based framework properly evaluated software development project risks.
Quality Assessment
Table 4 shows the results of the quality assessment questions (QQ) presented earlier in the paper.
It is evident from the data that the great majority of QQs received positive responses. For QQ01, every primary study clearly stated its purpose. QQ04, QQ07, and QQ09 yielded largely unsatisfactory results: the validity and scope of the vast majority of the studies were not addressed. According to QQ11, the majority of publications are valuable additions to the existing literature, with only a few papers being merely somewhat valuable.
Table 5 shows the results of the quality analysis, classified into four categories: very high (9.5 or more), high (8 to 9), average (6.5 to 7.5), and low (6 and below). The table also summarizes the proportion of PS and the number of studies in each of the four categories.
Table 6 lists the PS with ‘very high’ or ‘high’ quality scores, together with their scores. These two categories comprise 11 studies that obtained an average of 8 or more quality analysis points during the evaluation procedure.
5.2. Answers to the Research Questions
Software risk prediction is essential for developing software with fewer setbacks, within budget and on time. The main purpose of this study is to examine how risks during the SDLC can be reduced using machine learning models or algorithms.
Figure 3 illustrates a year-by-year breakdown of the selected studies. It covers 15 years of data, from 2007 to 2021, for the 16 articles. There is a noticeable disparity in the distribution of articles across years. Our earliest acquired research publication on SRPM was published in 2007, and a number of research articles on this topic have been published since. Three papers were published in 2013. In the years that followed, the rate of publication decreased substantially, with five new publications appearing up to 2019. The figure also shows that most articles were published in 2020 and 2021, with four and three articles in those years, respectively. The overall picture implies that the published papers are unequally distributed, and consistent with this, we did not detect any sequential pattern in the distribution over time.
Depending on the type of dataset used, we divided the studies into two categories: public and private data. In contrast to public datasets, which are made available to everyone, private datasets are collected and used by individuals and are not made publicly available. Our investigation showed that 37.5% of the datasets used in the research were publicly accessible and 62.5% were private. The use of private datasets to train detection or prediction models is an important challenge: access to a private dataset is typically restricted, making it hard to compare the outputs of different machine learning models in practice.
The purpose of examining the size of the datasets is to determine the external validity of the research under discussion. When a large dataset is used rather than a small one, the external validity of the results is improved. In addition, the size of the dataset can affect the results of detection models. When trained on large-scale data, a detection model has a larger learning space, which increases the likelihood of producing better outcomes. The studies were separated into three groups according to the size of the datasets analyzed. We gathered the relevant information about the sample size of the datasets used in each of the 16 articles we reviewed. We defined a “Large” study as one that used more than 200 samples, a “Medium” study as one that used 100 to 200 samples, and a “Small” study as one that used fewer than 100 samples. A further class, “Unknown”, was defined for studies in which the sample size of the dataset was not reported.
Table 7 summarizes the number of samples in each of the datasets used, the number of studies, and the percentage of studies that fell into each category.
In the primary studies we selected, we found two major approaches to data analysis: (1) the machine learning approach and (2) the statistical approach. Most of the papers used a machine learning approach. Figure 4 shows the ratio of machine learning to statistical approaches.
The detection techniques employed in the development of SRPM models are divided into two major categories: (1) classification models and (2) regression models. Some other techniques were also applied, not for detection but for descriptive analyses. Figure 5 shows the statistical description of these techniques.
As our main concern in this study is software risk prediction using machine learning, we need to know how many of the studies used a machine learning approach. The ratio of ML approach studies can be seen in Figure 6.
Performance metrics are used to evaluate the performance of the prediction models. In general, several performance metrics are available for evaluation. These metrics are also used in the realm of SRPM research to assess and compare the findings obtained using various prediction approaches. The key performance indicators and their descriptions are as follows:
Correctly Classified Instances: The sum of True Positive (TP) and True Negative (TN) refers to correctly classified instances.
Incorrectly Classified Instances: The sum of False Positive (FP) and False Negative (FN) refers to incorrectly classified instances.
Accuracy: The number of correctly classified instances out of all the instances is known as accuracy. Accuracy can be expressed as:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
Precision: Precision is measured as the ratio of the number of correctly classified positive instances to the total number of instances classified as positive. Precision can be expressed as:

$\mathrm{Precision} = \frac{TP}{TP + FP}$
Recall: Recall is obtained by dividing the number of positive instances correctly classified as positive by the total number of positive instances. Recall can be expressed as:

$\mathrm{Recall} = \frac{TP}{TP + FN}$
F-Measure: The F-Measure is the harmonic mean of precision and recall, combining both into a single measure. F-Measure can be expressed as:

$F\text{-}\mathrm{Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
Receiver Operating Characteristic (ROC): A graphical way to evaluate the performance of a classifier is receiver operating characteristic (ROC) analysis. It evaluates a classifier's performance using two statistics: the true positive rate and the false positive rate [53].
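As a brief illustration of how the classification metrics above are typically computed in practice, the following sketch uses scikit-learn with small hypothetical label and score vectors (not taken from any primary study).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical ground-truth labels, predicted labels, and predicted risk scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F-Measure:", f1_score(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_score))   # area under the ROC curve
```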
Mean Absolute Error (MAE): The Mean Absolute Error (MAE) is a regression model assessment indicator. The MAE of a model with respect to a test set is the average of all individual prediction errors over all instances in the test set [54]. The discrepancy between the real value and the predicted value for each instance is called the prediction error [55]. MAE can be expressed as

$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$

where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, and $n$ is the total number of instances.
Mean Squared Error (MSE): The Mean Squared Error (MSE) is a model evaluation metric frequently used with regression models. The MSE of a model with respect to a test set is the average of all squared prediction errors over the test set, where the prediction error for an instance is the difference between the real value and the predicted value [55].
Root Mean Squared Error (RMSE): The Root Mean Squared Error (RMSE) is the standard deviation of the errors that occur when making predictions on a dataset. It is similar to MSE, except that the square root of the value is taken when assessing the model's accuracy.
Matthews Correlation Coefficient (MCC): The Matthews correlation coefficient (MCC) is a metric that indicates how closely the true classes and the predicted instances are related [56]. It can be expressed as

$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$
Kappa Statistic: Because the Kappa statistic takes the chance factor into account, it is important to consider the outcomes using this measure. If the Kappa statistic is close to one, the classification was successful beyond the chance factor.
Median Absolute Error (MedAE): The median absolute error is robust to outliers. The loss is derived by taking the median of all absolute deviations between the true values and the predictions. It can be expressed as:

$\mathrm{MedAE} = \mathrm{median}\left( \left| E_1 - E'_1 \right|, \ldots, \left| E_n - E'_n \right| \right)$

where $E_i$ is the true value, $E'_i$ is the predicted value, and $n$ is the total number of instances.
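Similarly, the regression and agreement metrics above (MAE, MSE, RMSE, MedAE, MCC, and the Kappa statistic) can be computed with scikit-learn; the values below are small hypothetical examples, not results from the primary studies.

```python
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             median_absolute_error, matthews_corrcoef,
                             cohen_kappa_score)

# Hypothetical regression outputs (e.g., predicted risk values).
y_true_reg = [0.20, 0.55, 0.80, 0.35]
y_pred_reg = [0.25, 0.50, 0.70, 0.40]
mse = mean_squared_error(y_true_reg, y_pred_reg)
print("MAE  :", mean_absolute_error(y_true_reg, y_pred_reg))
print("MSE  :", mse)
print("RMSE :", mse ** 0.5)                      # square root of the MSE
print("MedAE:", median_absolute_error(y_true_reg, y_pred_reg))

# Hypothetical classification outputs for the agreement-based metrics.
y_true_cls = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_cls = [1, 0, 1, 0, 0, 1, 1, 0]
print("MCC  :", matthews_corrcoef(y_true_cls, y_pred_cls))
print("Kappa:", cohen_kappa_score(y_true_cls, y_pred_cls))
```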
Table 8 shows the performance metrics used in the 16 selected papers.
We found that five performance metrics, Accuracy, Precision, Recall, F-Measure, and Mean Absolute Error (MAE), are the most frequently used measures, with Accuracy being the most common across the studies. Moreover, Precision, Recall, F-Measure, and MCC are commonly used when datasets are imbalanced. Therefore, considering the type of dataset, one can determine which performance measures should be chosen for the prediction models.
In this part, we summarize the performance of the PS on SRPM. We examined the results of all 16 PS to answer this question. Ranking the studies on the basis of performance is highly difficult: they use a variety of performance indicators, making it impossible to compare their results directly.
Table 9 shows the values of the five most frequently considered performance metrics (Accuracy, Precision, Recall, F-Measure, and MAE) employed in the investigations. For each performance metric, the highest value is highlighted.
The authors took various approaches in the primary studies, but the main research emphasis of the articles was to predict or assess software risk. Machine learning, statistical methods, and complex systems were all employed in the publications considered. Different kinds of classification and regression algorithms were employed in the machine learning approach, inference and HMM techniques were employed in the statistical approach, and the fuzzy DEMATEL approach was utilized for the complex system.
The discussion of the limitations and challenges of the software risk prediction models (SRPM) stated in the primary studies is summarized in the following paragraphs:
PS05 mentioned that the study had two limitations. First, the suggested technique cannot ensure that the data will yield a complete causal Bayesian network: due to the sample size constraint, the discovered causalities could only build a partial causality network. Second, the suggested approach can detect only a fraction of the underlying causalities.
PS01 used a diverse variety of datasets, including industrial projects that were not restricted to a single software firm or type of software. However, the research had difficulty deciding on the right value.
PS04 is a systematic literature review, and the authors mentioned that they chose articles only from Scopus and might therefore have missed some important articles. They also noted that further articles may have been missed during the study selection process.
PS07 indicates that the study developed a model for predicting risk control activities but was unable to establish the execution sequence of those activities.
The studies report the limitations and difficulties described above in software risk prediction. Future researchers can take these considerations into account when developing prediction models.