
**Figure 1.** The idea of calibrating project data.

The current paper focuses on extending the two currently existing methods to a third method, taking the weaknesses and shortcomings of the existing methods into account. A summary of the three methods is given below the *calibration* step of Figure 1. The first calibration method was proposed by [15] and has been validated on only 24 projects by [19]. The procedure consists of a sequence of tests that removes data from the empirical database until the lognormality test is accepted for the project as a whole (no clustering). More recently, this calibration method has been extended by [20] to include human partitioning as an initialisation step before the calibration starts its sequence of hypothesis tests. The underlying idea is that humans can better divide activities into clusters based on their knowledge about the project, and only afterwards does the calibration phase process the data of each cluster to test its lognormality. A summary of both calibration methods (i.e., the original calibration method and its extension with human clustering) is given in Section 3. It is important to review both procedures since they form the foundation of the newly developed statistical partitioning heuristic discussed in the current paper. In the remainder of this paper, we will refer to them as the *calibration procedure* and the *extended calibration method*. Since both procedures contain strong similarities, they will sometimes be referred to as the two calibration procedures. The new method presented in the current study (referred to as a *statistical partitioning heuristic*) builds further on these two calibration methods known in the literature. The new method still relies on the basic lognormal core assumption but extends the current calibration procedures with an automatic partitioning phase that defines clusters of activities, each of which has the same parameter values (mean and standard deviation) for its lognormal distribution. This method will be discussed in Section 4. In the computational experiment of Section 5, the three procedures will be tested on a set of 125 empirical projects (of which 83 could eventually be used for the analysis), and their performance will be compared. It will be shown that the new statistical partitioning heuristic outperforms the two other procedures but still contains some limitations that can serve as guidelines for future research.

We believe that the contribution and relevance of the current calibration study are threefold. First and foremost, the current study presents an extended calibration method that allows the project manager to test whether clusters of activities follow a lognormal distribution for their duration. When this hypothesis is accepted, the procedure returns the values of the parameters of this distribution (average duration and standard deviation) so that they can be used for forecasting the future progress of a new project using Monte Carlo simulations. Such simulations can then be performed using data from the past rather than arbitrarily chosen numbers, a practice often criticised in simulation studies in the literature. Second, the calibration method is an extension of two previously published methods that take the same two human biases (rounding and Parkinson) into account. The extensions consist of mixing human expertise with automatic statistical testing, as well as allowing partitioning during testing rather than treating the whole project as one cluster of identical activities. Finally, to the best of our knowledge, this is the first study that calibrates data on such a large empirical dataset of 83 projects collected over several years.
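To illustrate how calibrated parameters could feed such a Monte Carlo forecast, a minimal sketch is given below. The function name, parameter values and activity data are hypothetical and are not part of the calibration method itself; the sketch only shows how a fitted mean and standard deviation of ln(RD/PD) could be sampled for the activities of a new project.

```python
import numpy as np

def simulate_durations(planned_durations, mu, sigma, n_runs=1000, seed=42):
    """Monte Carlo sketch: sample RD/PD ratios from a fitted lognormal.

    `mu` and `sigma` are assumed to be the mean and standard deviation of
    ln(RD/PD) obtained from a calibrated (accepted) cluster of past activities.
    Returns an array of shape (n_runs, n_activities) with simulated real durations.
    """
    rng = np.random.default_rng(seed)
    planned = np.asarray(planned_durations, dtype=float)
    # One lognormal ratio per activity and per simulation run.
    ratios = rng.lognormal(mean=mu, sigma=sigma, size=(n_runs, planned.size))
    return ratios * planned

# Hypothetical example: three new activities planned at 5, 8 and 12 days,
# using parameter values that are illustrative only.
simulated = simulate_durations([5, 8, 12], mu=0.05, sigma=0.20)
print(simulated.mean(axis=0))  # expected real duration per activity
```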

Of course, our approach is only one possible way of improving the accuracy of duration estimates, and all results should be interpreted within this limitation. Moreover, implementing such a procedure in practice requires a certain level of maturity from the project manager, as it assumes that historical data are readily available. Consequently, using the new calibration method might require some additional effort as an initial investment to design a data collection methodology for past projects. Finally, even when project data are available, our approach is only beneficial if past projects are representative of future projects, which implies that some project characteristics are general and typical for the company. Consequently, if every project is unique and completely different from the previous portfolio of projects, calibrating data would be of no use or relevance.

Of course, other studies in the academic literature have also aimed at estimating distribution parameters. However, we believe our calibration method is the first approach that does so while taking the two biases into account, and we therefore compare the new calibration method only with the two other calibration procedures using the same two biases. We believe that, thanks to the automatic nature of the statistical testing in the new calibration method, our calibration method will contribute to better forecasting of new projects, and hence to reducing the inherent uncertainty of a project with minimal effort.

## **3. Calibration Procedures**

This section gives a short summary of the two existing procedures for calibrating data—the *calibration procedure* and the *extended calibration method*—discussed earlier. Both calibration procedures form the foundation of the current paper, which is why their main steps are repeated in Section 3.1. After this summary, the main shortcomings and areas for improvement of the extended calibration method are given in Section 3.2, and these limitations are then used to present the newly developed *statistical partitioning procedure* in Section 4.

## *3.1. Summary of Procedure*

The extended calibration method consists of five main building blocks, which are graphically summarised in Figure 2. Steps S1 to S4 are identical to the four steps of the original calibration method, apart from some small technical modifications. The extended calibration method adds an initialisation step S0 to these four steps to cluster the data into so-called human partitions. As mentioned, these five steps (S0 to S4) are used as foundations for the new statistical partitioning heuristic discussed later, which is why they are reviewed here.

**Figure 2.** Extended calibration method.

## Step 0 (S0). Human Partitioning

The starting point for developing the extended calibration method was inspired by the saying that "*data cannot replace human intuition*", and by the conviction that the human judgement and experience of the project manager should be taken into account when evaluating data of past projects. Indeed, the original calibration method was merely a sequence of statistical tests to calibrate data, and no human input whatsoever about the project was taken into account. It is, however, well justified to state that the wonders of the human brain, although not always very reliable and subject to biases, cannot simply be replaced by a statistical data analysis, and the extension therefore mainly focused on taking this "human expertise" into account. Consequently, in order to prevent potential users of the calibration method from complaining that their human intuition is completely ignored and replaced by a black-box statistical analysis, the gap between the dark secrets of statistical testing and human expertise was narrowed by adding a human initialisation phase (S0) that must be executed prior to the four remaining steps of the calibration method (S1 to S4).

This initialisation phase consists of a so-called *managerial partitioning* step that splits the project data into different clusters (called *partitions*). The general idea is that human expertise (the project manager's knowledge about the project data) should come before any statistical analysis to create clusters of project data with identical characteristics. Treating these clusters separately in the remaining steps S1 to S4, rather than analysing the project data as a whole, should give the statistical calibration method more power to accept some clusters of a project and reject others (rather than simply accepting or rejecting the project data as a whole). Consequently, the black-box analysis of the statistical calibration method is now preceded by a human input phase, which recognises that the activities of a project do not always adhere to one and the same probability distribution. Hence, the main contribution of the extension is the assumption that computing probability distributions for activities is best done by comparing clusters of completed activities in a project rather than treating the project data as one big homogeneous dataset.

As mentioned earlier, the four remaining steps (S1 to S4) are copied from the original calibration method, only slightly extended with some minor technical adaptations to increase the acceptance rate. The only difference is that these four steps are now carried out on the different partitions separately, instead of on the project data as a whole. Each of these partitions can now pass the lognormality test (accepted partitions are assumed to contain activities with lognormally distributed durations and are therefore added to the project database) or fail it (rejected partitions are discarded).

In a set of computational experiments, the authors have shown that managerial partitioning is a promising additional feature for calibrating data. Three managerial criteria have been taken into account to split the project data into partitions. More precisely, the project data were split up based on the *work packages* (WP) the activities belong to, the *risk profile* (RP) defined by the project manager, and the estimate of the *planned duration* (PD) of each activity. The extended calibration method has been tested on 83 empirical projects taken from [21] (mainly construction projects), and the results show that the additional human partitioning step increased the acceptance rate to 97% of the created partitions.
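Purely as an illustration of this initialisation step, the following sketch groups activities into partitions according to one of the three managerial criteria. The field names and the example data are hypothetical and not part of the original procedure.

```python
from collections import defaultdict

def managerial_partition(activities, criterion="WP"):
    """Sketch of the human partitioning step (S0): group activities by one of
    the three managerial criteria (WP, RP or PD). The dictionary keys used
    below ('wp', 'risk_profile', 'planned_duration') are illustrative only.
    """
    key_field = {"WP": "wp", "RP": "risk_profile", "PD": "planned_duration"}[criterion]
    partitions = defaultdict(list)
    for activity in activities:
        partitions[activity[key_field]].append(activity)
    return dict(partitions)

# Hypothetical example: partition a small project on its work packages.
activities = [
    {"id": 1, "wp": "WP1", "risk_profile": "high", "planned_duration": 5, "real_duration": 6},
    {"id": 2, "wp": "WP1", "risk_profile": "low", "planned_duration": 10, "real_duration": 10},
    {"id": 3, "wp": "WP2", "risk_profile": "low", "planned_duration": 8, "real_duration": 7},
]
print({wp: [a["id"] for a in acts] for wp, acts in managerial_partition(activities).items()})
```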

The four remaining steps of the calibration method are now briefly summarised below.

## Step 1 (S1). Hypothesis Testing (Lognormal Core)

Testing clusters (or partitions) of data using the four-phased statistical calibration method aims at creating a database of past project data (divided into clusters) in order to better understand and analyse the behaviour of new projects. For each cluster of past project data, it is assumed that the planned and real durations of its activities are known, and it is tested whether the durations of these activities follow a certain predefined probability distribution. Indeed, if the distribution of the activity durations is known, its parameters can be estimated and used for the analysis of a new project with similar characteristics. The hypothesis test of S1 will be repeated in the following steps (S2 to S4) until a final acceptance or rejection is reached. A detailed outline of the hypothesis test is given in the previously mentioned sources for the (extended) calibration method, and its main features are briefly repeated below.

*Testing variable*: The ratio between the real duration *RDi* and the planned duration *PDi* of each activity *i* is used as the test variable in each cluster. Obviously, when *RDi*/*PDi* < 1, activity *i* was completed early, *RDi*/*PDi* = 1 signals an on-time activity, while for *RDi*/*PDi* > 1, activity *i* suffered a delay (these will be referred to as tardy activities).

*Hypothesis test*: The hypothesis is now that the testing variable *RDi*/*PDi* follows a lognormal distribution for each activity *i* in the partition under study. This corresponds to testing whether *ln*(*RDi*/*PDi*) follows a normal distribution or not.

*Goodness-of-fit*: To assess whether the hypothesis can be accepted or not, a three-phased approach is followed. First, Pearson's linear correlation coefficient *R* is calculated by performing a linear regression of the test variable on the corresponding Blom scores [22]. The calculated *R* value can then be compared to the values tabulated by Looney and Gulledge [23] to obtain a *p*-value. Finally, the hypothesis is accepted when *p* ≥ *α*, with *α* the significance level (e.g., 5%). Each cluster that passes the test is immediately added to the database, while the remaining clusters are subjected to the calibration procedure.
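A minimal sketch of this goodness-of-fit step is shown below, assuming NumPy and SciPy are available. Since the Looney and Gulledge tables are not reproduced here, the sketch accepts the hypothesis when *R* reaches a user-supplied critical value rather than performing the actual *p*-value lookup; the example ratios are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def blom_scores(n):
    """Approximate expected normal order statistics (Blom scores)."""
    ranks = np.arange(1, n + 1)
    return norm.ppf((ranks - 0.375) / (n + 0.25))

def lognormality_test(ratios, r_critical=0.95):
    """Sketch of the S1 goodness-of-fit test: correlate the sorted ln(RD/PD)
    values with their Blom scores and accept lognormality when the correlation
    R reaches a critical value. In the actual procedure, R is converted to a
    p-value with the tables of Looney and Gulledge [23]; `r_critical` is a
    stand-in for that lookup.
    """
    x = np.sort(np.log(np.asarray(ratios, dtype=float)))
    r = np.corrcoef(blom_scores(x.size), x)[0, 1]
    return r >= r_critical, r

# Hypothetical cluster of RD/PD ratios.
accepted, r = lognormality_test([0.8, 0.9, 0.95, 1.05, 1.1, 1.2, 1.3])
print(accepted, round(r, 3))
```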

*Calibration*: If the hypothesis is not accepted (*p* < *α*), the project data of the cluster are not immediately thrown away. Instead, the data are calibrated, then subjected to the same hypothesis test again, and only then is a final evaluation and decision made. The term *calibration* is used since the procedure adapts/calibrates the data of a cluster by removing some of the data points. It assumes that certain data points in the cluster are subject to human biases and mistakes, and should therefore not be kept in the cluster, while the remaining points should be tested again in the same way as explained in S1. Two biases are taken into account: the *Parkinson effect* (S2 and S3) and *rounding errors* (S4).

## Steps 2/3 (S2 & S3). Parkinson's Law

The (clusters of) project data consist of activity durations of past projects, and since the data are collected by humans, they are likely to contain mistakes. Most of the project data used in the previously mentioned studies were collected using the so-called *project card approach* of [24], which prescribes a formal method to collect data of projects in progress, precisely to avoid such human input mistakes. Nevertheless, people are and will continue to be prone to making errors when reporting numbers, and possible mistakes due to optimism bias and strategic misrepresentation will continue to exist.

For this very reason, the (extended) calibration method takes the *Parkinson effect* into account, which states that work fills the allocated time. It recognises that the reported *RDi* values are not always accurate or trustworthy, and that they might bias the analysis and the acceptance rate of the lognormality hypothesis (S1). To overcome these biases, all on-time data points (S2) and a portion of the tardy data points (S3) are removed from the cluster before a new hypothesis test is performed.

*Remove on-time points (S2)*: The procedure assumes that *all* on-time points are hidden earliness points and should therefore be removed from the cluster. More precisely, all points that are falsely reported as being completed on time, i.e., each activity with *RDi*/*PDi* = 1 in a cluster that did not pass S1, are removed from the analysis. By taking this Parkinson effect into account, the cluster now only contains early and tardy points. Before a new hypothesis test can be performed, the proportion of early versus tardy points should be brought back to its original value, as explained in S3.

*Remove tardy points (S3)*: The removal of these on-time points (which were actually assumed to be early points) distorts the real proportion of early versus tardy points in the data cluster, and this distortion should be corrected first. Consequently, an equal *portion* of the tardy points must also be removed from the cluster to bring the data back to the original proportion of early and tardy activities. Note that the calculation of this proportion only defines *how many* tardy activities should be removed from the cluster, but does not specify which of these tardy points to remove. In the implementation of the original calibration procedure of [19], the tardy points were selected at random, while in the extended calibration method of [20], the tardy points were selected randomly in 1000 iterations and further analyses were carried out on these 1000 iterations to obtain more stable results.

After the removal of all on-time points and a portion of the tardy points, the hypothesis test of S1 is executed again on the remaining data in the cluster, which now contains a reduced number of activities. The same goodness-of-fit criteria are applied as discussed in S1, and only when the hypothesis cannot be accepted does the procedure continue with S4. Obviously, the data points of accepted clusters are added, as always, to the database.
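The following sketch illustrates one possible reading of the S2/S3 correction on a single cluster of RD/PD ratios. The proportion rule and the example data are assumptions for illustration only; the removed tardy points are drawn at random, in line with the random selection mentioned above.

```python
import numpy as np

def parkinson_correction(ratios, seed=0):
    """Sketch of S2 and S3 on a single cluster of RD/PD ratios.

    S2: drop all on-time points (ratio == 1), assumed to be hidden earliness.
    S3: drop a random portion of the tardy points (ratio > 1) to restore the
    original proportion of early versus tardy activities. The proportion rule
    used here (remove the same fraction of tardy points as the fraction of
    assumed-early points dropped in S2) is an assumption for illustration.
    """
    rng = np.random.default_rng(seed)
    r = np.asarray(ratios, dtype=float)
    early, on_time, tardy = r[r < 1], r[r == 1], r[r > 1]
    if on_time.size == 0:
        return r                                   # nothing to correct
    # Fraction of the assumed-early points (early + on-time) removed in S2.
    dropped_fraction = on_time.size / (early.size + on_time.size)
    n_tardy_keep = round(tardy.size * (1 - dropped_fraction))
    tardy_kept = rng.choice(tardy, size=n_tardy_keep, replace=False)
    return np.concatenate([early, tardy_kept])

# Hypothetical cluster: two early, two on-time and four tardy activities.
print(parkinson_correction([0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.3, 1.4]))
```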

## Step 4 (S4). Coarse Time Interval

In a final phase, the remaining cluster data are corrected for possible rounding errors made by the collector of the activity duration data. More precisely, data points with identical values for the test variable *RDi*/*PDi* are assumed to be mistakenly rounded up or down as a result of the coarseness of the time scale used for reporting the activity durations. For example, when the planned values of activity durations are expressed in weeks, it is likely that the real durations are also rounded up to weeks, even though the likelihood that the real duration was an integer number of weeks is relatively low. Therefore, corrections for rounding errors are taken into account by calculating average values of the Blom scores of these so-called tied points. More precisely, these tied points are not merged into a single score value with weight one, but rather treated as a set of coinciding points that retain their correct composite weight.
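As an illustration of this correction, the sketch below assigns each group of tied RD/PD values the average Blom score of the ranks they occupy, so that the tied points coincide but keep their composite weight; the ratios in the example are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def tie_corrected_blom_scores(ratios):
    """Sketch of the S4 rounding correction: tied RD/PD values are kept as
    coinciding points, each receiving the average Blom score of the ranks
    they occupy, so the tied group retains its composite weight.
    """
    x = np.sort(np.asarray(ratios, dtype=float))
    n = x.size
    scores = norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))  # Blom score per rank
    corrected = np.empty(n)
    for value in np.unique(x):
        tied = x == value
        corrected[tied] = scores[tied].mean()      # average over the tied ranks
    return x, corrected

# Hypothetical cluster in which two activities share the same rounded ratio.
values, scores = tie_corrected_blom_scores([0.8, 1.0, 1.0, 1.2, 1.5])
print(values, np.round(scores, 3))
```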

In the study of [20], different implementations of S4 have been tested, applying the rounding error correction with and without the Parkinson correction steps (S2 and S3). It has been shown that rounding correction (S4), although beneficial for calibrating data, is less important for accepting the hypothesis than correcting the data for the Parkinson effect, which is why S4 is only taken into account after S3, as initially proposed in the original calibration procedure.
