## 4.3.1. Selection Strategy

Recall that the partitioning heuristic splits a partition into two new subpartitions. Partition *P* contains all the points that are removed from the base partition, while partition *L* contains all the non-removed points (and thus holds fewer data points than the base partition). The *selection strategy* defines which points are removed from the base partition and put in partition *P*, and which points are kept to form partition *L*; it comes in a standard and an advanced version.

The *standard selection strategy* does not differ much from the (extended) calibration method, and defines that only on-time points can be eliminated from the base partition. As a result, partition *P* with the removed activities will obviously exhibit a pure Parkinson distribution (since all its points are on time), and no further statistical partitioning will be performed for partition *P*. Partition *L* can still consist of early, on-time and tardy points, and will be used further by the partitioning heuristic. As shown in Figure 5, no further partitioning will be performed for partition *P*, and its data are therefore discarded (cf. STOP in Figure 5), whereas the specific treatment of partition *L* (ACCEPT or CONTINUE) depends on the setting of the stopping strategy, which will be discussed in Section 4.3.2.

**Figure 5.** The four settings for the two strategies.

In the *advanced selection strategy*, not only on-time but *all* activities are potential candidates for removal, and thus both resulting partitions *L* and *P* can now contain early, on-time and tardy points. This approach is called advanced since it is fundamentally different from the approach taken by the calibration procedures (S2 and S3). The most important implication of the advanced setting is that partitions in which not all activities are on time can now be created *automatically*. Indeed, the base partition will be split by eliminating activities from it, putting them in partition *P* and keeping the remaining activities in partition *L*, until *L* attains (optimal) fit (this optimal fit will be defined by the stopping strategy discussed in the next section). The set of removed activities (partition *P*), however, can now contain both on-time and early/tardy activities (just like partition *L*) and will thus most likely not exhibit a trivial pure Parkinson distribution (as was the case for the on-time activities of partition *P* under the standard selection strategy). Therefore, this partition *P* of removed activities should also undergo a hypothesis test and possibly a partitioning phase, and so should all later partitions that are created as a result of this consecutive application of the partitioning heuristic. In that way, partitions that should comprise mutually similar activities are created automatically—hence the name statistical *partitioning* heuristic for the method. Unlike the initial managerial partitioning step, no human judgement interferes with this type of partitioning, which we will therefore call *statistical* partitioning from now on. Managerial criteria are thus no longer the sole basis for dividing activities into partitions, which addresses limitation 3 in Section 3.2.
Nevertheless, managerial partitioning can of course still be performed in combination with the partitioning heuristic, just like for the calibration procedures.

While the set of activities eligible for removal from the base partition differs between the standard (only on-time points) and the advanced (all points) selection strategy, the partitioning heuristic still needs to determine the sequence in which these activities are removed until a stopping criterion is met. Indeed, in contrast to the calibration procedures, the statistical partitioning heuristic needs to select which activity to eliminate in every partitioning step. The term *partitioning step* is used for an iteration of the partitioning heuristic in which one activity is removed. Thus, if there were 10 partitioning steps for a particular project or partition (under certain settings), then 10 activities were eliminated from that project or partition. For this purpose, the procedure calculates the residuals for all activities in the base partition. The residuals *e*<sub>*i*</sub> are calculated as the deviations between the empirical values ln(*RD*<sub>*i*</sub>/*PD*<sub>*i*</sub>) and the linear regression line of those values on the corresponding Blom scores. As a heuristic approach—hence the name statistical partitioning *heuristic*—the activity *i* with the largest residual *e*<sub>*i*</sub> in the base partition is selected for elimination (and put in partition *P*), since this is expected to yield the strongest improvement in the goodness of fit (the created partitions will again be subject to a new hypothesis test).
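To make the selection step concrete, the sketch below computes the residuals of ln(*RD*<sub>*i*</sub>/*PD*<sub>*i*</sub>) against Blom scores with only the Python standard library. The function and variable names are ours (not from the paper), ties in the ranks are ignored for simplicity, and we read "largest residual" as largest in absolute value:

```python
import math
from statistics import NormalDist

def selection_residuals(rd, pd_planned):
    """Residuals e_i of ln(RD_i/PD_i) regressed on Blom scores.
    rd / pd_planned hold the real and planned durations (illustrative names)."""
    y = [math.log(r / p) for r, p in zip(rd, pd_planned)]
    n = len(y)
    # Rank each observation (1 = smallest); ties are ignored in this sketch.
    rank = [0] * n
    for pos, i in enumerate(sorted(range(n), key=lambda k: y[k]), start=1):
        rank[i] = pos
    # Blom scores: standard-normal quantiles of (rank - 3/8) / (n + 1/4).
    x = [NormalDist().inv_cdf((rank[i] - 0.375) / (n + 0.25)) for i in range(n)]
    # Ordinary least-squares fit of y on the Blom scores.
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

# Toy example: seven activities, all planned at 10 time units.
e = selection_residuals([12, 10, 9, 14, 8, 11, 30], [10] * 7)
removed = max(range(len(e)), key=lambda i: abs(e[i]))  # candidate for partition P
```

In this toy data set the last activity (*RD* = 30 against *PD* = 10) yields the largest residual and would be the first point moved to partition *P*.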

## 4.3.2. Stopping Strategy

The selection strategy defines how the base partition is split into two different partitions by iteratively removing data points (activities) from it to create partitions *L* and *P*. Although this selection mechanism controls the sequence of points to be removed through the calculation of the residuals, it does not define any stopping criterion for this iterative removal process. To that end, the statistical partitioning heuristic also introduces two different versions of the stopping strategy. When the stopping criteria are satisfied, the removal of activities is halted, and the resulting partitions (*L* and *P*) are then subject to a new partitioning iteration (i.e., they go back to S1 first before they can possibly be split further).

The *standard stopping strategy* employs the *p*-value to define the stopping criterion. More specifically, the elimination of activities stops when *p* reaches or exceeds the significance threshold *α* = 0.05 for partition *L*. Since the *p*-value is also the condition for accepting the lognormality hypothesis in step S1, this implies that the lognormality test is automatically accepted for this partition *L*, and all its activities are assumed to follow the lognormal distribution. In this case, no further partitioning is necessary for partition *L* and all its data points are added to the database (cf. ACCEPT in Figure 5). The data points in partition *P* are treated differently, and the treatment depends on the option in the selection strategy. Indeed, since the partitioning heuristic is always applied anew to the newly created partitions, every partition *P* that is created should go back to step S1 and should be tested for lognormality if the advanced selection strategy is chosen. However, under the standard selection strategy, partition *P* only contains on-time points, and these points will obviously exhibit a pure Parkinson distribution. In this case, no further statistical partitioning will be performed and the data points are removed from the project (cf. STOP in Figure 5).
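Under the standard settings for both strategies, the remove-and-test loop can be summarized as follows. This is a sketch under our own assumptions: the activity records, the on-time predicate and the lognormality test are passed in as stand-ins, since the paper's concrete test statistic is not repeated here, and the residual-based removal order is simplified away:

```python
def standard_partitioning(points, is_on_time, lognormal_p, alpha=0.05):
    """Standard selection + standard stopping strategy (sketch).
    Removes on-time points from L until the lognormality test accepts L
    (p >= alpha) or no on-time candidates remain.
    Returns (L, P): the retained points and the removed (Parkinson) points."""
    L, P = list(points), []
    while lognormal_p(L) < alpha:
        candidates = [a for a in L if is_on_time(a)]
        if not candidates:
            break  # nothing left to remove under the standard strategy
        # The heuristic removes the candidate with the largest regression
        # residual; this sketch simply takes the first one.
        a = candidates[0]
        L.remove(a)
        P.append(a)
    return L, P

# Dummy run: pretend the test only accepts once L has shrunk to 3 points,
# and that the even-numbered activities are the on-time ones.
L, P = standard_partitioning(
    list(range(6)),
    is_on_time=lambda a: a % 2 == 0,
    lognormal_p=lambda L: 0.20 if len(L) <= 3 else 0.01,
)
```

On termination, *L* corresponds to ACCEPT (its points go to the database) and *P* to STOP (its pure-Parkinson points are discarded), matching Figure 5.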

In the *advanced stopping strategy*, the statistical partitioning is no longer limited to the *p*-value as the only measure of goodness of fit; instead, the activity removal halts when *SE*<sub>*Y*</sub> (or *R*<sup>2</sup><sub>*a*</sub> as a secondary stopping criterion) no longer improves. Indeed, it applies the standard error of the regression *SE*<sub>*Y*</sub> as the main basis for assessing the fit, since *SE*<sub>*Y*</sub> is the preferred measure for this purpose according to the literature. The formula for *SE*<sub>*Y*</sub> is given below:

$$SE\_Y = \sqrt{\frac{\sum\_{i=1}^{n} e\_i^2}{n-2}}.\tag{1}$$

The denominator is the number of activities in the partition *n* minus 2, since two coefficients need to be estimated in our case, namely the intercept and the slope of the regression line. *SE*<sub>*Y*</sub> is also chosen as the primary optimization criterion. By this, we mean that we deem the fit to the PDLC to be improved when the removal of the selected activity has decreased *SE*<sub>*Y*</sub>. Obviously, the lower *SE*<sub>*Y*</sub>, the better the fit. A perfect fit is obtained when all data points lie on the regression line: all residuals are then by definition zero, which, through Equation (1), implies that *SE*<sub>*Y*</sub> is also zero in that case. However, in about 20% of the cases, the partitioning heuristic did not reach the optimal *SE*<sub>*Y*</sub> when *only* *SE*<sub>*Y*</sub> was considered as optimization criterion; it got stuck in a local optimum. To get out of this local optimum, we added the adjusted *R*<sup>2</sup>, or *R*<sup>2</sup><sub>*a*</sub>, as a secondary stopping criterion, which—although a very straightforward approach—proved to be a highly effective solution to the problem. Indeed, after adding *R*<sup>2</sup><sub>*a*</sub> as a secondary optimization criterion, only 1% of the projects did not attain their optimal *SE*<sub>*Y*</sub>. For completeness, we mention the utilized formula for *R*<sup>2</sup><sub>*a*</sub> with respect to the standard coefficient of determination:
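Equation (1) and the combined stopping check translate directly into code. The `fit_improved` predicate below is our own reading of "stop when the primary criterion, or the secondary one, no longer improves", not a formula from the paper:

```python
import math

def se_y(residuals):
    """Standard error of the regression, Equation (1)."""
    n = len(residuals)
    return math.sqrt(sum(e * e for e in residuals) / (n - 2))

def fit_improved(se_before, se_after, r2a_before, r2a_after):
    """Advanced stopping check (a sketch): keep removing while SE_Y
    decreases, or -- to escape the local optima described above --
    while SE_Y stalls but the adjusted R^2 still increases."""
    return se_after < se_before or r2a_after > r2a_before
```

A perfect fit gives `se_y` of zero, since every residual in the sum vanishes.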

$$R\_a^2 = 1 - \frac{n-1}{n-2}(1 - R^2). \tag{2}$$

Notice that, unless *R*<sup>2</sup> = 1, *R*<sup>2</sup><sub>*a*</sub> is always smaller than *R*<sup>2</sup>. In our context, we need to employ *R*<sup>2</sup><sub>*a*</sub> instead of *R*<sup>2</sup> to allow the comparison of regression models with different numbers of observations (activities do indeed get removed from the original data set). Just like for the *p*-value, the higher the *R*<sup>2</sup><sub>*a*</sub>, the better the fit, with a maximum of 1 reflecting a perfect fit.
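Equation (2) is equally compact in code; the helper below (our naming) also illustrates why the correction matters when activities are removed and *n* shrinks:

```python
def r2_adj(r2, n):
    """Adjusted coefficient of determination, Equation (2),
    with n the number of activities left in the partition."""
    return 1 - (n - 1) / (n - 2) * (1 - r2)

# The same raw R^2 is penalized more heavily in a smaller partition:
big = r2_adj(0.90, n=50)   # ~0.898
small = r2_adj(0.90, n=5)  # ~0.867
```

Only when *R*<sup>2</sup> = 1 does the penalty vanish and `r2_adj` return exactly 1.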

As mentioned before, the two settings for the stopping strategy should be used in combination with the two settings for the selection strategy, and it is important to draw attention to the two fundamental differences with the calibration procedures. First, the treatment of the Parkinson points is fundamentally different. Recall that *all* on-time points are removed in the calibration procedures since they are assumed to be the result of the Parkinson effect. In the standard selection strategy, the procedure also removes on-time points, but it is no longer the case that the only possibility is to remove *all* on-time points from the project. The partitioning heuristic allows the elimination of just a fraction of the on-time points in order to get a better fit (as defined by the stopping strategy, i.e., the *p*-value or *SE*<sub>*Y*</sub>). The rationale behind this is that not all on-time points are necessarily the result of the Parkinson effect, as the calibration procedures implicitly assume. Some activities *are* actually on time and should thus effectively be part of partition *L*. Second, not only on-time points, but also early and tardy points are now subject to removal. While the calibration procedures only remove a portion of the tardy points to bring the numbers of early, on-time and tardy points back to their original proportions, the statistical partitioning heuristic takes a different approach and removes early, on-time and tardy points alike (under the advanced selection strategy) until the stopping criterion is satisfied. Such an approach creates partitions (*L* and *P*) that contain all kinds of activities (early, on-time and tardy) and that must be subject to further partitioning, if necessary, which is fundamentally different from the approach taken by the calibration procedures.
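Putting the advanced selection and stopping strategies together, one partitioning phase might look as follows. This is a self-contained sketch under our own simplifications: only *SE*<sub>*Y*</sub> is checked (the secondary *R*<sup>2</sup><sub>*a*</sub> criterion and the renewed hypothesis tests on *L* and *P* are omitted), the Blom-score regression is inlined so the block stands alone, and "largest residual" is again read as largest in absolute value:

```python
import math
from statistics import NormalDist

def blom_residuals(y):
    """Residuals of y = ln(RD/PD) regressed on its Blom scores (ties ignored)."""
    n = len(y)
    rank = [0] * n
    for pos, i in enumerate(sorted(range(n), key=lambda k: y[k]), start=1):
        rank[i] = pos
    x = [NormalDist().inv_cdf((rank[i] - 0.375) / (n + 0.25)) for i in range(n)]
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def se_y(e):
    """Standard error of the regression, Equation (1)."""
    return math.sqrt(sum(v * v for v in e) / (len(e) - 2))

def advanced_split(y):
    """Advanced selection + stopping (sketch): repeatedly move the point
    with the largest absolute residual from L to P while SE_Y improves."""
    L, P = list(y), []
    best = se_y(blom_residuals(L))
    while len(L) > 3:
        e = blom_residuals(L)
        i = max(range(len(L)), key=lambda k: abs(e[k]))
        trial = L[:i] + L[i + 1:]
        new = se_y(blom_residuals(trial))
        if new >= best:
            break  # SE_Y no longer improves: stop removing
        P.append(L[i])
        L = trial
        best = new
    return L, P

# One clear outlier among otherwise near-linear ln(RD/PD) values:
L, P = advanced_split([0.1, -0.05, 0.0, 0.05, -0.1, 2.0])
```

The outlier ends up in partition *P*; in the full heuristic, both *L* and *P* would then return to step S1 for a new hypothesis test.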
