*4.3. Statistical Partitioning*

This section shows how the statistical partitioning heuristic iteratively creates clusters of data with similar characteristics ((sub)partitions) based on statistical testing, much like the managerial partitioning approach, which creates such clusters based on human input. The statistical partitioning heuristic iteratively selects data points from the current partition and splits them into two separate clusters, and this process is repeated for each created cluster until the resulting subpartition can be accepted for lognormality. The way these partitions are split into subpartitions no longer requires human input but relies on two new statistical strategies.
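The repeated test-and-split loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions: `is_accepted` stands in for the lognormality hypothesis test and `split` for the splitting step, and both are hypothetical callables supplied by the caller. The `min_size` guard and the handling of a split that makes no progress are also assumptions added to guarantee termination; they are not taken from the procedure itself.

```python
from typing import Callable, List, Sequence, Tuple

Partition = List[float]


def partition_until_accepted(
    data: Sequence[float],
    is_accepted: Callable[[Partition], bool],
    split: Callable[[Partition], Tuple[Partition, Partition]],
    min_size: int = 2,
) -> List[Partition]:
    """Split partitions repeatedly until each one is accepted.

    is_accepted: hypothetical stand-in for the lognormality test.
    split:       hypothetical stand-in for the splitting step.
    min_size:    assumed lower bound below which no further split is tried.
    """
    final: List[Partition] = []
    pending: List[Partition] = [list(data)]
    while pending:
        current = pending.pop()
        # Keep partitions that pass the test or are too small to split.
        if len(current) < min_size or is_accepted(current):
            final.append(current)
            continue
        first, second = split(current)
        if not first or not second:
            # Assumption: a split that removes nothing (or everything)
            # makes no progress, so the partition is kept as it is.
            final.append(current)
            continue
        # Both new subpartitions are tested again in a later iteration.
        pending.extend([first, second])
    return final
```

Any acceptance test and any splitting rule with these signatures can be plugged in, which keeps the loop independent of the specific strategies chosen.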

The so-called *selection strategy* defines which points of the current partition should be selected for removal when splitting a partition. Each removed point is put in a first newly created subpartition, while the remaining non-removed points stay in a second new subpartition, which now contains fewer points than the original partition. This process of removing data points from the original partition continues until a certain stopping criterion is met, as defined by the so-called *stopping strategy*. Once the process stops, the original partition, which we will refer to as the *base partition*, will have been split into two separate subpartitions that are both subjected to the hypothesis test again and, if still not accepted, to further partitioning. In the remainder of this manuscript, the term *partition L* will be used to indicate the subpartition with the set of activities that have not been removed from the base partition, while the set of activities that were eliminated from the partition and put in a newly created subpartition is referred to as *partition P*. It should be noted that the naming of the two partitions *P* and *L* has its roots in the testing approach of the previously discussed calibration procedures. Recall that steps S2 and S3 remove all on-time points and a portion of the tardy points from a partition. These removed points are assumed to be subject to the Parkinson effect (hence, partition *P*) and are thus removed from the database. The remaining data points in the partition were subject to further testing for the lognormal distribution (hence, partition *L*) and, if accepted, are kept in the database. A similar logic is followed for the statistical partitioning heuristic, although the treatment of the two partitions *P* and *L* now depends on the selection and stopping strategies that will be discussed hereafter.
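The splitting of a base partition into partitions *L* and *P* can be illustrated with a short sketch. The two callables `select_next` and `should_stop` are hypothetical placeholders for the selection and stopping strategies, whose actual definitions are not reproduced here. Checking the stopping criterion before each removal and keeping at least one point in partition *L* are assumptions made for this illustration.

```python
from typing import Callable, List, Tuple


def split_base_partition(
    base: List[float],
    select_next: Callable[[List[float]], int],
    should_stop: Callable[[List[float], List[float]], bool],
) -> Tuple[List[float], List[float]]:
    """Split a base partition into (partition L, partition P).

    select_next: placeholder selection strategy; returns the index of
                 the next point to remove from partition L.
    should_stop: placeholder stopping strategy; receives the current
                 partitions L and P and returns True to end the removal.
    """
    partition_l = list(base)          # points not (yet) removed
    partition_p: List[float] = []     # points removed so far
    # Assumption: at least one point always remains in partition L.
    while len(partition_l) > 1 and not should_stop(partition_l, partition_p):
        index = select_next(partition_l)
        partition_p.append(partition_l.pop(index))
    return partition_l, partition_p


# Toy strategies, for illustration only: remove the largest value
# until no value in partition L exceeds 10.
def largest_value_index(points: List[float]) -> int:
    return max(range(len(points)), key=points.__getitem__)


def no_value_above_ten(l_points: List[float], p_points: List[float]) -> bool:
    return max(l_points) <= 10


l_part, p_part = split_base_partition(
    [3, 50, 7, 40, 9], largest_value_index, no_value_above_ten
)
print(l_part, p_part)  # [3, 7, 9] [50, 40]
```

Both returned partitions would then be tested again and, if not accepted, split further in the same way.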

Both the selection strategy and the stopping strategy can be performed under two different settings (*standard* or *advanced*), which results in 2 × 2 = 4 different ways in which the statistical partitioning heuristic can be performed. Although these two strategies cannot work in isolation, they are explained separately in Section 4.3.1 and Section 4.3.2. A summary is given in Figure 5.
