Article

Clustering Empirical Bootstrap Distribution Functions Parametrized by Galton–Watson Branching Processes

1 Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal
2 Centro de Estatística e Aplicações, Faculdade de Ciências, Universidade de Lisboa, Campo Grande, 1749-016 Lisboa, Portugal
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(15), 2409; https://doi.org/10.3390/math12152409
Submission received: 30 June 2024 / Revised: 24 July 2024 / Accepted: 28 July 2024 / Published: 2 August 2024
(This article belongs to the Special Issue Stochastic Processes and Its Applications)

Abstract: The nonparametric bootstrap has been used in cluster analysis for various purposes. One of those purposes is to account for sampling variability. This can be achieved by obtaining a bootstrap approximation of the sampling distribution function of the estimator of interest and then clustering those distribution functions. Although the consistency of the nonparametric bootstrap in estimating transformations of the sample mean has been known for decades, little is known about how it carries over to clustering. Here, we investigated this problem with a simulation study. We considered single-linkage agglomerative hierarchical clustering and a three-type branching process for parametrized transformations of random vectors of relative frequencies of the possible types of the index case of each process. In total, there were nine factors and 216 simulation scenarios in a fully factorial design. The ability of the bootstrap-based clustering to recover the ground-truth clusterings was quantified by the adjusted transfer distance between partitions. The results showed that in the best 18 scenarios, the average value of the distance was less than 20 percent of the maximum possible distance value. We noticed that the results most notably depended on the number of retained clusters, the distribution for sampling the prevalence of types, and the sample size appearing in the denominators of the relative frequencies. The comparison of the bootstrap-based clustering results with so-called uninformed random partitioning results showed that, in the vast majority of scenarios considered, the bootstrap-based approach led, on average, to remarkably lower classification errors than random partitioning.

1. Introduction

Clustering is a technique for finding groups within data where the elements belonging to the same group (or cluster) are similar according to a specific clustering variable. Although cluster analysis has roots in anthropology, it has quickly spread to various fields, such as biology, medicine, remote sensing, pattern recognition, economics, market research, and social network modeling.
Cluster analysis is usually an unsupervised learning technique, meaning there are no labels or gold standards to rely on. Consequently, it is a very challenging procedure because no straightforward measures exist to evaluate its performance. Assessing the quality of a clustering is of the greatest importance [1] and has drawn the attention of researchers in recent years.
An issue in cluster analysis is the stability of the clustering results [1]. Two critical aspects of clustering are the data that are used as input and the clustering method. When the data are a sample from a population (usually the case), typical clustering algorithms like k-means and hierarchical clustering do not account for the sampling variability and variability in the data due to measurement errors. Consequently, the clustering result (a partition) is merely a point estimate.
One way to overcome this limitation is to construct data replications by sampling with replacement from the original dataset. Another way to address it is to perform cluster validation via data splitting or subsampling. Ref. [1] provides a detailed explanation of these techniques. In this paper, we focused on the first approach; that is, sampling variability was addressed by incorporating the nonparametric bootstrap [2] into the clustering procedure. For k-means clustering, ref. [3] showed through a simulation study that the nonparametric bootstrap can help obtain confidence information about the cluster centers and recover cluster memberships of data objects.
Another point worth considering is the study of unsupervised learning techniques within the framework of stochastic processes, particularly time series analysis. These methodologies have attracted the attention of researchers due to their ability to describe complex real-life situations, such as those in biological and medical studies [4,5]. This framework allows for using domain-specific models, such as stochastic process models, in the clustering process. The models may help to set the focus of the clustering process. For example, the interest may be to perform clustering based on the forecasting properties of a time series [6], or to study the statistical properties of clustering algorithms in the context of stationary ergodic processes, either in distribution or in covariance (see [4] for a review of this topic). Ref. [7] proposed a novel clustering approach that formulates the problem of determining the most homogeneous clusters by exploring the stochastic nature of the main clustering variable by means of estimating its best-fit continuous probability distribution.
Multi-type Galton–Watson branching processes have been applied to model various phenomena such as the growth of bacterial populations, epidemic propagation, and cell kinetics [8,9,10,11]. However, such processes have hardly been used in clustering. Nowadays, when data come from different sources within one domain, stochastic process models may be useful for meaningfully combining these data. For example, in epidemiology, commonly tracked quantities are the reproduction number and the prevalence; both could be incorporated into a branching process model.
Here, we developed a method that unifies the bootstrap and stochastic process model approaches to clustering in a certain way. To the authors’ knowledge, this is a pioneering work in which we carried out a single linkage agglomerative hierarchical clustering in the framework of the three-type Galton–Watson branching processes; the bootstrapping techniques were used to quantify the uncertainty of the estimates obtained from the clustering analysis. The grounding for our work is that the bootstrap is consistent for estimating certain functions of the sample mean (Theorem 23.5, p. 331, [12]), and our work will shed light on how this carries over to clustering in the proposed setting. The method consists of three parts, which are explained in the following paragraph.
First, we assumed we have n three-type Galton–Watson processes. For each process, we defined the parameter of interest as the expected total number of realized individuals in a finite number of generations. Considering that the index case is a categorical random variable, possibly with different distributions across the n models, we defined an estimator for each parameter. The second part involved the bootstrap estimation of the parameter of interest for each process, leading to n Monte Carlo approximated bootstrap distribution functions of the model parameter estimators. The third part involved hierarchical agglomerative variable-group clustering of those distribution functions [13].
Variable-group hierarchical clustering is an extension of the traditional pair-group hierarchical clustering, which allows the handling of tied clusters algorithmically. As for the other two hyperparameters involved in clustering, we have used the supremum metric (as in pp. 150–151 [14], but with the domain being the whole real line $\mathbb{R}$) to calculate the distance between distribution functions, and we considered the single-linkage method to determine the closest clusters in each step of hierarchical clustering. The single-linkage method has been criticized for its chaining tendencies [15], and it can be argued that the supremum distance between distribution functions, being bounded from above by 1, is not well-suited for measuring widely separated distribution functions. However, the reason behind choosing those two hyperparameters is that, in our opinion, it is a promising way to see the implications of the bootstrap consistency for the clustering results. Notably, the consistency of the bootstrap is stated in terms of the supremum metric, and the order-based definition of the single-linkage dissimilarity between clusters appears to fit well with the order-based concept of supremum. Nevertheless, we did not undertake analytical work to study the method.
We performed a simulation study to analyze the performance of the proposed method. We aimed to evaluate how well the partitions obtained by the nonparametric bootstrap distribution functions agree with the comparison partitions obtained based on the transformed multinomial distribution functions. Some of the parameters of the ground-truth sampling distribution functions were sampled from pre-specified distributions (type-specific prevalence), and others were chosen in a fixed manner (e.g., type-specific reproduction numbers for Galton–Watson processes). The performance was quantified with the adjusted transfer distance between set partitions, also called the classification error distance [16]. This distance function has reasonable properties [16] and a known upper limit [17] that facilitates the interpretation of its values, and it can be defined intuitively in terms of admissible transformations that turn one partition into another [18].
The outline of this article is as follows. In Section 2.1.1, we define the model parameter based on a three-type Galton–Watson branching process model with the purpose of using it in the simulation study. In Section 2.1.2, we describe the branching process and the bootstrap-based clustering method that we developed. In Section 2.2, we formulate the questions that we seek to answer in the simulation study. The general overview of the simulation study is in Section 2.3, Section 2.4 and Section 2.5, while the results of the simulation study are given in Section 3. The interpretation of the results, as well as our work’s limitations and possible extensions, are located in Section 4. The entire computer implementation of the study was undertaken in R software, version 4.1.2. The computer code is available in the Supplementary Materials.

2. Simulation Study

The simulation study is structured according to the MADEP framework (M—methods, A—aims, D—data-generating mechanisms, E—estimands, P—performance measures), which differs from the ADEMP structure introduced by [19] only in that we have moved the methods part to the beginning.
The material of this section follows the first author’s previous work [20].

2.1. Methods

In Section 2.1.1, we outline the derivation of the branching process model parameter.

2.1.1. The Model

Let $T$ be the set $\{1, 2, 3\}$, which we call the set of types, and let $T_2 = \{1, 2\}$ be its subset. Let $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$, where $\mathbb{N}$ is the set of all positive integers.
Let $\delta_{ij} = 1$ if $i = j$, and $\delta_{ij} = 0$ if $i \neq j$. Suppose that $\mathbf{Z}_0(\tau_0) = (Z_0^{(1)}(\tau_0), Z_0^{(2)}(\tau_0), Z_0^{(3)}(\tau_0))$ is a random vector with the constant value $\mathbf{e}_{\tau_0} = (\delta_{1\tau_0}, \delta_{2\tau_0}, \delta_{3\tau_0})$.
Following [8], we consider a three-type Galton–Watson branching process with the index case of type $\tau_0$ from $T$ as
$$\mathbf{Z}_0(\tau_0) = (Z_0^{(1)}(\tau_0), Z_0^{(2)}(\tau_0), Z_0^{(3)}(\tau_0)) = \mathbf{e}_{\tau_0}, \qquad \mathbf{Z}_{i+1}(\tau_0) = (Z_{i+1}^{(1)}(\tau_0), Z_{i+1}^{(2)}(\tau_0), Z_{i+1}^{(3)}(\tau_0)) = \sum_{\tau=1}^{3} \sum_{j=1}^{Z_i^{(\tau)}(\tau_0)} \mathbf{X}_{ij}^{(\tau)}, \quad i \in \mathbb{N}_0,$$
where the sum from $j = 1$ to $j = 0$ is defined to be the zero vector; $(\mathbf{X}_{ij}^{(\tau)})_{(i,\tau,j) \in \mathbb{N}_0 \times T \times \mathbb{N}}$ is a collection of independent random vectors, each with a probability mass function (also called the offspring law, or reproduction law)
$$q_\tau(k_1, k_2, k_3) = P(\{\mathbf{X}_{ij}^{(\tau)} = (k_1, k_2, k_3)\}), \quad (k_1, k_2, k_3) \in D_\tau \subseteq \mathbb{N}_0^3,$$
where $D_\tau$ is the set of all triples from $\mathbb{N}_0^3$ on which $q_\tau(k_1, k_2, k_3)$ takes positive values. Let $\mathbf{m}$ be the following third-order square matrix (the so-called mean matrix of the process) over the set of all non-negative real numbers $\mathbb{R}_+$,
$$\mathbf{m} = \begin{pmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ 0 & 0 & m_{33} \end{pmatrix}, \qquad (1)$$
so that for all $\tau_0$ and $\tau$ from $T$, $m_{\tau_0 \tau}$ is the expectation of the random variable $Z_1^{(\tau)}(\tau_0)$, denoted by $E(Z_1^{(\tau)}(\tau_0))$.
Before proceeding with the definition of the model parameter of interest to us, we highlight some important assumptions in the Galton–Watson branching process model.
  • The model is a generation counting process, i.e., it represents the counts of individuals of different types in each generation. One time unit corresponds to one generation.
  • Once the time increases by one unit, all the ancestors ‘die’, and each of them is replaced by a random number of descendants (of various types).
  • All individuals reproduce independently, i.e., the number of descendants (of various types) of one individual has no bearing on the number of descendants of another individual.
  • All individuals of the same type have the same reproduction law, which does not change in time. The reproduction law can only depend on the type of an individual.
  • The specific form of the mean matrix m assumes that the individuals of type 1 and type 2 can produce individuals of all three types. The individuals of type 3 cannot generate individuals of type 1 and 2, and they can only produce individuals of their own type.
For any fixed non-negative integer r, with the next equality, we define a random variable W r ( τ 0 ) representing the total population size summed over time points from 0 to r and over the first two types, all generated by a single index case (i.e., the individual at time 0) of type τ 0 :
$$W_r(\tau_0) = \sum_{i=0}^{r} \mathbf{Z}_i(\tau_0)\, (1, 1, 0)^\top,$$
where the symbol $\top$ denotes transposition.
For every $\tau_0 \in T$, the probability mass function of $W_r(\tau_0)$ is denoted by $f_{r,\tau_0}(\cdot)$, and we assume that the expected value $E(W_r(\tau_0))$ is finite. The function $f_{r,\tau_0}(\cdot)$ has support contained in $\mathbb{N}_0$ for every $\tau_0$.
We consider the prevalence $p_{\tau_0}$ of individuals of type $\tau_0$ to be positive for every type $\tau_0$ from $T$, and assume that the numbers $p_1, p_2, p_3$ sum to 1. For fixed $r$, we consider the mixture probability mass function:
$$h_r(j) = \sum_{\tau_0 \in T} p_{\tau_0}\, f_{r,\tau_0}(j), \quad j \in \mathbb{N}_0. \qquad (2)$$
Suppose that $W_r$ is a random variable with the probability function given by Equation (2). We derive the mean value $\mu$ of the random variable $W_r$ as follows:
$$\mu = \sum_{j=0}^{\infty} j\, h_r(j) = \sum_{j=0}^{\infty} j \sum_{\tau_0 \in T} p_{\tau_0} f_{r,\tau_0}(j) = \sum_{j=0}^{\infty} \sum_{\tau_0 \in T} p_{\tau_0}\, j\, f_{r,\tau_0}(j) = \sum_{\tau_0 \in T} p_{\tau_0} \sum_{j=0}^{\infty} j\, f_{r,\tau_0}(j) = \sum_{\tau_0 \in T} p_{\tau_0}\, E(W_r(\tau_0)). \qquad (3)$$
To find a formula for $\mu$, we express $E(W_r(\tau_0))$ in terms of the elements of the matrix $\mathbf{m}$. For that, we use the relationship
$$E(\mathbf{Z}_i(\tau_0)) = \mathbf{e}_{\tau_0}\, \mathbf{m}^i, \qquad (4)$$
where $i \in \mathbb{N}_0$ and $\mathbf{m}^i$ is the $i$-th power of the matrix $\mathbf{m}$ [8]. By the definition of the expectation of a random vector, linearity of expectation, and formula (4),
$$E(W_r(\tau_0)) = E\!\left[\left(\sum_{i=0}^{r} Z_i^{(1)}(\tau_0),\ \sum_{i=0}^{r} Z_i^{(2)}(\tau_0),\ \sum_{i=0}^{r} Z_i^{(3)}(\tau_0)\right)(1, 1, 0)^\top\right] = \left(\sum_{i=0}^{r} E(Z_i^{(1)}(\tau_0)),\ \sum_{i=0}^{r} E(Z_i^{(2)}(\tau_0)),\ \sum_{i=0}^{r} E(Z_i^{(3)}(\tau_0))\right)(1, 1, 0)^\top = \left(\sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_1^\top,\ \sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_2^\top,\ \sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_3^\top\right)(1, 1, 0)^\top = \sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_1^\top + \sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_2^\top. \qquad (5)$$
To further express (5), we compute $\mathbf{m}^i$. For that, we use the fact that the matrix $\mathbf{m}$ is block upper triangular, of the form
$$\mathbf{m} = \begin{pmatrix} \mathbf{m}_{2\times 2} & \mathbf{u} \\ \mathbf{0}_{1\times 2} & (m_{33}) \end{pmatrix},$$
where
$$\mathbf{m}_{2\times 2} = \begin{pmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{pmatrix}, \qquad \mathbf{0}_{1\times 2} = \begin{pmatrix} 0 & 0 \end{pmatrix}, \qquad \mathbf{u} = \begin{pmatrix} m_{13} \\ m_{23} \end{pmatrix},$$
so that $\mathbf{m}^i$, for some two-by-one matrix $\mathbf{v}$, is of the form
$$\begin{pmatrix} \mathbf{m}_{2\times 2}^i & \mathbf{v} \\ \mathbf{0}_{1\times 2} & (m_{33}^i) \end{pmatrix}.$$
Therefore, considering $\tau_0 = 3$, we have $\sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_1^\top = \sum_{i=0}^{r} \mathbf{e}_{\tau_0}\mathbf{m}^i \mathbf{e}_2^\top = 0$, so $E[W_r(3)] = 0$, which allows us to simplify formula (3) for $\mu$.
On the other hand, to find $E(W_r(\tau_0))$ for $\tau_0 \in T_2$, it suffices to compute $\mathbf{m}_{2\times 2}^i$:
$$E(W_r(\tau_0)) = \sum_{i=0}^{r} \mathbf{e}_{\tau_0,2}\, \mathbf{m}_{2\times 2}^i\, \mathbf{e}_{1,2}^\top + \sum_{i=0}^{r} \mathbf{e}_{\tau_0,2}\, \mathbf{m}_{2\times 2}^i\, \mathbf{e}_{2,2}^\top,$$
where $\mathbf{e}_{\tau_0,2}$ is the vector $(\delta_{1\tau_0}, \delta_{2\tau_0})$ consisting of the first two elements of $\mathbf{e}_{\tau_0}$, in the same order as they appear in $\mathbf{e}_{\tau_0}$.
Assuming that the eigenvalues $\lambda_1, \lambda_2$ of the matrix $\mathbf{m}_{2\times 2}$ satisfy $\lambda_1 \neq \lambda_2$ and $\lambda_2 \neq 1$, then by (p. 7, Equation (2.7), [20]),
$$\sum_{i=0}^{r} \mathbf{e}_{\tau_0,2}\, \mathbf{m}_{2\times 2}^i\, \mathbf{e}_{j,2}^\top = \frac{\delta_{\tau_0 j}\lambda_2 - m_{\tau_0 j}}{\lambda_2 - \lambda_1} \sum_{i=0}^{r} \lambda_1^i + \frac{m_{\tau_0 j} - \delta_{\tau_0 j}\lambda_1}{\lambda_2 - \lambda_1} \sum_{i=0}^{r} \lambda_2^i, \quad j = 1, 2,$$
so that the expected values $E(W_r(\tau_0))$, $\tau_0 = 1, 2$, in (3) are
$$E[W_r(\tau_0)] = \frac{\lambda_2 - (m_{\tau_0 1} + m_{\tau_0 2})}{\lambda_2 - \lambda_1} \sum_{i=0}^{r} \lambda_1^i + \frac{(m_{\tau_0 1} + m_{\tau_0 2}) - \lambda_1}{\lambda_2 - \lambda_1} \cdot \frac{1 - \lambda_2^{r+1}}{1 - \lambda_2},$$
where
$$\sum_{i=0}^{r} \lambda_1^i = \begin{cases} \dfrac{1 - \lambda_1^{r+1}}{1 - \lambda_1}, & \text{if } \lambda_1 \neq 1, \\[2mm] r + 1, & \text{if } \lambda_1 = 1, \end{cases}$$
$$\lambda_{1,2} = \frac{\operatorname{Tr}(\mathbf{m}_{2\times 2}) \pm \sqrt{(\operatorname{Tr}(\mathbf{m}_{2\times 2}))^2 - 4\det(\mathbf{m}_{2\times 2})}}{2},$$
with the trace and the determinant, respectively, being
$$\operatorname{Tr}(\mathbf{m}_{2\times 2}) = m_{11} + m_{22}, \qquad \det(\mathbf{m}_{2\times 2}) = m_{11} m_{22} - m_{12} m_{21}.$$
We refer to the value
$$\mu = E(W_r(1))\, p_1 + E(W_r(2))\, p_2$$
as the model parameter. We will denote it briefly by $c_1 p_1 + c_2 p_2$, where $c_i = E(W_r(i))$.
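To make the use of these coefficients concrete, the following R sketch (our own illustration, not the authors' released code) computes $c_1 = E(W_r(1))$ and $c_2 = E(W_r(2))$ directly from formula (5), restricted to the first two types, by summing the powers $\mathbf{m}_{2\times 2}^0, \ldots, \mathbf{m}_{2\times 2}^r$; the example matrix is one of the mean matrices used later in the simulation study, read row-wise (an assumption on our part).

# c_tau0 = E(W_r(tau0)) equals the tau0-th row sum of I + m2 + ... + m2^r,
# where m2 is the top-left 2x2 block of the mean matrix.
expected_total <- function(m2, r) {
  acc <- diag(2)      # current power m2^i, starting at m2^0 = I
  total <- diag(2)    # running sum of the powers
  for (i in seq_len(r)) {
    acc <- acc %*% m2
    total <- total + acc
  }
  rowSums(total)      # c(E(W_r(1)), E(W_r(2)))
}

m2 <- matrix(c(0.6, 0.19, 0.27, 0.87), nrow = 2, byrow = TRUE)
expected_total(m2, r = 5)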

2.1.2. Proposed Method

A step-by-step description of the proposed method is as follows.
  • Specify n populations of individuals (bacteria or cells in different test tubes, people in various countries, etc.) for which the Galton–Watson branching process model could be applied. In each population, we assume that there are individuals of three types labeled by 1 , 2 , 3 (e.g., there can be mutant cells of type A, mutant cells of type B, and non-mutant cells). The prevalence of individuals of type j in population i we denote by p i j . The goal is to find groups of populations such that populations within each group are similar to each other in some sense but more dissimilar to populations in other groups.
  • Take a random sample of size N from each of the n populations and record the type of each individual sampled. The result of this step is n samples of types t i , i = 1 , , n , where each of the N elements in the vector t i is either 1 , 2 or 3.
  • For each population, there is a corresponding three-type Galton–Watson branching process. Specify the number of generations r and the mean matrix m of interest in the branching process model; we assume that these two choices are the same regardless of the population.
• Compute the coefficients $E(W_r(1))$, $E(W_r(2))$ in the model parameter formula using Equation (5) or Equation (6) (the latter can be used provided the eigenvalues $\lambda_1, \lambda_2$ of the top-left $2 \times 2$ submatrix of $\mathbf{m}$ satisfy $\lambda_1 \neq \lambda_2$ and $\lambda_2 \neq 1$).
  • The general idea of this step is, for each of n populations, to apply the nonparametric bootstrap to estimate the distribution function of the model parameter estimator
$$\hat{\mu}_i = c_1 \frac{\sum_{j=1}^{N} I_{\{1\}}(T_{ij})}{N} + c_2 \frac{\sum_{j=1}^{N} I_{\{2\}}(T_{ij})}{N},$$
where the random variable $T_{ij}$ has the categorical distribution $P(\{T_{ij} = \tau_0\}) = p_{i\tau_0}$, $\tau_0 \in T$, with $p_{i1} + p_{i2} + p_{i3} = 1$, and $I_A$ is the indicator function of the set $A$. We assume that, for each $i = 1, \ldots, n$, the random variables $T_{i1}, \ldots, T_{iN}$ are independent. From a sampling theory point of view, this is guaranteed, for example, if the selection method is simple random sampling with replacement (pp. 21–22, [21]). Specifically, let $t_{ij}$ be the $j$-th entry of the observed sample of types $t_i$. The bootstrap begins by defining $\hat{F}_i$, the empirical categorical distribution based on $t_i$:
$$\hat{F}_i(x) = \frac{1}{N} \sum_{j=1}^{N} I_{\{y \in \mathbb{R}:\, y \le x\}}(t_{ij}). \qquad (9)$$
Choose the number $R$ of bootstrap replications to be used uniformly in all $n$ cases. Then, the Monte Carlo approximation to the nonparametric bootstrap consists of the following steps. Take $R$ bootstrap replications $t^*_{il}$ of size $N$, $l = 1, \ldots, R$, from each $t_i$, $i = 1, \ldots, n$. Let $t^*_{ilj}$ denote the $j$-th entry in the sample of types $t^*_{il}$ of the $i$-th population and $l$-th bootstrap replicate. Calculate the model parameter estimate $\mu^*_{il}$ for each replication $t^*_{il}$, $l = 1, \ldots, R$:
$$\mu^*_{il} = c_1 \frac{\sum_{j=1}^{N} I_{\{1\}}(t^*_{ilj})}{N} + c_2 \frac{\sum_{j=1}^{N} I_{\{2\}}(t^*_{ilj})}{N}.$$
This leads to approximate bootstrap distribution functions $G_1, \ldots, G_n$ of the model parameter estimator:
$$G_i(x) = \frac{1}{R} \sum_{l=1}^{R} I_{\{y \in \mathbb{R}:\, y \le x\}}(\mu^*_{il}). \qquad (10)$$
  • This step comprises the computation of the distance matrix needed for the subsequent clustering process. We suggest computing the distance between population i and j, that is, the entry ( i , j ) of the distance matrix D G , as the supremum distance between the distribution functions G i and G j found in the previous step:
$$d_S(G_i, G_j) = \sup_{x \in \mathbb{R}} |G_i(x) - G_j(x)|.$$
  • Apply the hierarchical variable-group agglomerative single linkage clustering using the R software package ‘mdendro’ function ‘linkage’ and using the distance matrix D G as an input.
• For simplicity, let us assume that the set of labels for the populations is $U = \{1, 2, \ldots, n\}$. Fix the number $k$ of clusters to retain ($1 \le k \le n$). To obtain the partition of $U$ into $k$ clusters (if it exists), we use the so-called ‘cutting the multidendrogram’ method. A more detailed explanation of the method is as follows.
Recall that a partition $\mathcal{A}$ is said to be finer than a partition $\mathcal{A}'$ (equivalently, $\mathcal{A}'$ is coarser than $\mathcal{A}$), denoted by $\mathcal{A} \preceq \mathcal{A}'$, if for each cluster $A$ in $\mathcal{A}$ there exists a cluster $B$ in $\mathcal{A}'$ such that $A$ is a subset of $B$. The set of all partitions of $U$, denoted by $\Pi(U)$, together with the relation $\preceq$ is a lattice. The algorithm starts with the trivial partition $\mathcal{A}_1 = \{\{x\} : x \in U\}$. Each time step 2) of the variable-group hierarchical agglomerative clustering algorithm is completed (see [13], pp. 49–50), a new partition of $U$ is obtained by replacing all the clusters that were merged into a strictly larger cluster in that step of the algorithm with the corresponding newly formed superclusters. For example, if initially we had $\mathcal{A}_1$, and then after completing part 2) of the algorithm, clusters $\{1\}, \{2\}$ merged and clusters $\{3\}, \{4\}$ merged, then $\mathcal{A}_2$ will be $\{\{1,2\}, \{3,4\}, \{5\}, \{6\}, \ldots, \{n\}\}$; i.e., $\{1\}$ and $\{2\}$ in $\mathcal{A}_1$ were replaced by $\{1,2\}$, and similarly $\{3\}$ and $\{4\}$ in $\mathcal{A}_1$ were replaced by $\{3,4\}$.
Each time step 2) of the algorithm finishes, we obtain a partition coarser than the previous one, because the new partition is obtained by merging at least two of the clusters of the previous partition. Once the trivial partition $\{U\}$ is formed, the algorithm stops. Therefore, we can associate a chain of partitions $\mathcal{A}_1 \preceq \mathcal{A}_2 \preceq \cdots \preceq \mathcal{A}_{n'} = \{U\}$ with each completed variable-group algorithm run, and the set of partitions $L$ forming this chain is a sublattice of $(\Pi(U), \preceq)$.
Let $(T, h_l, h_u)$ be the multivalued tree (as defined on pp. 46, 48, [13]) on $U$ defined by the clustering output obtained in step 7. Then, by the definition of the clustering algorithm, every partition from $L$ is a subset of $T$ (the latter contains all the clusters involved in the clustering process, but every partition from $L$ contains only some of them). Therefore, the height function $h_l$ is well-defined on all elements of every partition from $L$. Let $h_{\max}$ be an arbitrary fixed number larger than $h_l(U)$. For each number $a$ in the interval $(0, h_{\max})$, denote
$$H_a = \{\mathcal{A} \in L : h_l(X) < a \text{ for each } X \in \mathcal{A}\}.$$
Notice that since $h_l(X) < h_l(Y)$ whenever $X \subsetneq Y$, we have $\max_{A \in \mathcal{A}_i} h_l(A) < \max_{A \in \mathcal{A}_{i+1}} h_l(A)$ for all $i = 1, \ldots, n'-1$, so for every $a \in (0, h_{\max})$, the set $H_a$ with the relation $\preceq$ is a subchain of $L$ consisting of the partitions from $\mathcal{A}_1$ up to $\mathcal{A}_j$ for some $j \in \{1, \ldots, n'\}$.
Hence, for a given value of $a$, there exists the coarsest partition in the set $H_a$ with respect to the relation $\preceq$; denote it by $\mathcal{A}^a_{\max}$. Finally, if there exists $a \in (0, h_{\max})$ such that $\mathcal{A}^a_{\max}$ contains $k$ clusters (recall that $k$ was fixed at the beginning of step 8), then $\mathcal{A}^a_{\max}$ is the output partition. Otherwise, the output partition is not assigned (Figure 1).
Our proposed method is formally presented by Algorithm 1 (based on Sections 1.2, 2, 3.4 [20]), with the main procedure ProposedMet having a subprocedure MCbootstrapDist.
Algorithm 1 Clustering empirical bootstrap distribution functions parametrized by Galton–Watson branching processes
1: Data: $n, R, N, r \in \mathbb{N}$; $n$ vectors $t_i$ from the Cartesian product $\{1, 2, 3\}^N$; $\mathbf{m} \in \mathbb{R}_+^{3 \times 3}$ in the form (1), with the eigenvalues $\lambda_1, \lambda_2$ of $\mathbf{m}_{2\times 2}$ satisfying $\lambda_1 \neq \lambda_2$ and $\lambda_2 \neq 1$
2: Result: a set $\Pi = \{\mathcal{A}_1, \ldots, \mathcal{A}_{n'}\}$ of partitions of $U = \{1, \ldots, n\}$, where $n' \le n$, $\mathcal{A}_1$ is the trivial partition $\{\{x\} : x \in U\}$, $\mathcal{A}_{n'} = \{U\}$, $\mathcal{A}_1 \preceq \cdots \preceq \mathcal{A}_{n'}$, and the relation $\preceq$ is defined in Section 2.1.2, item 8
3:
4: procedure ProposedMet($\{t_1, \ldots, t_n\}$, $\mathbf{m}$, $r$, $R$)        ▹ Proposed method
5:     $c_{\tau_0} = E(W_r(\tau_0))$ is calculated by (6) for given $\mathbf{m}$ and $r$ for both $\tau_0 = 1$
6:     and $\tau_0 = 2$
7:     for all $i \in \{1, \ldots, n\}$ do
8:         $G_i \leftarrow$ MCbootstrapDist($t_i$, $R$, $c_1$, $c_2$, $N$)
9:     end for
10:    $U \leftarrow \{1, \ldots, n\}$
11:    $D_G \leftarrow$ $n$-dimensional square matrix with entry $(i, j)$ defined as
12:    $d_S(G_i, G_j) = \sup_{x \in \mathbb{R}} |G_i(x) - G_j(x)|$
13:    $\Pi \leftarrow$ HierClust($U$, $D_G$)        ▹ [13]
14:    return $\Pi$
15: end procedure
16:
17: procedure MCbootstrapDist($t_i$, $R$, $c_1$, $c_2$, $N$)        ▹ [2]
18:    $\hat{F}_i \leftarrow$ empirical distribution function based on $t_i$ given by (9)
19:    for $l = 1$ to $R$ do
20:        $t^*_{il} \leftarrow$ take a sample of size $N$ from $\hat{F}_i$
21:        $b^*_1 \leftarrow$ the number of ones in $t^*_{il}$
22:        $b^*_2 \leftarrow$ the number of twos in $t^*_{il}$
23:        $\mu^*_{il} \leftarrow c_1 b^*_1/N + c_2 b^*_2/N$
24:    end for
25:    $G_i \leftarrow$ empirical distribution function based on $\mu^*_{i1}, \ldots, \mu^*_{iR}$ given by (10)
26:    return $G_i$
27: end procedure
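As an informal complement to Algorithm 1 (our own sketch, not the authors' published implementation; all object names are assumptions), the R code below mirrors the MCbootstrapDist subprocedure and the supremum distance $d_S$ used to fill $D_G$. Each bootstrap distribution function $G_i$ is represented by its $R$ jump points $\mu^*_{i1}, \ldots, \mu^*_{iR}$; since both distribution functions are step functions, the supremum of $|G_i - G_j|$ is attained on the pooled set of jump points.

# MCbootstrapDist: returns the R bootstrap estimates mu*_il for one population.
# t_i is the vector of N observed types (values in 1, 2, 3); c1, c2 are the coefficients.
mc_bootstrap_dist <- function(t_i, R, c1, c2) {
  N <- length(t_i)
  replicate(R, {
    t_star <- sample(t_i, size = N, replace = TRUE)  # resample from the empirical distribution
    c1 * mean(t_star == 1) + c2 * mean(t_star == 2)  # model parameter estimate mu*_il
  })
}

# Supremum distance between two empirical distribution functions,
# each given by the sample of its jump points.
sup_dist <- function(mu_i, mu_j) {
  x <- sort(unique(c(mu_i, mu_j)))
  max(abs(ecdf(mu_i)(x) - ecdf(mu_j)(x)))
}

Applying mc_bootstrap_dist to each $t_i$ and filling a symmetric matrix with sup_dist values yields $D_G$; per Section 2.1.2, this matrix is then passed to the variable-group single-linkage procedure (the ‘linkage’ function of the R package ‘mdendro’), whose exact call we omit here.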

2.2. Aims

Our hypothesis was that the method outlined in Section 2.1.2 performs well. We sought evidence for this claim through
(1)
…exploring the performance of the proposed method depending on different combinations of the input values;
(2)
…comparing the performance of the proposed method with a random partitioning method, which consists of choosing a partition of the base set U according to the uniform distribution on the set of all k-class partitions of U. We also sought to evaluate the effect size of the proposed method compared to the random partitioning method.

2.3. Data-Generating Mechanisms and Design

We conducted a simulation study with a fully factorial design and with nine factors (inputs). The inputs and their possible values that we used are given in Table 1.
In Table 1, $m$ denotes a label for the matrix $\mathbf{m}$ given by (1): $m = 1$ corresponds to the case where the eigenvalues of the upper-left two-by-two submatrix $\mathbf{m}_{2\times 2}$ of $\mathbf{m}$ satisfy $\lambda_2 < \lambda_1 < 1$, $m = 2$ refers to the case $\lambda_2 < \lambda_1 = 1$, and $m = 3$ to the case $\lambda_2 < 1 < \lambda_1$. These three choices were due to the fact that we assumed $\lambda_1 \neq \lambda_2$. We left the case $\lambda_1 = \lambda_2$ uncovered to keep the number of factors involved in the study smaller and to reduce the computer code execution time. The exact choices of $\mathbf{m}_{2\times 2}$ in the cases $m = 1, 2, 3$ were, respectively,
$$\begin{pmatrix} 0.6 & 0.19 \\ 0.27 & 0.87 \end{pmatrix}, \qquad \begin{pmatrix} 0.6 & 0.2 \\ 0.4 & 0.8 \end{pmatrix}, \qquad \begin{pmatrix} 0.3 & 0.9 \\ 0.4 & 3 \end{pmatrix}.$$
It was sufficient to specify only m 2 × 2 since, as we saw in the end part of Section 2.1.1, the value of the model parameter does not depend on the value of m 33 .
Our pre-study check revealed that the time cost of the simulations with our computer implementation is sensitive to the value of the input n. We found that to complete the study in the planned timeframe, we should limit n to 60. The alternative choice n = 30 served, on the one hand, to be sufficiently different from the first choice 60 but at the same time far enough from very small values of n (near 1), which we deemed not to have enough practical importance. The choice of possible values of N is based on analogous computational reasons.
In the choice of the number of retained clusters, $k$ was constrained by the inequality $k \le n$. To keep the simulation study design simple and to facilitate performance evaluation, we decided to choose the same set of values of $k$ across all values of $n$. Hence, we considered $k \le 30$ and made a subjective decision to consider the three smallest values of $k$ just above the trivial value $k = 1$.
Type-specific prevalence $\mathbf{p} = (p_1, p_2, p_3)$ was sampled from the fixed set $P_3$ defined by
$$P = \{0.001 + 0.002u : u \in \{0, 1, 2, \ldots, 24\}\}, \qquad P_3 = \{(p_1, p_2, p_3) : p_1, p_2 \in P,\ p_1 < p_2,\ p_3 = 1 - p_1 - p_2\}. \qquad (11)$$
The distribution for sampling was the uniform distribution
$$P(\{X = i\}) = \frac{1}{300}, \quad i \in \{1, \ldots, 300\}, \qquad (12)$$
if $dist = 1$, and a truncated geometric distribution (13) if $dist = 2$. The idea behind those choices was to have, in the first case, all prevalence triples from $P_3$ be equally likely, and in the second case, the distribution concentrated on triples where the sum of the first two components (i.e., the prevalence of the first two types of individuals) is relatively smaller. To practically use the truncated geometric distribution, we first mapped the set $P_3$ bijectively to the set of the first 300 positive integers $\{1, \ldots, 300\}$ in the following way: a triple $\mathbf{y} = (y_1, y_2, y_3)$ succeeds a triple $\mathbf{x} = (x_1, x_2, x_3)$ if and only if the sum of the first two components of $\mathbf{y}$ is larger than the sum of the first two components of $\mathbf{x}$, or if the sums of the first two components of $\mathbf{x}$ and $\mathbf{y}$ are equal but the second component of $\mathbf{y}$ is larger than that of $\mathbf{x}$. Suppose that $X$ is a random variable that takes the value $i \in \{1, \ldots, 300\}$ if and only if the $i$-th triple under the order just defined was sampled from the set $P_3$. The second distribution ($dist = 2$) for sampling prevalence was a truncated geometric distribution of the random variable $X$ given by the following probability mass function:
$$P(\{X = i\}) = \frac{q(1-q)^{i-1}}{1 - (1-q)^{300}}, \quad i \in \{1, \ldots, 300\}, \qquad (13)$$
where $q = 1/10 = 0.1$. One can check that the random variable $X$ under the truncated geometric distributional assumption with $q = 1/10$ really has a negative value of skewness. A reason behind the choice of $q = 1/10$ was to ensure that $X$ would have a relatively low expected value. This concentrated the distribution on the part of the domain corresponding to relatively low prevalence values of type 1 and type 2 individuals. Denoting the expected value of $X$ as a function of $q$ by $E(X)(q)$, we have $\lim_{q \to 1^-} E(X)(q) = 1$ and $\lim_{q \to 0^+} E(X)(q) = \frac{301}{2}$, but for our choice the expected value $E(X)(1/10)$ is approximately 10, i.e., the expected value is near the tenth triple in $P_3$.
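For concreteness, a minimal R sketch (ours; the object names are assumptions) of the prevalence sampling step with $dist = 2$: it builds the 300 triples of $P_3$ in the order just described and draws one index from the truncated geometric pmf (13).

P <- 0.001 + 0.002 * (0:24)
P3 <- expand.grid(p1 = P, p2 = P)
P3 <- subset(P3, p1 < p2)
P3$p3 <- 1 - P3$p1 - P3$p2
P3 <- P3[order(P3$p1 + P3$p2, P3$p2), ]          # 300 triples, ordered as in the text

q <- 0.1
i <- 1:300
pmf <- q * (1 - q)^(i - 1) / (1 - (1 - q)^300)   # truncated geometric pmf (13)
idx <- sample(i, size = 1, prob = pmf)           # index of the sampled triple
p_sampled <- as.numeric(P3[idx, c("p1", "p2", "p3")])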
Regarding the remaining three inputs $r$, $R$, and $n_{\mathrm{sim}}$: we had no specific guidance for designing the possible values of the input $r$, so its values were chosen by expert opinion. Fixing the number of bootstrap samples to $R = 1000$ is known to work well (p. 214, [22]). The explanation of how we chose the value of $n_{\mathrm{sim}}$ is given at the end of Section 2.5.1 below (see inequalities (18)–(20)), since the reasoning depends on the performance measure that we have not yet defined.
We define the set of (simulation) scenarios as
$$S = S_1 \times S_2 \times S_3 \times S_4 \times S_5 \times S_6, \qquad (14)$$
where each set S i is defined in Table 1. Therefore, each scenario from S is a tuple ( s 1 , s 2 , s 3 , s 4 , s 5 , s 6 ) of values of respective inputs 1 , 2 , , 6 from Table 1. Inputs 7, 8 and 9 had a constant value, so we did not use them for the purpose of defining the scenario.
The set S contains 216 elements (scenarios). For the convenient presentation of results, we ordered all scenarios in S (lexicographically): a scenario ( y 1 , y 2 , y 3 , y 4 , y 5 , y 6 ) succeeds a scenario ( x 1 , x 2 , x 3 , x 4 , x 5 , x 6 ) if and only if there exists an element i 0 in { 1 , 2 , 3 , 4 , 5 , 6 } such that y i 0 > x i 0 and y i = x i for all i less than i 0 (we consider the usual order of natural numbers in each of the sets S i ).

2.4. Estimands

Our estimand was a partition A = { A 1 , , A k } of n elements of the set U = { 1 , , n } into k clusters. Therefore, all clusters in each partition A are non-empty, every two clusters in A are disjoint, and the union of all clusters in A is U.

2.5. Performance Measures

2.5.1. Average Adjusted Transfer Distance and Its Monte Carlo Error

We wanted to evaluate if our proposed clustering method based on bootstrap empirical distribution functions (10) leads to the same or similar partition of a finite set of individuals U as with clustering using the ground-truth distribution functions of random variables of the form
$$c_1 \frac{B_1}{N} + c_2 \frac{B_2}{N},$$
where c 1 , c 2 have the same meaning as in the definition of the model parameter, and the random vector ( B 1 , B 2 , B 3 ) has a multinomial distribution with the number of trials N and cell probabilities ( p 1 , p 2 , p 3 ) as sampled from (11) under assumptions described in Section 2.3:
$$P(\{B_1 = k_1, B_2 = k_2, B_3 = k_3\}) = \frac{N!}{k_1!\, k_2!\, k_3!}\, p_1^{k_1} p_2^{k_2} p_3^{k_3},$$
where $k_1, k_2, k_3 \ge 0$ and $k_1 + k_2 + k_3 = N$.
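A short R sketch (our illustration; the coefficient values below are arbitrary) of how such a ground-truth distribution function can be evaluated by enumerating the multinomial support:

# Distribution function of c1*B1/N + c2*B2/N for (B1, B2, B3) ~ Multinomial(N, p).
ground_truth_cdf <- function(c1, c2, N, p) {
  supp <- expand.grid(k1 = 0:N, k2 = 0:N)
  supp <- subset(supp, k1 + k2 <= N)
  supp$k3 <- N - supp$k1 - supp$k2
  supp$value <- c1 * supp$k1 / N + c2 * supp$k2 / N
  supp$prob <- apply(supp[, c("k1", "k2", "k3")], 1,
                     function(k) dmultinom(k, size = N, prob = p))
  function(x) sapply(x, function(t) sum(supp$prob[supp$value <= t]))
}

F1 <- ground_truth_cdf(c1 = 2, c2 = 5, N = 100, p = c(0.01, 0.03, 0.96))
F1(0.2)   # P(c1*B1/N + c2*B2/N <= 0.2)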
The comparison of partitions can be performed by computing a distance between partitions.
To measure how well any two partitions of U agree, we used the adjusted transfer distance. The following definition of the adjusted transfer distance was adapted from [17] to fit our need to compare partitions with the same number of clusters. Adjusting refers to the fact that we have divided the usual transfer distance value by the number of elements in the base set.
Let $U$ be a finite set with the number of elements equal to $n \ge 1$. Assume that
$$\mathcal{A} = \{A_1, \ldots, A_k\}, \qquad \mathcal{B} = \{B_1, \ldots, B_k\}$$
are two partitions of $U$ into $k$ clusters such that $\mathcal{A}$ is the underlying ‘true’ partition and the partition $\mathcal{B}$ is used to approximate the partition $\mathcal{A}$.
For a natural number $k$, let $S_k$ be the set of all permutations of $k$ elements, i.e., $S_k$ is the set that contains all bijective functions $\sigma: \{1, \ldots, k\} \to \{1, \ldots, k\}$. We define the adjusted transfer distance $\theta$ between the partitions $\{A_1, \ldots, A_k\}$ and $\{B_1, \ldots, B_k\}$ of $U$ by
$$\theta(\{A_1, \ldots, A_k\}, \{B_1, \ldots, B_k\}) = \min\left\{\frac{n - \sum_{i=1}^{k} |A_i \cap B_{\sigma(i)}|}{n} : \sigma \in S_k\right\}. \qquad (15)$$
In [17], the transfer distance is a function defined for every pair of partitions from the set $\Pi(U)$ of all partitions of $U$. In terms of properties, it is an $n$-invariant, convexly additive metric [16]. Ref. ([23], p. 634) advocates using the transfer distance when the distances under consideration are relatively small or when the number of clusters is less than 5 or 6 (in [23], the adjusted transfer distance is called the “misclassification error distance”). We could follow this recommendation since the number of clusters in our study was always at most 4.
The maximum value of $\theta$ is known: a consequence of Lemma 6 of [17] is that if $n \ge 2k - 1$ (which is satisfied in our simulation study), then $\frac{n - \lceil n/k \rceil}{n}$ is the maximum value of the range of variation of $\theta$ (here $\lceil n/k \rceil$ is the smallest integer not below $n/k$). If $n \ge 2k - 1$, the maximum value of $\theta$ either does not depend on $n$ and equals $\frac{k-1}{k}$ if $k$ divides $n$, or differs from $\frac{k-1}{k}$ by less than $\frac{1}{n}$ if $k$ does not divide $n$. A near-zero value of $\theta$ indicates good performance, and a $\theta$ value near the maximum value of the range of variation is a sign of poor performance.
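For illustration, a small R sketch (ours) of the adjusted transfer distance (15); for the values $k \le 4$ used in the study, the minimum over permutations can be found by brute force on the $k \times k$ contingency table of two cluster-label vectors.

# All permutations of a vector (fine for small k).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v))
    for (rest in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], rest)
  out
}

# Adjusted transfer distance between two k-class partitions of {1, ..., n},
# each encoded as a vector of cluster labels in 1..k.
theta <- function(labels_a, labels_b, k) {
  n <- length(labels_a)
  counts <- table(factor(labels_a, levels = 1:k), factor(labels_b, levels = 1:k))
  best <- max(sapply(perms(1:k), function(sigma) sum(counts[cbind(1:k, sigma)])))
  (n - best) / n
}

theta(c(1, 1, 2, 2, 3, 3), c(1, 2, 2, 3, 3, 3), k = 3)   # 1/3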
Algorithm 2 defines the function Theta that is used to calculate the $\theta$ value in the simulation study. The arguments $\Pi_1, \Pi_2$ of Theta are both sets of partitions of $U$ obtained by hierarchical clustering; $k$ is the argument that specifies the cardinality of the partitions to look for in $\Pi_1$ and $\Pi_2$. If there is no partition with cardinality $k$ in either $\Pi_1$ or $\Pi_2$, then the $\theta$ value is defined as NA (not assigned).
We now describe the performance evaluation in a fixed arbitrary scenario. We performed $n_{\mathrm{sim}} = 144$ simulations. We denote by $n'_{\mathrm{sim}}$ the effective number of simulations, that is, the number of simulations that led to a non-NA adjusted transfer distance value. By $\theta_i$, we denote the value of the adjusted transfer distance in the $i$-th effective simulation. We use as the performance measure the average adjusted transfer distance value defined as
$$\bar{\theta}^* = \frac{1}{n'_{\mathrm{sim}}} \sum_{i=1}^{n'_{\mathrm{sim}}} \theta_i.$$
Algorithm 2 Adjusted transfer distance value calculation
• Data: $n, n_1, n_2, k \in \mathbb{N}$ such that $n_1 \le n$, $n_2 \le n$, $k \le n$; two sets of partitions $\Pi_1 = \{\mathcal{A}_1, \ldots, \mathcal{A}_{n_1}\}$, $\Pi_2 = \{\mathcal{B}_1, \ldots, \mathcal{B}_{n_2}\}$ of $\{1, \ldots, n\}$, where $|\mathcal{A}_i| > |\mathcal{A}_{i+1}|$ for all $i \in \{1, \ldots, n_1 - 1\}$ and $|\mathcal{B}_i| > |\mathcal{B}_{i+1}|$ for all $i \in \{1, \ldots, n_2 - 1\}$
• Result: the adjusted transfer distance value of the $k$-class partitions $\mathcal{A}$ and $\mathcal{B}$ in $\Pi_1$ and $\Pi_2$, respectively, if such partitions exist; otherwise, the adjusted transfer distance value is not assigned (NA)
•
• function Theta($\Pi_1$, $\Pi_2$, $k$)
•     if there exist $\mathcal{A} \in \Pi_1$ and $\mathcal{B} \in \Pi_2$ such that $|\mathcal{A}| = |\mathcal{B}| = k$ then
•         theta $\leftarrow \theta(\mathcal{A}, \mathcal{B})$        ▹ θ defined by (15)
•     else
•         theta $\leftarrow$ NA
•     end if
•     return theta
• end function
The value of $\bar{\theta}^*$ can be seen as a realization of the random variable $\bar{\theta}$, which is obtained by viewing each $\theta_i$ as a random variable $\hat{\theta}_i$. We assumed the random variables $\hat{\theta}_1, \ldots, \hat{\theta}_{n'_{\mathrm{sim}}}$ to be independent and identically distributed. Following (p. 2086, [19]), we estimate the standard deviation of $\bar{\theta}$ using the Monte Carlo standard error of the average adjusted transfer distance estimator $\bar{\theta}$ by
$$\mathrm{SE}^*_{\mathrm{MC}}(\bar{\theta}) = \frac{s}{\sqrt{n'_{\mathrm{sim}}}} = \sqrt{\frac{1}{n'_{\mathrm{sim}}(n'_{\mathrm{sim}} - 1)} \sum_{i=1}^{n'_{\mathrm{sim}}} \left(\theta_i - \bar{\theta}^*\right)^2}, \qquad (17)$$
where
$$s = \sqrt{\frac{1}{n'_{\mathrm{sim}} - 1} \sum_{i=1}^{n'_{\mathrm{sim}}} \left(\theta_i - \bar{\theta}^*\right)^2}$$
is the empirical standard deviation.
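In R, for a vector theta of non-NA adjusted transfer distance values collected in one scenario (an assumed variable name), these two quantities reduce to:

theta_bar <- mean(theta)                    # average adjusted transfer distance
se_mc <- sd(theta) / sqrt(length(theta))    # Monte Carlo standard error (17)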
The basis for determining the number of simulations n sim was setting a certain upper bound b = 0.025 for the half-length of the confidence interval of the mean of θ ¯ .
Let $z_{1-\alpha/2}$ be the $(1-\alpha/2)$-quantile of the standard normal distribution, and take $z_{1-\alpha/2} = 2$ (for this choice, $\alpha < 0.05$). Using the asymptotic normality of $\bar{\theta}$, we obtain
$$z_{1-\alpha/2}\, \frac{s}{\sqrt{n_{\mathrm{sim}}}} \le b, \qquad (18)$$
$$n_{\mathrm{sim}} \ge \left(\frac{s\, z_{1-\alpha/2}}{b}\right)^2. \qquad (19)$$
Pre-study checks revealed that under various simulation scenarios, usual values of the standard deviation
$$\tilde{s} = \sqrt{\frac{1}{n_{\mathrm{obs}} - 1} \sum_{i=1}^{n_{\mathrm{obs}}} \left(\theta_i - \bar{\theta}\right)^2}$$
were in the majority of cases bounded by 0.15 if n obs was around 10.
Thus, our decision was to fix s = 0.15 . Then by (19),
$$n_{\mathrm{sim}} \ge 144. \qquad (20)$$
Using inequality (20) and by trying to minimize the time cost of the study, we set the number of simulations n sim to 144 in each scenario.
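As a quick numerical check of bound (19) with the values fixed above ($s = 0.15$, $z_{1-\alpha/2} = 2$, $b = 0.025$):

s <- 0.15; z <- 2; b <- 0.025
ceiling((s * z / b)^2)   # 144, the number of simulations per scenario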

2.5.2. Effect Size and Its Monte Carlo Error

In Section 2.5.1, we defined the average adjusted transfer distance θ ¯ * as a performance indicator of our proposed method discussed in Section 2.1.2 in comparison with its modification, where bootstrap distributions are replaced with certain transformed multinomial distribution functions. As we pointed out from previous literature, the maximum value of θ for a pair of k-class partitions is known, and that gives us a scale against which to compare the values of θ ¯ * observed in the simulation study.
In the present subsection, we introduce another way of evaluating the performance. This performance measure is the effect size, which is based on the comparison between θ ¯ * obtained under a particular simulation scenario with its reference value ϑ ¯ * (yet to be defined) obtained via random partitioning. Given a set U with n elements, random partitioning means sampling a partition from the set of all k-class partitions of U under the uniform probability distribution assumption:
$$P(\{X = \mathcal{P}\}) = \frac{1}{S(n, k)}, \quad \mathcal{P} \in \Pi_k(U), \qquad (21)$$
where $S(n, k) = |\Pi_k(U)|$ is the Stirling number of the second kind, and $X$ is a random partition.
Since $\Pi_k(U)$ contains, in general, many elements, e.g., $S(60, 4) \approx 5.5 \cdot 10^{34}$, it is essential to have an algorithm that allows us to sample from $\Pi_k(U)$ without previously enumerating the elements of $\Pi_k(U)$.
Our approach, implemented as Algorithm 3, is as follows. Let $Q_k$ be the set $\left\{\mathbf{q} = (q_1, \ldots, q_k) \in \mathbb{N}^k : q_1 \ge \cdots \ge q_k,\ \sum_{i=1}^{k} q_i = n\right\}$, and let $S_n$ be a set containing all permutations (as $n$-tuples) of the form $\mathbf{s} = (\omega_1, \ldots, \omega_n)$ of elements of $U$.
Algorithm 3 Random partitioning (adapted from p. 23, [20])
• Data: $n, k \in \mathbb{N}$ such that $k \le n$
• Result: a $k$-class partition $\{A_1, \ldots, A_k\}$ of $U = \{1, \ldots, n\}$
•
• procedure RandPart($n$, $k$)
•     $Q_k \leftarrow \left\{(q_1, \ldots, q_k) \in \mathbb{N}^k : q_1 \ge \cdots \ge q_k,\ \sum_{i=1}^{k} q_i = n\right\}$
•     $(q^*_1, \ldots, q^*_k) \leftarrow$ sample one $k$-tuple from $Q_k$; the probability of selecting a tuple $\mathbf{q} = (q_1, \ldots, q_k)$ is given by $P(\{\mathbf{q}\} \times S_n)$
•     $U \leftarrow \{1, \ldots, n\}$
•     $S \leftarrow$ set containing all permutations (as $n$-tuples) of elements of $U$
•     $(\omega^*_1, \ldots, \omega^*_n) \leftarrow$ sample one $n$-tuple from $S$ (all outcomes equally likely)
•     define $h(x) = \sum_{s=1}^{x} q^*_s$ for all $x \in \{1, \ldots, k\}$
•     $A_1 \leftarrow \{\omega^*_1, \ldots, \omega^*_{q^*_1}\}$
•     $A_j \leftarrow \{\omega^*_{h(j-1)+1}, \ldots, \omega^*_{h(j)}\}$ for all $j \in \{2, \ldots, k\}$
•     return $\{A_1, \ldots, A_k\}$
• end procedure
We sample a $k$-tuple $\mathbf{q}$ from $Q_k$ and an $n$-tuple $\mathbf{s}$ from $S_n$. We define the sample space to be the Cartesian product $Q_k \times S_n$, and we consider the discrete sigma-algebra on $Q_k \times S_n$. Let $X: Q_k \times S_n \to \Pi_k(U)$ be a random element (that we call a random partition) defined as follows. Take an element $((q_1, \ldots, q_k), (\omega_1, \ldots, \omega_n))$ from the sample space. We read the elements of $(\omega_1, \ldots, \omega_n)$ from left to right and put the leftmost $q_1$ elements into one set, the next $q_2$ elements into another set, …, and the rightmost $q_k$ elements of $(\omega_1, \ldots, \omega_n)$ into yet another set. The obtained sets constitute a partition, that is, the value of $X$ at $((q_1, \ldots, q_k), (\omega_1, \ldots, \omega_n))$. We say that $X(\mathbf{q}, \mathbf{s})$ is a partition with the class structure $\mathbf{q}$. The set of all partitions of $U$ with the class structure $\mathbf{q}$ is denoted by $\Pi_k^{\mathbf{q}}(U)$.
Our next goal is to define a certain probability measure on $Q_k \times S_n$ that would help us to sample from $\Pi_k(U)$ according to the uniform distribution (21). Let $c_{\mathcal{P}}(q_1, \ldots, q_k)$ be the number of elements of $S_n$ that lead to the same partition (under the construction explained above) for a fixed $k$-tuple $\mathbf{q}$ from $Q_k$, i.e., $c_{\mathcal{P}}(\mathbf{q}) = |A_{\mathcal{P}}(\mathbf{q})|$, where $A_{\mathcal{P}}(\mathbf{q}) = \{\mathbf{s} \in S_n : X(\mathbf{q}, \mathbf{s}) = \mathcal{P}\}$ for fixed $\mathcal{P} \in \Pi_k^{\mathbf{q}}(U)$. The value of $c_{\mathcal{P}}(\mathbf{q})$ does not depend on $\mathcal{P}$ for fixed $\mathbf{q}$, so we can drop the index $\mathcal{P}$. To express the formula for $c(\mathbf{q})$, for each $\mathbf{q} \in Q_k$, let $Q_{\mathbf{q}}$ be the set that contains all components of $\mathbf{q}$ without repetitions. We define the multiplicity, denoted by $m(q)$, of each element $q$ of $Q_{\mathbf{q}}$ as the number of its occurrences in $\mathbf{q}$. The value of $X$ is invariant to permuting components of $\mathbf{s}$ corresponding to the same class or permuting classes of components of $\mathbf{s}$ containing the same number of components. Therefore,
$$c(\mathbf{q}) = \prod_{i=1}^{k} q_i! \prod_{q \in Q_{\mathbf{q}}} m(q)!.$$
If the class structure of $\mathcal{P}$ is $\mathbf{q}$, then the event that a randomly selected partition is $\mathcal{P}$ is $\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q})$, or put simply $\{X = \mathcal{P}\}$. For each $\mathbf{q} \in Q_k$, each event of the form $\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q})$ corresponds to one and only one partition $\mathcal{P}$ from $\Pi_k^{\mathbf{q}}(U)$, namely $X(\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}))$; and conversely, each partition $\mathcal{P}$ from $\Pi_k^{\mathbf{q}}(U)$ corresponds to exactly one event $\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) = X^{-1}(\mathcal{P})$ (here $^{-1}$ denotes the inverse image). For each $\mathbf{q} \in Q_k$, we define $B_{\mathbf{q}} = \{\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) : \mathcal{P} \in \Pi_k^{\mathbf{q}}(U)\}$, and $B$ as the disjoint union $B = \bigcup_{\mathbf{q} \in Q_k} B_{\mathbf{q}}$. The bijective correspondences $B_{\mathbf{q}} \ni \{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) \mapsto \mathcal{P} \in \Pi_k^{\mathbf{q}}(U)$ can easily be combined to obtain a bijection between $B$ and $\Pi_k(U)$. By these bijections, the number of elements in $B$ is $S(n, k)$, and for each $\mathbf{q} \in Q_k$, the number of elements in $\{\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) : \mathcal{P} \in \Pi_k^{\mathbf{q}}(U)\}$ is $|\Pi_k^{\mathbf{q}}(U)|$. Hence,
$$S(n, k) = |B| = \sum_{\mathbf{q} \in Q_k} |\{\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) : \mathcal{P} \in \Pi_k^{\mathbf{q}}(U)\}| = \sum_{\mathbf{q} \in Q_k} |\Pi_k^{\mathbf{q}}(U)| = \sum_{\mathbf{q} \in Q_k} \frac{n!}{c(\mathbf{q})}.$$
Therefore,
$$\sum_{\mathbf{q} \in Q_k} \frac{n!}{c(\mathbf{q})\, S(n, k)} = 1,$$
which allows us to define the probability of selecting a $k$-tuple $\mathbf{q}$ from $Q_k$ as follows:
$$P(\{\mathbf{q}\} \times S_n) = \frac{n!}{c(\mathbf{q})\, S(n, k)}.$$
Next, we define the distribution for selecting $\mathbf{s}$ from $S_n$, given the class structure, to be uniform:
$$P(\{\mathbf{q}\} \times \{\mathbf{s}\} \mid \{\mathbf{q}\} \times S_n) = \frac{1}{n!}.$$
Thus, the probability of obtaining a partition, given the class sizes, is
$$P(\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) \mid \{\mathbf{q}\} \times S_n) = \frac{c(\mathbf{q})}{n!}.$$
Therefore, the probability of obtaining a partition is
$$P(\{X = \mathcal{P}\}) = P(\{\mathbf{q}\} \times A_{\mathcal{P}}(\mathbf{q}) \mid \{\mathbf{q}\} \times S_n)\, P(\{\mathbf{q}\} \times S_n) = \frac{1}{S(n, k)},$$
which is the inverse of the number of elements in Π k ( U ) . Therefore, Algorithm 3 implements sampling partitions according to the uniform distribution.
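The following R sketch (our own illustration, independent of the authors' implementation) reproduces the logic of Algorithm 3: it enumerates the class structures $\mathbf{q}$, samples one with probability proportional to $n!/c(\mathbf{q})$, and then cuts a uniformly random permutation of $\{1, \ldots, n\}$ into classes of the sampled sizes.

# All class structures q = (q1 >= ... >= qk) with q1 + ... + qk = n.
class_structures <- function(n, k, max_part = n) {
  if (k == 1) return(if (n <= max_part) list(n) else list())
  out <- list()
  for (q1 in seq(min(n - k + 1, max_part), ceiling(n / k)))
    for (rest in class_structures(n - q1, k - 1, q1))
      out[[length(out) + 1]] <- c(q1, rest)
  out
}

rand_partition <- function(n, k) {
  Q <- class_structures(n, k)
  # weight of q: the number of partitions with class structure q,
  # i.e. n! / (prod(q_i!) * prod over repeated class sizes of multiplicity!)
  logw <- sapply(Q, function(q)
    lfactorial(n) - sum(lfactorial(q)) - sum(lfactorial(table(q))))
  w <- exp(logw - max(logw))                  # unnormalized, numerically stable weights
  q <- Q[[sample(seq_along(Q), 1, prob = w)]]
  omega <- sample(n)                          # uniformly random permutation of 1..n
  split(omega, rep(seq_len(k), times = q))    # classes of sizes q1, ..., qk
}

set.seed(1)
rand_partition(n = 10, k = 3)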
Let $n_{\mathrm{rep}} = 12{,}000$. For each possible value of $n$ and $k$ from Table 1, we complete the following steps. We obtain $2 n_{\mathrm{rep}}$ partitions with Algorithm 3; let those partitions be $\mathcal{A}_1, \ldots, \mathcal{A}_{2 n_{\mathrm{rep}}}$. For each $i \in \{1, \ldots, n_{\mathrm{rep}}\}$, we compute $\vartheta_i = \theta(\mathcal{A}_{2i-1}, \mathcal{A}_{2i})$ by (15). We find the sample average $\bar{\vartheta}^* = \frac{1}{n_{\mathrm{rep}}} \sum_{i=1}^{n_{\mathrm{rep}}} \vartheta_i$ to compare it with $\bar{\theta}^*$ given $n$ and $k$.
We define the estimate $\delta^*$ of the effect size $\delta$ by
$$\delta^* = \bar{\vartheta}^* - \bar{\theta}^*.$$
In the computation of the effect size estimates, we always assume that both $\bar{\vartheta}^*$ and $\bar{\theta}^*$ were determined under the same choice of the pair of numbers $n, k$. From the performance point of view, the larger the (positive) effect size, the better.
The average adjusted transfer distance estimate $\bar{\vartheta}^*$ has its corresponding estimator $\bar{\vartheta} = \frac{1}{n_{\mathrm{rep}}} \sum_{i=1}^{n_{\mathrm{rep}}} \hat{\vartheta}_i$, where $\hat{\vartheta}_1, \ldots, \hat{\vartheta}_{n_{\mathrm{rep}}}$ are independent and identically distributed random variables: $\hat{\vartheta}_i = \theta(X_1, X_2)$, where $X_1, X_2$ are independent random partitions in $\Pi_k(U)$, both with the uniform distribution.
To quantify the variability in the effect size estimation, we find an estimate of the standard error of the effect size estimator $\hat{\delta} = \bar{\vartheta} - \bar{\theta}$. For $n$ and $k$ both fixed, the estimators $\bar{\vartheta}$ and $\bar{\theta}$ are independent since they arise from different random experiments that have no relation to each other. Therefore, the standard error (SE) of $\hat{\delta}$ is
$$\mathrm{SE}(\hat{\delta}) = \sqrt{\mathrm{var}(\hat{\delta})} = \sqrt{\mathrm{var}(\bar{\vartheta}) + \mathrm{var}(\bar{\theta})} = \sqrt{(\mathrm{SE}(\bar{\vartheta}))^2 + (\mathrm{SE}(\bar{\theta}))^2}.$$
The estimate of the Monte Carlo standard error of $\bar{\vartheta}$ is
$$\mathrm{SE}^*_{\mathrm{MC}}(\bar{\vartheta}) = \sqrt{\frac{1}{n_{\mathrm{rep}}(n_{\mathrm{rep}} - 1)} \sum_{i=1}^{n_{\mathrm{rep}}} (\vartheta_i - \bar{\vartheta}^*)^2}. \qquad (22)$$
By replacing the standard errors of ϑ ¯ and θ ¯ by their Monte Carlo estimates, the formula for the Monte Carlo standard error of the effect size estimator becomes
$$\mathrm{SE}^*_{\mathrm{MC}}(\hat{\delta}) = \sqrt{(\mathrm{SE}^*_{\mathrm{MC}}(\bar{\vartheta}))^2 + (\mathrm{SE}^*_{\mathrm{MC}}(\bar{\theta}))^2},$$
where SE MC * ( ϑ ¯ ) , SE MC * ( θ ¯ ) are given by (22) and (17), respectively.
The independence assumption of ϑ ¯ and θ ¯ was practically taken care of by fixing the seed only once at the beginning of the computer program for the whole study (see pp. 2082–2083, [19]).
The reason behind the choice of the value of n rep = 12,000 was to make the standard error of ϑ ¯ (and correspondingly also their estimates) as low as possible while keeping the execution time of the computer program affordable.

2.6. Simulation Algorithm

Algorithm 4 outlines the simulation study process of comparing the partitions obtained with Algorithm 1 with the partitions obtained by considering distribution functions of transformed multinomial random vectors (this is the so-called ground truth) instead of empirical bootstrap distribution functions.
The steps of Algorithm 4 are as follows. The outer for-loop (lines 4 to 30) runs over all possible combinations of inputs 2 to 6. Input 1, the number of clusters, is introduced in the later part of the algorithm. First, the number of bootstrap replications is fixed to 1000 (line 5), the number of simulations in each scenario is fixed to 144 (line 6), and the model parameter coefficients are calculated (line 7). The middle-level for-loop (lines 8 to 29) runs over all simulations within a scenario, and it can be viewed as consisting of three parts.
Algorithm 4 Simulation study
1: Data: set of scenarios $S = S_1 \times S_2 \times S_3 \times S_4 \times S_5 \times S_6$ defined by (14)
2: Result: $n_{\mathrm{sim}} = 144$ adjusted transfer distance values for each scenario in $S$
3:
4: for all $(N, dist, n, m, r) \in S_2 \times S_3 \times S_4 \times S_5 \times S_6$ do
5:     $R \leftarrow 1000$
6:     $n_{\mathrm{sim}} \leftarrow 144$
7:     $c_{\tau_0}$ is calculated by (6) for given $\mathbf{m}$ and $r$ for both $\tau_0 = 1$ and $\tau_0 = 2$
8:     for rep = 1 to $n_{\mathrm{sim}}$ do
9:         for all $i \in \{1, \ldots, n\}$ do
10:            if $dist = 1$ then
11:                $\mathbf{p}_i \leftarrow$ sample prevalence according to (12)
12:            else if $dist = 2$ then
13:                $\mathbf{p}_i \leftarrow$ sample prevalence according to (13)
14:            end if
15:            define $(B_1, B_2, B_3) \sim$ Multinomial($N$; $\mathbf{p}_i$)
16:            $F_i \leftarrow$ distribution function of $c_1 B_1/N + c_2 B_2/N$
17:            $t_i \leftarrow$ $N$ observations of a random variable $T$ such that the
18:            $j$-th coordinate of $\mathbf{p}_i$ is the probability that $T$ takes the value $j$
19:        end for
20:        $U \leftarrow \{1, \ldots, n\}$
21:        $D_F \leftarrow$ $n$-dimensional square matrix with entry $(i, j)$ calculated as
22:        $d_S(F_i, F_j) = \sup_{x \in \mathbb{R}} |F_i(x) - F_j(x)|$
23:        $\Pi_1 \leftarrow$ HierClust($U$, $D_F$)
24:        $\Pi_2 \leftarrow$ ProposedMet($\{t_1, \ldots, t_n\}$, $\mathbf{m}$, $r$, $R$)
25:        for all $k \in S_1$ do
26:            consider $s = (k, N, dist, n, m, r)$
27:            $theta_s^{rep} \leftarrow$ Theta($\Pi_1$, $\Pi_2$, $k$)
28:        end for
29:    end for
30: end for
The first part (lines 9 to 19) is a for-loop that prepares the data for clustering. First, for each $i = 1, \ldots, n$, the prevalence sampling is performed according to the distribution specified by input 3 ($dist$). After that, a multinomial random vector is defined with the number of trials equal to the value of input 2 ($N$) and the cell probabilities defined by the prevalence vector $\mathbf{p}_i$ just sampled. Next, the distribution function $F_i$ of the transformed multinomial random vector is defined (line 16), and a vector $t_i$ of $N$ observations is obtained by sampling from a categorical distribution with cell probabilities $\mathbf{p}_i$.
The second part (lines 20 to 24) performs the clustering. Here U is the set of labels used for clustering, and D F is the distance matrix, where entry ( i , j ) is the supremum distance between F i and F j . The distance matrix D F is then used to find the ground-truth clustering Π 1 . Clustering Π 2 is an approximation to Π 1 and is obtained with Algorithm 1. Both Π 1 and Π 2 are sets of nested partitions. Our task was to compare those partitions, and that was performed in the third part.
The third part (lines 25 to 28) performs the performance evaluation. The performance metric $\theta$ is calculated for each value of the number of clusters $k$ (input 1) based on $\Pi_1$ and $\Pi_2$ obtained under fixed values of all the other inputs 2 to 6. This leads to dependent performance data, but it is computationally cheaper and should allow us to study the effect of $k$ better.

3. Results

We studied the performance of our proposed method, outlined in Section 2.1.2, with the simulation study summarized in Section 2.2, Section 2.3, Section 2.4, Section 2.5 and Section 2.6. Recall from Section 2.5 that we considered two main performance measures: average adjusted transfer distance and the effect size. Before we turn to these performance measures, we present some graphs about the data based on which the performance measures were calculated.
Remark 1.
For the figures below whose horizontal axis is entitled "Scenario", the labels of the top panels (from top to bottom) are as follows: number of retained clusters (k), sample size (N), distribution for sampling prevalence (1—uniform discrete, 2—truncated geometric), number of populations (n), and the label for the expectation matrix (m). On the horizontal axis, all 216 scenarios are grouped into 72 triples of scenarios: each grey vertical column corresponds to three scenarios in which all the other inputs are fixed but the input r takes the values 1, 5, 15. In each grey column, the left column corresponds to r = 1, the middle column to r = 5, and the right column to r = 15. The explanation of how the scenarios were ordered is given at the end of Section 2.3.
Figure 2 shows the empirical scenario-specific distributions of the adjusted transfer distance estimates. The graphs of those 216 distributions are placed vertically in columns and were obtained with Algorithm 4; each distribution is based on $n_{\mathrm{sim}}$ simulations. We see that the distributions of the distance values are not identical across scenarios. The distribution of the distances seems to shift further from 0 as the number of retained clusters (k) increases, but we also notice effects of the size of the sample taken from each population (N), the distribution used for prevalence sampling, and perhaps also the number of populations n. In several scenarios with k = 2, the distribution of the distance is concentrated near 0.
With Algorithm 3 in Section 2.5.2 and the paragraph immediately below it, we explained how we applied so-called random partitioning in our study. The distributions of the adjusted transfer distance values obtained in random partitioning are illustrated in Figure 3. These distributions give us a sense of the typical distance values when the two partitions under comparison are independently obtained from the set of all k-class partitions so that each partition is equally likely. Roughly, if both partitions have two clusters, then typical distance values are 0.5 or just below it, for three clusters, distance values are more concentrated in the range of 0.5 to 0.6 , and in the case of four clusters, we mostly observe distances 0.58 to 0.7 . In all six cases considered, the reference distributions had a mean value between 0.42 and 0.65 , with the mean value increasing both with respect to the number of clusters k and with respect to the number of elements n in the base set used for clustering. Note that the mean values of the six distributions indicated by red dots in Figure 3 were used for calculating effect size estimates.
Recall from the last paragraph of step 8 in Section 2.1.2 that, with our proposed method, clustering into a pre-specified number of clusters is not always possible. When we applied Algorithm 4, for the comparison method based on transformed multinomial distribution functions, we also used the same criterion as explained in Section 2.1.2 for determining whether or not a clustering into the desired number of clusters is possible. The function θ is a function of two variables (partitions). If, for fixed numbers n and k, either of the two partitions into k clusters was not available, then (as explained with Algorithm 2) the value of θ could not be computed, leading to so-called NA values.
Figure 4 depicts the frequencies of NA values of the adjusted transfer distance as they occurred in the process of applying Algorithm 4, depending on the scenario number. Overall, there were 168 NA distance values out of the total of $n_{\mathrm{sim}} \cdot |S| = 144 \cdot 216 = 31{,}104$ simulations performed, so the proportion of NA values was approximately 0.54% among all simulations, and in none of the scenarios was the proportion of NA values higher than $6/144 \approx 4.2\%$. We emphasize that these NA values did not occur because of missing data in the usual sense. The occurrence of NA values was allowed by design; they were valid outcomes of the simulations since we used the variable-group hierarchical clustering method [13] accompanied by the 'multidendrogram cutting' technique for partitioning purposes. Ref. [13] argues that the multidendrogram is unique, but as our results show, when we use their method for partitioning purposes, this uniqueness comes at a price, as we do not always obtain the needed number of clusters in a partition. However, the occurrence of NA values was still relatively low.
Since subsequent computations required us to use numerical data, our decision was to remove all observations with NA-adjusted transfer distance values. Therefore, all results that follow are conditional on the fact that in applying Algorithm 4, the value of the adjusted transfer distance is numerical.
Figure 5a provides a graphical overview of the values of one of the main performance measures that we computed—the scenario-specific average adjusted transfer distance estimates ( θ ¯ * ). The averages in Figure 5a were obtained from the respective distributions (plotted in Figure 2) resulting from the application of Algorithm 4. It is worth recalling from Section 2.5.1 that the lower the average distance, the better the performance. We observe that θ ¯ * values across all scenarios lie between 0.04 and 0.55 . The lowest θ ¯ * values occur in scenarios where discrete uniform distribution, a sample size of 500, and two retained clusters are used; in those scenarios, the average θ was always between 0.04 and 0.10 , which is at most 20% of the maximum possible θ value 0.5 in those scenarios. The highest average θ values occurred in scenarios where truncated geometric distribution and a sample size of 100 were used; then, the average θ ranged between 0.31 and 0.55 and was always more than 58 % of the scenario-specific maximum θ value.
In Figure 5a, the patterns of the effects of each variable are somewhat easier to notice than in Figure 2. Controlling for other inputs, the following variables had the most notable effects. First, the distribution for sampling the prevalence: in all scenarios considered, the performance was worse when the prevalence was sampled from the truncated geometric distribution instead of the uniform distribution. As a reminder, the truncated geometric distribution was concentrated on values corresponding to a low total prevalence of the first two types of individuals in the branching process model. Second, the performance steadily worsened as the number of clusters k increased. Third, a higher number of individuals sampled in each population (N) tended to improve the performance, although for this variable the effect is not always clear-cut. The same observations as for N apply to the number of populations n. Finally, the effects of the branching-process-related variables m and r were hardly noticeable. We continue the study of the effect of each variable below.
Figure 5b depicts the values of the second main performance measure: the effect size estimates computed to compare our method’s performance with so-called random partitioning. For each scenario, we calculated the effect size estimate δ * = ϑ ¯ * − θ ¯ *, obtained by subtracting the scenario’s average in Figure 5a from the mean of the reference distribution in Figure 3 with the same n and k values as in that scenario. In Figure 5b, all effect size estimates are positive, so our method performs better than random partitioning in terms of effect size estimates. A remarkable feature of the results is that the scenarios appear to divide into two clusters in terms of effect size estimates: one cluster includes all scenarios with the sample size N = 100 and the truncated geometric distribution for sampling prevalence ( d i s t = 2 ) (see the blue columns corresponding to scenarios 19–36, 91–108, and 163–180); the other cluster includes all remaining scenarios. For the truncated geometric distribution together with the sample size of 100, all effect sizes were at most 0.19. In all other scenarios, the effect size estimates lie between 0.24 and 0.50, with the highest values occurring in scenarios where the distribution is discrete uniform and the number of retained clusters is three or four. This division of the scenarios into two clusters (the same clusters as in Figure 5b) was also visible in Figure 5a. We interpret it through the lens of combined effects: the performance dropped most remarkably when the smaller sample size N = 100 and the truncated geometric distribution for sampling prevalence were present together; when either of those two factors was present alone, the decline in performance was less noticeable.
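As a concrete illustration of how these two performance measures relate, the sketch below computes per-scenario averages and effect sizes from a hypothetical long-format data frame sims (one row per simulation, with columns scenario, n, k, and theta) and a hypothetical table ref_means holding the reference means of Figure 3 by (n, k). The data frames and column names are ours, not the authors’.

```r
# Sketch with hypothetical column names: scenario-specific average adjusted
# transfer distance (theta_bar) and effect size delta = vartheta_bar - theta_bar.
library(dplyr)

theta_bar <- sims %>%
  filter(!is.na(theta)) %>%                       # drop the NA distances discussed above
  group_by(scenario, n, k) %>%
  summarise(theta_bar = mean(theta), .groups = "drop")

effect_sizes <- theta_bar %>%
  left_join(ref_means, by = c("n", "k")) %>%      # ref_means: columns n, k, vartheta_bar
  mutate(delta = vartheta_bar - theta_bar)        # positive values favour the method
```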
Figure 6 allows a closer look at the sensitivity of the performance to each variable separately. The plot shows, for each variable, the range of observed average adjusted transfer distance values over all combinations of the remaining variables (these values were obtained by summarizing the data shown in Figure 5a). The variables d i s t and N seemed to interact with each other, so we fixed the value of one of them to study the effect of the other (panels ‘dist, N = 100’, ‘dist, N = 500’, ‘N, dist = 1’, ‘N, dist = 2’). Comparing the panels in Figure 6, we see that the effect of d i s t on the range of the average distance is greater when N = 100 than when N = 500. Similarly, the effect of N is larger when d i s t = 2 than when d i s t = 1. The third variable among those with the highest influence on the spread of the average distance is the number of clusters k, but this does not take into account the fact that the so-called baseline distance increased with k, as we saw in Figure 3. To account for this baseline increase in the average distance value, it is useful to study the effect of each variable on the effect size estimates (shown in Figure 5b) instead of the average distance estimates (shown in Figure 5a).
Figure 7 shows the distributions of the range of the effect size estimates, calculated for each variable separately. The main difference compared to Figure 6 is that the distribution for variable k has shifted toward smaller values, which is reflected in the decrease in its mean from 0.180 to 0.061. At the same time, there was a slight increase in the mean range value associated with variable n, from 0.059 to 0.091, but this shift was smaller in absolute size. As an interpretation, the increase in the average distance with respect to k that we saw in Figure 5a when controlling for other variables is at least partially attributable to the difficulty of obtaining partitions that are close to each other when the number of clusters increases.
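The range statistics behind Figures 6 and 7 can be reproduced along the following lines. Here, scenario_means is a hypothetical data frame with one row per scenario, the input columns dist, N, n, m, r, k, and the columns theta_bar and delta from the previous sketch; only the ‘k’ panel is shown, the other panels being analogous.

```r
# Sketch: spread of theta_bar and delta attributable to k, holding all other
# inputs fixed (this mirrors the 'k' panels of Figures 6 and 7).
library(dplyr)

ranges_k <- scenario_means %>%
  group_by(dist, N, n, m, r) %>%                          # 72 combinations, k free
  summarise(range_theta = max(theta_bar) - min(theta_bar),
            range_delta = max(delta) - min(delta),
            .groups = "drop")

mean(ranges_k$range_theta)   # red dot of the 'k' panel in Figure 6
mean(ranges_k$range_delta)   # red dot of the 'k' panel in Figure 7
```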

4. Discussion

Although the bootstrap has been used in cluster analysis in various forms, recent studies have not addressed how the errors made in bootstrap estimation of transformations of the sample mean propagate to hierarchical clustering results. To provide some insight into this question, we investigated the problem for hierarchical single-linkage clustering of empirical bootstrap distribution functions parametrized by three-type Galton–Watson branching processes. To the best of the authors’ knowledge, this is the first work of this kind.
Our method relied on a three-type Galton–Watson branching process model, whose assumptions we discussed in Section 2.1.1. Our proposed method is applicable as long as the model’s assumptions are not violated. In fact, it remains valid in one particular case in which a model assumption is violated: the model enters our method only through the formulas for the expected number of individuals up to the r-th generation over both types, and the validity of these formulas does not require the offspring vectors X i j ( τ ) to be independent within a generation. Therefore, our method is also applicable in the case of sibling interaction effects. However, this is just one aspect of the model, and an overall careful examination is necessary when deciding whether the model is adequate to use.
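To make the dependence on the mean matrix concrete, the sketch below evaluates the standard multitype moment formula that we assume lies behind the statement above: with mean offspring matrix M and an index case of type τ, the expected number of individuals of the selected types accumulated up to generation r is the sum over g = 0, …, r of e_τ′ M^g, summed over the coordinates of interest. The exact parametrization used in the paper is given in Section 2.1.1; the matrix in the example is purely hypothetical.

```r
# Sketch of the assumed moment formula: expected number of individuals of the
# selected types accumulated over generations 0, 1, ..., r, starting from one
# index case of type tau. Only the mean matrix M enters, so within-generation
# independence of the offspring vectors is not needed.
expected_up_to_r <- function(M, tau, r, types = 1:2) {
  e <- numeric(nrow(M)); e[tau] <- 1
  total <- numeric(ncol(M))
  gen <- e
  for (g in 0:r) {
    total <- total + gen                 # accumulate expected counts by type
    gen <- as.vector(gen %*% M)          # expected counts in the next generation
  }
  sum(total[types])
}

# Purely hypothetical 3x3 mean matrix for a three-type process
M <- matrix(c(0.5, 0.3, 0.1,
              0.2, 0.4, 0.2,
              0.0, 0.1, 0.6), nrow = 3, byrow = TRUE)
expected_up_to_r(M, tau = 1, r = 5)
```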
To apply our method in practice, it is necessary to estimate the elements of the mean matrix m . Currently, our method assumes that the mean matrix is the same for all populations. It is possible to modify the method by allowing different mean matrices across populations, but how that would affect the performance is an open question.
The next step of the method was sampling types and estimating the model parameter. The samples of individual types should be random samples so that the nonparametric bootstrap can be applied properly. A sampling technique adequate for obtaining such samples is simple random sampling with replacement. In principle, other sampling techniques and statistical estimation methods could be used, provided they yield the empirical distribution function of the model parameter estimator that is needed in the subsequent clustering step of the method.
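The resampling step can be sketched as follows; this is an illustration, not the authors’ pipeline. The observed sample of individual types is resampled with replacement, each replicate yields a vector of relative frequencies, and a transformation g of that vector, standing in for the branching-process-based parameter estimator, is recorded; the resulting empirical distribution function is the clustering input. Both the data and the transformation g below are placeholders.

```r
# Sketch: nonparametric bootstrap of a sample of individual types and the
# empirical distribution function of the resulting plug-in estimates.
set.seed(2)
N <- 500
types <- sample(1:3, N, replace = TRUE, prob = c(0.2, 0.3, 0.5))  # hypothetical data

g <- function(p) p[1] + p[2]   # placeholder for the branching-process transformation

R <- 1000
boot_estimates <- replicate(R, {
  resample <- sample(types, N, replace = TRUE)     # simple random sampling with replacement
  p_hat <- tabulate(resample, nbins = 3) / N       # bootstrap relative frequencies
  g(p_hat)
})

G <- ecdf(boot_estimates)   # empirical bootstrap distribution function for one population
```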
For clustering purposes, we used the supremum distance to compute the distance between distribution functions. One property of the supremum distance is that it is bounded from above by 1. In some applications, this may cause clustering problems. For example, if the distributions under consideration have supports concentrated in non-intersecting segments of the real line, the supremum distance between each pair of distribution functions will be one, and obtaining a reasonable clustering is rather hopeless. A solution is to use an unbounded distance function, such as the Wasserstein distance (the integral over the real line of the absolute difference between two distribution functions).
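For two empirical distribution functions, the supremum distance can be computed exactly by evaluating both functions at their pooled jump points, since both step functions are constant between jumps. A minimal sketch:

```r
# Sketch: supremum (Kolmogorov) distance between two ECDFs. Evaluating at the
# pooled knots suffices because both step functions are constant between them.
sup_distance <- function(F1, F2) {
  x <- sort(unique(c(knots(F1), knots(F2))))
  max(abs(F1(x) - F2(x)))
}

G1 <- ecdf(rnorm(200))
G2 <- ecdf(rnorm(200, mean = 0.5))
sup_distance(G1, G2)   # always lies in [0, 1]
```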
As for the interpretation of our study’s results, the performance of our proposed method was generally less promising when the prevalence was distributed according to the truncated geometric distribution instead of the uniform distribution. Moreover, the results showed that, controlling for other inputs, the performance was better when the sample size associated with each process was higher. Prevalence distributed uniformly across processes corresponds to a balance between processes with low and high prevalence, whereas prevalence distributed according to the truncated geometric distribution means that there are relatively more processes with low prevalence than with high prevalence. Therefore, the study results suggest that when there is a large proportion of processes with low prevalence, the number of individuals sampled for each process should be increased. Besides the sample size, another variable that can, in principle, be controlled in practice is the number of retained clusters. The results showed that the performance steadily worsened when the number of retained clusters increased from two to four. This suggests that, unless there is an important reason to retain four clusters, two or three clusters should be preferred to obtain better performance.
Type-specific reproduction numbers (labeled by m) and the number of generations (labeled by r) did not affect the overall performance. However, we emphasize that this conclusion does not necessarily mean that these inputs do not affect the resulting partitions. Our performance measure told us how well partitions obtained with the proposed method match the reference partitions. We did not study the composition of the resulting partitions themselves, nor how that composition changed with the inputs, as this was not our focus. These topics can be addressed in future studies.
Bootstrap estimation of the model’s parameter not only accounts for variability in the prevalence estimates but also gives insight into asymptotic results regarding the performance of our proposed method. It is well known that, under some conditions, the bootstrap gives asymptotically correct results in estimating transformations of the sample mean. In addition, our results showed that the performance tended to improve when the sample size increased. In light of this, one could try to prove that our proposed method is asymptotically correct, meaning that the partition obtained with our method approaches the reference partition as the sample size increases. This is left for future work.
The occasional impossibility of calculating the performance measure ( θ ) due to ties did not lead to missing θ data in the ordinary sense: a not assigned (NA) value of θ was a valid outcome of the simulation. We carried out all the analysis and answered all research questions, except those associated with NA values, by assuming that ties did not prevent partitioning into the desired number of clusters. Although we verified that problematic ties occurred with relatively low frequency, it would be desirable to extend the method so that ties cannot occur.
In conclusion, the results showed that our proposed method works quite well in some circumstances, and in most cases, ties were not a problem for applying it. Therefore, the consistency of the nonparametric bootstrap carried over promisingly to hierarchical single-linkage clustering.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math12152409/s1; The R Code; Research data.

Author Contributions

Conceptualization, L.V. and H.M.; methodology, L.V. and H.M.; software, L.V.; validation, L.V.; formal analysis, L.V. and H.M.; investigation, L.V. and H.M.; resources, H.M.; data curation, L.V.; writing—original draft preparation, L.V.; writing—review and editing, L.V. and H.M.; visualization, L.V.; supervision, H.M.; project administration, H.M.; funding acquisition, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially financed by national funds through FCT—Fundação para a Ciência e a Tecnologia under the project UIDB/00006/2020. The APC was funded by Faculdade de Ciências, Universidade de Lisboa.

Data Availability Statement

The original contributions presented in this study are included in the article/Supplementary Materials; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

References

  1. Liu, T.; Yu, H.; Blair, R.H. Stability estimation for unsupervised clustering: A review. Wiley Interdiscip. Rev. Comput. Stat. 2022, 14, e1575. [Google Scholar] [CrossRef] [PubMed]
  2. Efron, B. Bootstrap methods: Another look at the jackknife. Ann. Stat. 1979, 7, 1–26. [Google Scholar] [CrossRef]
  3. Hofmans, J.; Ceulemans, E.; Steinley, D.; Van Mechelen, I. On the added value of bootstrap analysis for k-means clustering. J. Classif. 2015, 32, 268–284. [Google Scholar] [CrossRef]
  4. Peng, Q.; Rao, N.; Zhao, R. Some developments in clustering analysis on stochastic processes. arXiv 2019, arXiv:1908.01794. [Google Scholar]
  5. Mahmoudi, M.R.; Baleanu, D.; Mansor, Z.; Tuan, B.A.; Pho, K.H. Fuzzy clustering method to compare the spread rate of COVID-19 in the high risks countries. Chaos Solitons Fractals 2020, 140, 110230. [Google Scholar] [CrossRef] [PubMed]
  6. La Rocca, M.; Giordano, F.; Perna, C. Clustering nonlinear time series with neural network bootstrap forecast distributions. Int. J. Approx. Reason. 2021, 137, 1–15. [Google Scholar] [CrossRef]
  7. Bulivou, G.; Reddy, K.G.; Khan, M. Stability estimation for unsupervised clustering: A review. IEEE Access 2022, 10, 117925–117943. [Google Scholar] [CrossRef]
  8. Jagers, P. Branching Processes with Biological Applications; Wiley: Hoboken, NJ, USA, 1975; pp. 1–9, 87–91. [Google Scholar]
  9. Bogdanov, A.; Kevei, P.; Szalai, M.; Virok, D. Stochastic modeling of in vitro bactericidal potency. Bull. Math. Biol. 2021, 84, 6. [Google Scholar] [CrossRef]
  10. Taneyhill, D.E.; Dunn, A.M.; Hatcher, M.J. The Galton–Watson branching process as a quantitative tool in parasitology. Parasitol. Today 1999, 15, 159–165. [Google Scholar] [CrossRef] [PubMed]
  11. Kinoshita, R.; Anzai, A.; Jung, S.M.; Linton, N.M.; Miyama, T.; Kobayashi, T.; Hayashi, K.; Suzuki, A.; Yang, Y.; Akhmetzhanov, A.R.; et al. Containment, contact tracing and asymptomatic transmission of novel coronavirus disease (COVID-19): A modelling study. J. Clin. Med. 2020, 9, 3125. [Google Scholar] [CrossRef] [PubMed]
  12. van der Vaart, A.W. Asymptotic statistics. In Cambridge Series in Statistical and Probabilistic Mathematics; Cambridge University Press: Cambridge, UK, 1998. [Google Scholar] [CrossRef]
  13. Fernández, A.; Gómez, S. Solving non-uniqueness in agglomerative hierarchical clustering using multidendrograms. J. Classif. 2008, 25, 43–65. [Google Scholar] [CrossRef]
  14. Rudin, W. Principles of Mathematical Analysis, 3rd ed.; McGraw-Hill: New York, NY, USA, 1976. [Google Scholar]
  15. Lance, G.N.; Williams, W.T. A General theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 1967, 9, 373–380. [Google Scholar] [CrossRef]
  16. Meilă, M. Comparing clusterings: An axiomatic view. In Proceedings of the ICML ’05, 22nd International Conference on Machine Learning, New York, NY, USA, 7–11 August 2005; pp. 577–584. [Google Scholar] [CrossRef]
  17. Charon, I.; Denoeud, L.; Guenoche, A.; Hudry, O. Maximum transfer distance between partitions. J. Classif. 2006, 23, 103–121. [Google Scholar] [CrossRef]
  18. Day, W.H.E. The complexity of computing metric distances between partitions. Math. Soc. Sci. 1981, 1, 269–287. [Google Scholar] [CrossRef]
  19. Morris, T.P.; White, I.R.; Crowther, M.J. Using simulation studies to evaluate statistical methods. Stat. Med. 2019, 38, 2074–2102. [Google Scholar] [CrossRef]
  20. Varmann, L. Hierarchical Clustering Based on a Two-Type Branching Process Model: A Simulation Study. Master’s Thesis, Universidade de Lisboa, Faculdade de Ciências, Lisboa, Portugal, 2022. [Google Scholar]
  21. Wu, C.; Thompson, M.E. Sampling Theory and Practice, 1st ed.; Springer: Cham, Switzerland, 2020. [Google Scholar] [CrossRef]
  22. James, G.; Witten, D.; Hastie, T.; Tibshirani, R.; Taylor, J. Resampling Methods, 1st ed.; Springer: Cham, Switzerland, 2023; pp. 201–228. [Google Scholar] [CrossRef]
  23. Meila, M. Criteria for Comparing Clusterings. In Handbook of Cluster Analysis, 1st ed.; Hennig, C., Meila, M., Murtagh, F., Rocci, R., Eds.; CRC Press: Boca Raton, FL, USA, 2016; pp. 619–635. [Google Scholar]
Figure 1. (Adapted from Fig. 3.1, [20]) There are n populations, each with three types of individuals and their corresponding type-specific prevalence values p i j . From each population, a sample t i of size N of individual types is taken. Based on those samples, the bootstrap is applied, yielding bootstrap samples t i l * . Branching process formulas are used in the step where the bootstrap estimates μ i l * are calculated from the bootstrap samples t i l * . The functions G i are the empirical distribution functions of the bootstrap estimates. These functions are used as input data for variable-group hierarchical clustering. Based on the clustering output, we obtain a partition into k clusters, U 1 = { i n 0 , … , i n 1 } , … , U k = { i n k − 1 + 1 , … , i n k } (here n 0 = 1 < n 1 < ⋯ < n k = n and i 1 , … , i n is a permutation of 1 , … , n ).
Figure 2. (Reproduced from Fig. 4.1, [20]) Frequency distributions of the adjusted transfer distance obtained in the simulation study (Algorithm 4) plotted against simulation scenarios. Note that there are ‘gaps’ in each of the 216 frequency distributions because the adjusted transfer distance has a discrete range of values. All frequency distributions are presented in a binned form: on the vertical axis, the interval from 0 to 0.70 is divided into 60 bins of equal length. The color of each bin indicates the number of simulations where the distance value fell in that bin. For the meanings of the panel labels on top of the chart, see Remark 1.
Figure 3. (Adapted from Fig. 4.3, [20]) Empirical reference frequency distributions of the adjusted transfer distance stratified with respect to the number of retained clusters (k) and the number of populations (n). The value of k changes horizontally along panels, and the value of n changes vertically. Each horizontal axis at the bottom of the figure applies to both panels lying above it. A red dot at the bottom of each frequency diagram pins the location of the expected value ( ϑ ¯ * ) of the corresponding distribution. With four decimal precision, these expected values are (from left to right in the upper row, then from left to right in the bottom row): 0.4280 , 0.5476 , 0.5965 , 0.4492 , 0.5836 , 0.6428 . The highest observed value of the Monte Carlo standard error of ϑ ¯ over all combinations of n and k was bounded from above by 5.2 · 10^−4 .
Figure 4. (Reproduced from Fig. 4.4, [20]) Frequencies of not assigned (NA) adjusted transfer distance values plotted against scenarios. There were no NA values in the scenarios for which the bar is missing from the figure (for example, scenario 64). For the meanings of the panel labels on top of the chart, see Remark 1.
Figure 5. (a) Reproduced from (Fig. 4.5 (a), [20]), (b) adapted from (Fig. 4.5 (b), [20]). (a) Average adjusted transfer distance ( θ ¯ * ) value plotted against simulation scenario. (b) Effect size estimates ( δ * ) plotted against simulation scenarios. The highest observed values over all scenarios of the Monte Carlo standard errors of θ ¯ and δ ^ were both 0.011 (to three decimal places). For the meanings of the panel labels on top of both plot (a) and plot (b), see Remark 1.
Figure 6. (Reproduced from Fig. 4.6, [20]) Empirical frequency diagram of the maximum value minus the minimum observed value of θ ¯ * given that the values of all variables except one are fixed. For example, each data point underlying the panel entitled ‘k’ was found as follows: for each of the 216 / 3 = 72 combinations of values of the inputs dist, N, n, m, r, we calculated the range (maximum minus minimum) of θ ¯ * when k was allowed to vary over all three of its possible values. The other panels with a single label were obtained analogously. An equality in a panel label indicates that the panel was computed under that restriction; e.g., for the panel entitled ‘dist, N = 100’, each data point was found as follows: for each of the 216 / ( 2 · 2 ) = 54 combinations of values of the inputs k, n, m, r, with N fixed at 100, we calculated the range (maximum minus minimum) of θ ¯ * when dist was allowed to vary over both of its possible values. The red dots show the average values of the corresponding empirical distributions. The eight panels are ordered decreasingly in the average value: from the top left corner to the top right corner, and from the bottom left corner to the bottom right corner, the average values are 0.251 , 0.219 , 0.180 , 0.059 , 0.059 , 0.027 , 0.019 , 0.015 .
Figure 7. Empirical frequency diagram of the maximum value minus the minimum observed value of δ * given that the values of all variables except one are fixed. The meaning of the panel labels is the same as in Figure 6. The red dots show the average values of the corresponding empirical distributions. From the top left corner to the top right corner, and from the bottom left corner to the bottom right corner, the average values are: 0.251, 0.219, 0.061, 0.059, 0.091, 0.027, 0.019, 0.015.
Table 1. Inputs for the simulation study and their possible values.

Input  Meaning                                 Notation  Set of Values
1      number of retained clusters             k         S 1 = { 2 , 3 , 4 }
2      sample size                             N         S 2 = { 100 , 500 }
3      distribution for sampling prevalence    dist      S 3 = { 1 , 2 }
4      number of elements to cluster           n         S 4 = { 30 , 60 }
5      labels for vectorized form of m         m         S 5 = { 1 , 2 , 3 }
6      number of generations                   r         S 6 = { 1 , 5 , 15 }
7      set of type-specific prevalence         –         { P 3 }
8      simulations per scenario                n sim     { 144 }
9      number of bootstrap samples             R         { 1000 }