1. Introduction
Even when we receive information under the same conditions, our point of view may greatly differ from others’. Therefore, if we want to analyze expert knowledge, such differences should be considered.
Figure 1 shows a representation of this problem. Our point of view on certain information depends on our cognitive skills and external factors that might change our beliefs.
Expert knowledge elicitation (EKE) has the goal of producing, via elicitation, a probability distribution that represents the expert’s knowledge about a parameter of interest. For that purpose, we can adopt the Delphi method as an elicitation method. The latter is defined by Brown (1968) [1] as a technique based on the results of multiple rounds of questionnaires sent to a panel of experts, whose purpose is to reach a consensus on their opinion. This method is effective, as it allows a group of individuals to address a complex problem, and it could be implemented to obtain a single representation of experts’ beliefs through a probability distribution. However, it proves difficult when the number of experts in the study increases considerably.
Averaging the levels of certainty of the experts using their personal distributions is another way to obtain a prior distribution of expert knowledge. Nevertheless, it could be erroneous, as shown in
Figure 2, where, for instance, we observe two hypothetical experts’ prior distributions (red and black curves) representing their level of knowledge about a proportion, and the mean of the two experts’ levels (green curve), which does not represent the actual level of certainty of either expert. As a result, the entire complex elicitation work done for each expert is wasted. Hence, we believe that the probability distributions of the experts can be classified and their opinions represented using clusters of beliefs. Thus, Bayesian inference can be carried out in parallel by considering each cluster of priors, and a decision can be arrived at via experts’ criteria (Barrera-Causil et al., 2019) [2].
On the other hand, classifying probability distributions is an essential task in different areas. Clustering methods to classify word distribution histograms in information retrieval systems have been implemented successfully [
3]. For instance, Henderson et al. (2015) [
4] present three illustrations, with airline route data, IP traffic data, and synthetic data sets, to classify distributions. Therefore, functional data analysis (FDA) could be used for clustering distributions, since it is an extension of multivariate methods in which observations are represented by curves in a function space [
5]. Some tools of multivariate analysis have been extended pointwise to the functional context, that is, multivariate procedures are applied over the real interval on which these functions are defined. Thus, in many cases, the curves are discretized to implement statistical procedures.
Cluster analysis is one of multiple techniques that have been extended to FDA, and different methods are implemented to obtain partitions of curves within its framework. Some of those methods have been compared to determine their performance and make recommendations in several situations [
6]. For instance, Abraham et al. [
7] proposed a two-stage clustering procedure in which each observation is approximated by a
B-spline in the first stage, and the functions are grouped using the
k-means algorithm in the second stage. James and Sugar (2003) [
8] presented a model-based approach for clustering functional data. Their method was effective when the observations were sparse, irregularly spaced, or occurred at different time points for each subject. Serban and Wasserman (2005) [
9] proposed a technique for nonparametrically estimating and clustering a large number of curves. In their method, the nearly flat curves are removed from the analysis, while the remaining curves are smoothed and finally grouped into clusters.
Other alternatives can also be found in the literature. For instance, Ray and Mallick (2006) [
10] proposed a nonparametric Bayes wavelet model for clustering curves based on a mixture of Dirichlet processes. Song et al. (2007) [
11] presented an FDA-based method to cluster time-dependent gene expression profiles. Chiou et al. (2007) [
12] developed a Functional Clustering (FC) method (i.e.,
k-centres FC) for longitudinal data. Their approach accounts for both the means and the modes of variation differentials between clusters by predicting cluster membership with a reclassification step. Tarpey (2007) [
13] applied the
k-means algorithm for clustering curves under linear transformations of their regression coefficients. More recently, Goia et al. (2010) [
14] used a functional clustering procedure to classify curves representing the maximum daily demand for heating measurements in a district heating system. Hébrail et al. (2010) [
15] proposed an exploratory analysis algorithm for functional data. Their method involves finding
k clusters in a set of functions and representing each cluster with a piecewise constant function, seeking simplicity in the construction of the clusters. Boullé (2012) [
16] presented a novel method to analyze and summarize a collection of curves based on a piecewise constant density estimation where the curves are partitioned into clusters. Furthermore, Secchi et al. (2012) [
17] focused on the problem of clustering functional data indexed by the sites of a spatial finite lattice. Jacques and Preda (2013) [
18] presented a model-based clustering algorithm for multivariate functional data based on multivariate functional principal components analysis. The references in Jacques and Preda (2013) [
19] are of particular importance because they summarize the main contributions in the field of functional data clustering. Other clustering algorithms have been reported in the literature. For instance, Ferreira and Hitchcock (2009) [
6] compared four hierarchical clustering algorithms on functional data: single linkage, complete linkage, average linkage, and Ward’s method (these methods are implemented in the agnes function of R, which takes several arguments: x, a data matrix, data frame, or dissimilarity matrix; metric, the metric used to calculate dissimilarities between observations (the Euclidean distance by default); and method, a character string specifying the clustering method, namely single linkage, complete linkage, average linkage, or Ward’s method). Ferreira and Hitchcock (2009) [6] found that Ward’s method and average linkage outperform their counterparts.
Although there is research on the clustering of functions, no study has considered functional clustering of experts’ beliefs. For example, Stefan et al. (2021) [20] studied the effect of interpersonal variation in elicited prior distributions on Bayesian inference; in their study, one of the six experts exhibited discrepant distributions. It would thus be ideal to have a method able to numerically address discrepancies among clusters of elicited prior distributions. Another important situation arises when a researcher needs to make a decision based on information obtained from elicited priors. In all these cases, and given the problem shown in
Figure 2, differences between priors should be addressed, and the estimation, either posterior or prior, must be done in parallel for each group of elicited priors. In this paper, we therefore propose a new method to deal with multiple elicited prior distributions, which clusters distributions using FDA and the Hellinger distance (Simpson, 1987) [21]. The Hellinger distance quantifies the similarity between two probability distributions and is, we believe, a more appropriate metric for functional data than the metrics currently in use. An illustration of the place of the proposed method within the expert knowledge elicitation workflow is shown in
Figure 3.
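As a concrete illustration, the Hellinger distance between two densities f and g on a common discretization grid is H(f, g) = sqrt(1 − ∫ sqrt(f·g)). The following is a minimal Python sketch of this computation (the paper’s own computations are done in R; the function names here are ours):

```python
import numpy as np

def hellinger(f, g, grid):
    """Hellinger distance between two densities evaluated on a common grid."""
    # H(f, g) = sqrt(1 - B), where B = integral of sqrt(f * g) is the
    # Bhattacharyya coefficient; the integral uses the trapezoidal rule.
    s = np.sqrt(f * g)
    b = np.sum((s[1:] + s[:-1]) / 2 * np.diff(grid))
    return np.sqrt(max(0.0, 1.0 - b))

def dnorm(x, mu, sd):
    """Normal density, used here only to build two example curves."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

grid = np.linspace(-8.0, 8.0, 2001)
h_same = hellinger(dnorm(grid, 0, 1), dnorm(grid, 0, 1), grid)  # identical: ~0
h_diff = hellinger(dnorm(grid, 0, 1), dnorm(grid, 3, 1), grid)  # well separated
```

For two Normal densities with equal variance, the distance grows with the difference in means, approaching 1 as the curves stop overlapping.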
This proposal is motivated by our interest in offering a new tool for the analysis of prior curves from multiple experts when elicitation is used. However, in addition to offering an alternative to the problem posed in
Figure 2, or to the complexity involved in applying the Delphi method with a considerable number of experts, this proposal can be implemented to detect atypical curves, or even to create clusters in fuzzy multicriteria decision-making problems (Kahraman, Onar, and Oztaysi, 2015) [
22].
To test the efficiency of our method, a hierarchical clustering technique for functional data, we compare it, via statistical simulation, with functional
k-means, Ward’s, and average linkage methods (these methods are implemented in
R [
23] through the
kmeans.fd function in the
fda.usc package (Febrero-Bande and Oviedo, 2012) [
24] and the
agnes function in the
cluster package [
25]. The latter function performs agglomerative nesting clustering). To examine the similarity between two partitions, we considered the Rand index, the Fowlkes–Mallows index, the Jaccard coefficient (an index measuring similarity between sample sets), and the correct classification rate (Hubert and Arabie, 1985; Morlini and Zani, 2012) [
26,
27]. Note that reporting the outcomes of all these indices makes patterns in the results visible and is in line with practices that enhance transparency in research (Steegen et al., 2016) [
28]. Finally, the application of our method is illustrated using real data sets.
The paper is structured as follows:
Section 2 introduces some theoretical approaches and details the proposed method for clustering density functions and/or functional data.
Section 3 describes the simulation study and its results.
Section 4 shows an example using real data sets to illustrate the use of the proposed method.
Section 5 presents the main contributions of this paper and suggests topics for further research. Finally,
Section 6 contains the supplementary material and three additional illustrations using different data sets (see
Appendix A).
3. Simulation Study
For simulation purposes, density functions were used as functional data. The theoretical counterpart of these densities was estimated using kernel functions via the density() function in R. To assess the performance of our method, we compared it with the functional k-means, Ward’s, and average linkage methods for functional data, as implemented in the kmeans.fd(), agnes(method="ward"), and agnes(method="average") functions in R, respectively. In the simulation study, we generated overlapping distributions and overlapping clusters of distributions based on the following two definitions:
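In R, the kernel estimates come from density(); the sketch below mirrors that step in Python with a Gaussian kernel and Silverman’s rule-of-thumb bandwidth (the default bw.nrd0 in R). The function name kde is ours:

```python
import numpy as np

def kde(sample, grid, bw=None):
    """Gaussian kernel density estimate on a fixed grid (analogue of R's density())."""
    sample = np.asarray(sample, dtype=float)
    if bw is None:
        # Silverman's rule of thumb (R's bw.nrd0 default):
        # 0.9 * min(sd, IQR / 1.34) * n^(-1/5)
        iqr = np.percentile(sample, 75) - np.percentile(sample, 25)
        bw = 0.9 * min(sample.std(ddof=1), iqr / 1.34) * sample.size ** (-0.2)
    # Sum one Gaussian kernel per observation at every grid point.
    u = (grid[:, None] - sample[None, :]) / bw
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (sample.size * bw * np.sqrt(2 * np.pi))

rng = np.random.default_rng(42)
grid = np.linspace(-5.0, 5.0, 1001)
f_hat = kde(rng.normal(0.0, 1.0, 2000), grid)  # one curve of functional data
```

Each simulated sample thus becomes one density curve discretized on a common grid, which is the form of functional data used throughout this section.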
Definition 1. Overlapping distributions. Let X and Y be random variables with density functions f and g, respectively, both sharing the same support, and define z_γ as the γ-th percentile of a random variable Z. Then, we can say that f and g overlap if one of the following conditions is satisfied:
- (a) x_γ ≤ y_γ ∧ y_γ ≤ x_(1−γ);
- (b) y_γ ≤ x_γ ∧ x_γ ≤ y_(1−γ).
Definition 2. Overlapping clusters of distributions. A cluster is δ-overlapped with another one if at least a proportion δ of its distributions is overlapped with distributions of the other cluster. If two clusters of distributions are not overlapped, they are said to be separated.
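To make Definition 1 concrete, the sketch below checks a percentile-based overlap between two samples. Since the exact inequalities of the definition did not survive typesetting here, the conditions coded below (the γ-th percentile of one variable falling between the γ-th and (1 − γ)-th percentiles of the other, and vice versa) are our assumption of the intended criterion:

```python
import numpy as np

def overlapped(x, y, gamma=5.0):
    """Percentile-based overlap check between two samples.

    ASSUMPTION: conditions (a)/(b) of Definition 1 are taken to mean that the
    gamma-th percentile of one variable lies between the gamma-th and
    (100 - gamma)-th percentiles of the other, or vice versa.
    """
    x_lo, x_hi = np.percentile(x, [gamma, 100.0 - gamma])
    y_lo, y_hi = np.percentile(y, [gamma, 100.0 - gamma])
    return bool(x_lo <= y_lo <= x_hi) or bool(y_lo <= x_lo <= y_hi)

rng = np.random.default_rng(7)
close = overlapped(rng.normal(0, 1, 5000), rng.normal(0.5, 1, 5000))  # True
apart = overlapped(rng.normal(0, 1, 5000), rng.normal(10, 1, 5000))   # False
```

A cluster-level check in the sense of Definition 2 would then count the proportion of pairwise overlapped curves between two clusters and compare it with δ.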
Initially, the clusters generated contained a finite number of curves following a Normal distribution, with pre-specified means and variances for each real group but with a random perturbation of the parameters within clusters. Clusters with a fixed number of curves per cluster were considered. All of the cases above were simulated considering two and three clusters of curves.
We also considered asymmetrical features to compare the behavior of the four methods under different scenarios and thus modified the previously described simulation process (clusters of Normal distributions) using Gamma
and Beta
distributions within clusters. A further major consideration in all cases was the inclusion of an atypical curve in each simulation scenario to evaluate the performance of the method in the presence of atypical curves. To study the effectiveness of the method under evaluation, we considered separated and δ-overlapped clusters of distributions with a known value of δ. The average overlapping rate in each scenario is presented in
Table 2.
In all the simulation scenarios, the existence of clusters of curves was ensured by testing the equality of their means.
Figure 4 shows one of the different simulation scenarios used to compare the four methods. These scenarios were considered for two and three clusters, with an equal number of curves per cluster. In each simulation scenario, a total of 1000 replicates were run. To validate the clusters found in each replicate, we used the routines in the
clv package [
31] in
R. The performance of the four methods compared here was assessed using the Rand index, the Fowlkes–Mallows index (F.M), the Jaccard coefficient, and the correct classification rate. These measures are defined by the equations below.
Given a set S of n elements and two partitions of S to be compared, X, a partition of S into r subsets, and Y, a partition of S into s subsets, define the following:
a: the number of pairs of elements in S that are in the same set in X and in the same set in Y.
b: the number of pairs of elements in S where both elements belong to different clusters in both partitions.
c: the number of pairs of elements in S where both elements belong to the same cluster in partition X but not in partition Y.
d: the number of pairs of elements in S that are in different sets in X and in the same set in Y.
Note that a + b + c + d = n(n − 1)/2, the total number of pairs of elements in S.
The Rand index (R), the Jaccard coefficient (J), the Fowlkes–Mallows index (F.M), and the correct classification rate (CCR) are thus calculated as follows:
R = (a + b)/(a + b + c + d), J = a/(a + c + d), F.M = a/√((a + c)(a + d)),
and the CCR is the proportion of curves assigned to their correct cluster.
All these validation measures have a value between 0 and 1, where 0 indicates that no pair of points is shared by the two data clusters, and 1 means that the data clusters are exactly the same.
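The pair-counting indices above can be sketched in Python as follows (the function names are ours; the CCR, which requires matching cluster labels across partitions, is omitted from the sketch):

```python
from itertools import combinations
from math import sqrt

def pair_counts(px, py):
    """Count pairs: a (same/same), b (diff/diff), c (same in X only), d (same in Y only)."""
    a = b = c = d = 0
    for i, j in combinations(range(len(px)), 2):
        same_x, same_y = px[i] == px[j], py[i] == py[j]
        if same_x and same_y:
            a += 1
        elif not same_x and not same_y:
            b += 1
        elif same_x:
            c += 1
        else:
            d += 1
    return a, b, c, d

def rand_jaccard_fm(px, py):
    """Rand index, Jaccard coefficient, and Fowlkes-Mallows index of two partitions."""
    a, b, c, d = pair_counts(px, py)
    rand = (a + b) / (a + b + c + d)
    jaccard = a / (a + c + d)
    fm = a / sqrt((a + c) * (a + d))
    return rand, jaccard, fm
```

For two identical label vectors, all three indices equal 1, in line with the interpretation given above.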
Results
The main results of this simulation study are shown in
Figure 5 and
Figure 6 (R codes and data associated with this article can be found at
https://figshare.com/projects/An_FDA-based_approach_for_clustering_expert_knowledge/73437 (accessed on 4 March 2021)). In general, and regardless of the performance index we used, our proposed method performs better than
kmeans.fd, Ward’s, and average linkage methods in terms of recreating the cluster structure in the data (
Figure 5). Therefore, the clustering method based on the Hellinger distance proposed herein is a better choice for classifying density curves (or, for that matter, functional data).
The assertion above is justified from two perspectives. First, when clusters are separated or overlapped, or when an atypical curve is present in data simulated from Normal or Gamma populations, there is evidence that our method performs better than the others. Second, when two clusters from a Gamma population are generated, the kmeans.fd and Ward’s methods, as well as our proposed method, performed equally well. However, in any other scenario, our method outperforms the others.
Although the
kmeans.fd method and our proposal perform equally well in almost all scenarios, our method outperforms the others when three clusters of Gamma
curves are considered regardless of their degree of overlap (
Figure 5). When compared with Ward’s method, ours has a slightly lower performance when
curves are generated for two clusters of curves obtained from a Gamma
distribution. A similar behavior is observed when two clusters of overlapped Beta
curves are generated, or when
curves from two clusters of a Beta
distribution with an atypical curve are generated (
Figure 6). In the former case, our proposal exhibits a slightly lower performance than the other alternatives. However, in general, our method and the
kmeans.fd algorithm performed equally well. In the latter case, Ward’s method performs slightly better than ours.
Figure 6 illustrates the performance measures of the four methods compared in this study to obtain optimal partitions. Overall, the competing methods exhibit poor performance compared to that of our proposal. This result seems to be a promising topic for further exploration, especially for clusters generated from a Beta distribution, which is very flexible and can take different shapes. The closed support of a Beta distribution concentrates the curves, producing more overlap between clusters and making it more challenging for every method to detect the true differences among these curves. Thus, our proposed method exhibits the best performance compared to the other options considered here.
5. Discussion
In this paper, we proposed a simple method to segment expert knowledge that can also be applied effectively to functional data or to problems with large volumes of data (data mining). To implement this method, we built, after a discretization process, a distance matrix between curves using the Hellinger distance.
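The whole pipeline (discretize each elicited density on a common grid, build the Hellinger distance matrix, and cut an agglomerative tree) can be sketched as follows. This is a hedged Python/SciPy approximation with illustrative data, not the paper’s R implementation:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def hellinger_matrix(curves, grid):
    """Pairwise Hellinger distances between densities discretized on a common grid."""
    n = len(curves)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            s = np.sqrt(curves[i] * curves[j])
            bc = np.sum((s[1:] + s[:-1]) / 2 * np.diff(grid))  # trapezoidal rule
            dist[i, j] = dist[j, i] = np.sqrt(max(0.0, 1.0 - bc))
    return dist

def dnorm(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Two hypothetical clusters of elicited priors: Normal densities whose means are
# randomly perturbed within each cluster, mimicking the simulation setup above.
rng = np.random.default_rng(0)
grid = np.linspace(-10.0, 10.0, 1001)
curves = [dnorm(grid, 0.0 + rng.normal(0, 0.3), 1.0) for _ in range(5)] + \
         [dnorm(grid, 4.0 + rng.normal(0, 0.3), 1.0) for _ in range(5)]

tree = linkage(squareform(hellinger_matrix(curves, grid)), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")  # two groups of experts
```

Each elicited prior then carries a cluster label, and Bayesian inference can proceed in parallel within each cluster, as proposed in the Introduction.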
A simulation study considering different scenarios was presented. In that study, our proposed method performed better in almost all scenarios than the k-means and agglomerative nesting clustering algorithms for functional data (as implemented in the kmeans.fd and agnes functions in R, respectively). Therefore, it proved to be a useful tool for performing a cluster analysis of distributions elicited from experts’ personal beliefs. Based on the computers’ lifetime example, we conclude that the proposed clustering method can be used to segment expert knowledge. Furthermore, it can identify the expert with the highest level of expertise, which is very important when analyzing experts’ beliefs considering their different points of view.
Along the same lines of this proposal, further research topics in this field include sensitivity analyses when different initial values are considered. Another interesting area, based on the results of the analyses above, is developing a robust clustering method to generate data partitions, namely a clustering method that performs well in different conditions. Additionally, other distributions (e.g., Poisson), samples, and cluster sizes should be considered in future simulation studies to further assess the performance of our method.
Finally, we believe our proposed method has implications for two specific areas: the combination of expert knowledge and machine learning (ML), and Bayesian inference. Supervised (e.g., classification) and unsupervised (e.g., clustering) ML methods have gained momentum in current research. For example, it has been shown that ML-based classification ensembles perform better than experts in segregating viable from non-viable embryos [
35] and that (evolutionary) clustering algorithms enable characterizing COVID-19-related textual tweets in order to assist institutions in making decisions [
36]. In general, researchers pitch ML-based analyses against human expert knowledge because they see no value in it or because they do not know how to integrate it into the analysis pipeline. Evidence, however, suggests that blending expert knowledge with ML analyses leads to better predictions [
37,
38,
39,
40]. We believe our proposal can contribute to this rather under-investigated area.
A challenge in Bayesian inference is to determine the effect of different prior distributions on a parameter to be estimated. In this regard, Stefan et al. (2021) [20] found that the variability in prior distributions from six experts did not affect the qualitative conclusions associated with estimated Bayes factors. Their results thus suggest that although there can be quantitative differences between the elicited prior distributions, the overall qualification associated with those quantities is not altered. We believe, though, that if the elicited distributions show considerable fluctuation in terms of location, scale, and shape, those distributions need to be subjected to clustering in order to segregate levels of expertise (see
Figure 2,
Figure 7 and
Figure 9). The method proposed herein can serve for such purpose.