Article

Fuzzy Clustering Methods with Rényi Relative Entropy and Cluster Size

1 Department of Statistics and Operations Research, Universidad Complutense de Madrid, 28040 Madrid, Spain
2 Comisión Nacional del Mercado de Valores, 28006 Madrid, Spain
* Author to whom correspondence should be addressed.
Mathematics 2021, 9(12), 1423; https://doi.org/10.3390/math9121423
Submission received: 14 May 2021 / Revised: 12 June 2021 / Accepted: 15 June 2021 / Published: 18 June 2021
(This article belongs to the Section Engineering Mathematics)

Abstract

Over the last two decades, information entropy measures have been widely applied in fuzzy clustering problems in order to regularize solutions by avoiding the formation of partitions with excessively overlapping clusters. Following this idea, relative entropy or divergence measures have been similarly applied, particularly to enable this kind of entropy-based regularization to also take into account, and interact with, cluster size variables. Since Rényi divergence generalizes several other divergence measures, its application in fuzzy clustering seems promising for devising more general and potentially more effective methods. However, previous works making use of either Rényi entropy or divergence in fuzzy clustering have, respectively, not considered cluster sizes (thus applying regularization in terms of entropy, not divergence) or employed divergence without a regularization purpose. The main contribution of this work is therefore the introduction of a new regularization term based on the Rényi relative entropy between membership degrees and observations ratios per cluster, which penalizes overlapping solutions in fuzzy clustering analysis. Specifically, this Rényi divergence-based term is added to the variance-based Fuzzy C-means objective function when cluster sizes are allowed. This leads to the development of two new fuzzy clustering methods exhibiting Rényi divergence-based regularization, the second one extending the first by considering a Gaussian kernel metric instead of the Euclidean distance. Iterative expressions for these methods are derived through the explicit application of Lagrange multipliers. An interesting feature of these expressions is that the proposed methods seem to take advantage of a greater amount of information in the updating steps for membership degrees and observations ratios per cluster. Finally, an extensive computational study is presented, showing the feasibility and comparatively good performance of the proposed methods.

1. Introduction

Clustering analysis [1,2,3,4], as a part of the multivariate analysis of data, is a set of unsupervised methods whose objective is to group objects or observations taking into account two main requirements: (1) objects in a group or cluster should be similar or homogeneous; and (2) objects from different groups should be different or heterogeneous.
Historically, clustering methods have been divided into two classes according to the output they provide: A first class is that of hierarchical methods [5,6], which proceed by either agglomerating or splitting previous clusters, respectively, ending or starting the clustering process with all objects grouped in a single cluster. Thus, the main feature of hierarchical methods is that they provide a collection of interrelated partitions of the data, each partition or set of clusters being composed of a different number of clusters between 1 and N, where N is the number of objects or observations being clustered. This leaves the user with the task of choosing a single partition exhibiting an adequate number of clusters for the problem being considered. The second class of clustering methods is that of non-hierarchical procedures, which provide a single partition with a number of clusters either determined by the algorithm or prespecified by the user as an input to the method. Several general clustering strategies have been proposed within this non-hierarchical class, such as prototype-based [7,8,9,10,11,12], density-based [13,14], graph-based [15,16,17,18], or grid-based [19,20] methods. Prototype-based methods typically assume the number K of clusters is provided by the user, randomly construct an initial representative or prototype for each of the K clusters, and proceed by modifying these prototypes in order to optimize an objective function, which usually measures intra-cluster variance in terms of a metric such as the Euclidean distance.
The most widespread instance of a prototype-based method is the K-means algorithm, first proposed by Forgy [7] and further developed by MacQueen [8]. As for most prototype-based methods, the basis of the K-means algorithm is a two-step iterative procedure, by which each object is assigned to the cluster associated with the closest prototype (i.e., from a geometric point of view, clusters can be seen as the Voronoi cells of the given prototypes, see [21,22]), and prototypes are then recalculated from the observations that belong to each corresponding cluster. The K-means specific prototypes are referred to as cluster centroids and are obtained as the mean vectors of observations. The method can thus be seen as a heuristic procedure trying to minimize an objective function given by the sum of squared Euclidean distances between the observations and the respective centroids. As shown in [23], the K-means algorithm has superlinear convergence, and it can be described as a gradient descent algorithm.
Several works (e.g., [24,25,26,27]) discuss the relevance of the K-means algorithm and its properties, problems, and modifications. Particularly, its main positive features are as follows [25,27,28]:
  • Its clustering strategy is not difficult to understand for non-experts, and it can be easily implemented with a low computational cost in a modular fashion that allows modifying specific parts of the method.
  • It presents a linear complexity, in contrast with hierarchical clustering methods, which present at least quadratic complexity.
  • It is invariant under different dataset orderings, and it always converges at quadratic rate.
On the other hand, the K-means algorithm also presents some known drawbacks:
  • It assumes some a priori knowledge of the data, since the parameter K that determines the number of clusters to build has to be specified before the algorithm starts to work.
  • It converges to local minima of the error function, not necessarily finding the global minimum [29], and thus several initializations may be needed in order to obtain a good solution.
  • The method of selection of the initial seeds influences the convergence to local minima of the error function, as well as the final clusters. Several initialization methods have been proposed (see [25] for a review on this topic), such as random selection of initial points [8] or the K-means++ method [30], which selects the first seed randomly and then chooses each remaining seed with a probability proportional to its squared distance to the nearest already selected seed.
  • Although it is the most widespread variant, the usage of the Euclidean metric in the objective function of the method presents two relevant problems: sensitivity to outliers and a certain bias towards producing hyperspherical clusters. Consequently, the method has difficulties in the presence of ellipsoidal clusters and/or noisy observations. Usage of other non-Euclidean metrics can alleviate these handicaps [24].
The introduction of the notions and tools of fuzzy set theory [31] in cluster analysis was aimed at relaxing the constraints imposing that each observation has to completely belong to exactly one cluster. These constraints lead to considering all observations belonging to a cluster as equally associated with it, independently of their distance to the corresponding centroid or the possibility of lying in a boundary region between two or more clusters. However, it may seem natural to assume that observations which lie further from a cluster centroid have a weaker association to that cluster than nearer observations [24]. Similarly, detecting those observations that could be feasibly associated to more than one cluster can be a highly relevant issue in different applications. In this sense, the fuzzy approach enables observations to be partially associated to more than one cluster by modeling such association through [0,1]-valued membership degrees, rather than by a binary, 0 or 1 index. These ideas led to the proposal of the Fuzzy C-means method (FCM, initially introduced in [32] and further developed in [33]), which extended the K-means method by allowing partial or fuzzy membership of observations into clusters and triggered research in the field of fuzzy clustering.
Later, the idea of applying a regularization approach to some ill-defined problems in fuzzy clustering led to the consideration of entropy measures as a kind of penalty function, since a large entropy value indicates a high level of randomness in the assignment of observations to clusters or, similarly, a high level of overlapping between clusters (see [34,35,36,37] for an ongoing discussion of the notion of overlap in fuzzy classification). This idea was first implemented in [38] by modifying the FCM to minimize Shannon's entropy [39]. Similarly, entropy measures are also applied to deal with the difficulties found by the FCM on datasets with non-balanced cluster sizes, where misclassifications arise when observations from large clusters are assigned to small clusters or when large clusters are split into two clusters, as shown for instance in [21]. Indeed, by introducing new variables to represent the ratios of observations per cluster, these problems can be addressed through the use of relative entropy measures or divergences between the membership functions and such ratios [21,40,41], for example the Kullback–Leibler [42] and Tsallis [43] divergences, respectively, applied in the methods proposed in [44,45].
The aim of this paper is then to follow and extend the ideas of those previously mentioned works by studying the application of Rényi divergence [46] in the FCM objective function as a penalization term with regularization functionality. In this sense, a relevant feature of Rényi's entropy and divergence is that, due to their parametric formulation, they are known to generalize other entropy and divergence measures; in particular, they respectively extend Shannon's entropy and the Kullback–Leibler divergence. Furthermore, the kind of generalization provided by Rényi entropy and divergence is different from that provided by Tsallis entropy and divergence. This makes their application in the context of fuzzy clustering interesting, as more general clustering methods can be devised from them. Moreover, as discussed in Section 3, the iterative method derived from the proposed application of Rényi divergence seems to make a more adaptive usage of the available information than previous divergence-based methods, such as the previously referenced one [45] based on Tsallis divergence.
Indeed, as far as we know, two previous proposals apply either Rényi entropy or divergence to define new clustering methods. The first one is found in [47], where Rényi divergence is taken as a dissimilarity metric that is used to substitute the usual variance-based term of the FCM objective function, therefore without a regularization aim. The second proposal [48] does employ Rényi entropy as a regularization term in the context of fuzzy clustering of time data arrays, but it does not consider observations ratios per cluster, and thus does not apply Rényi divergence either.
Therefore, the main contribution of this work consists of the proposal of a fuzzy clustering method using Rényi divergence between membership degrees and observations ratios per cluster as a regularization term, thus adding it to the usual variance term of FCM objective function when considering cluster sizes. Looking for some more flexibility in the formation of clusters, a second method is then proposed that extends the first one by introducing a Gaussian kernel metric. Moreover, an extensive computational study is also carried out to analyze the performance of the proposed methods in relation with that of several standard fuzzy clustering methods and other proposals making use of divergence-based regularization terms.
This paper is organized as follows. Section 2 introduces the notation to be employed and recalls some preliminaries on fuzzy clustering, entropy and divergence measures, and kernel metrics. It also reviews some fuzzy clustering methods employing divergence-based regularization. Section 3 presents the proposed method and its kernel-based extension. Section 4 describes the computational study analyzing the performance of the proposed methods and comparing it with that of a set of reference methods. Finally, some conclusions are presented in Section 5.

2. Preliminaries

This section is devoted to recalling some basics about fuzzy clustering, entropy measures, and kernel functions needed for the understanding of the proposed methods. In this way, Section 2.1 introduces notations and reviews some of the most extended prototype-based clustering methods. Section 2.2 recalls general notions on entropy and divergence measures and describes some clustering methods making use of divergence measures. Finally, Section 2.3 provides the needed basics on kernel metrics.

2.1. Some Basics on Clustering Analysis

Let X be a set of N observations or data points with n characteristics or variables at each observation. We assume X is structured as an N × n data matrix, such that its ith row $X_i = (x_{i1},\dots,x_{in})^T \in \mathbb{R}^n$ denotes the ith observation, i = 1,…,N. It is also possible to assume that observations are normalized in each variable (e.g., by the min-max method) in order to avoid scale differences, so we can consider the data to be contained in a unit hypercube of dimension n.
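For illustration, the min-max normalization just mentioned can be performed as in the following sketch (NumPy code written for this presentation; the array name X and the function are ours, not from the paper):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each variable (column) of the N x n data matrix X to the unit interval."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X - col_min) / span
```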
The centroid-based clustering methods we deal with are iterative processes that classify or partition the data X into K clusters $\{C_k^t\}_{k=1}^K$ in several steps t = 1, 2, …, with $t \in T$, where T is the set of iterations or steps. At each iteration t, the partition of X in the K clusters is expressed through an N × K matrix $M^t = [\mu_{ik}^t]_{(N,K)}$, such that $\mu_{ik}^t$ denotes the membership degree of observation $X_i$ into cluster $C_k^t$. Notation M is used to represent a generic N × K partition matrix. In a crisp setting, only allowing for either null or total association between observations and clusters, it is $\mu_{ik}^t \in \{0,1\}$, while in a fuzzy setting it is $\mu_{ik}^t \in [0,1]$ in order to allow partial association degrees. In either case, the constraints
\sum_{k=1}^{K} \mu_{ik}^t = 1 \quad \text{for any } i = 1,\dots,N \text{ and } t \in T \qquad (1)
(Ruspini condition [49]), and
0 < \sum_{i=1}^{N} \mu_{ik}^t < N \quad \text{for any } k = 1,\dots,K \text{ and } t \in T, \qquad (2)
are imposed to guarantee the formation of data partitions with non-empty clusters. Let us remark that fuzzy methods with $\mu_{ik}^t \in [0,1]$ do not actually need to impose Equation (2), since in practice the iterative expressions for the membership degrees obtained by only using Equation (1) as constraint already guarantee that $\mu_{ik}^t > 0$ for all k and i.
Similarly, the centroid $V_k^t = (v_{k1}^t,\dots,v_{kn}^t)^T$ of the kth cluster at iteration t is obtained for each method through a certain weighted average of the observations (using membership degrees to cluster $C_k^t$ as weights), and the K × n matrix with the n coordinates of all K centroids at a given iteration t is denoted by $V^t$. We also employ notation V to denote a generic K × n centroid matrix. Initial centroids or seeds $\{V_k^0\}_{k=1}^K$ are needed to provide the starting point for the iterative process, and many initialization methods are available to this aim [25].
Iterative prototype-based methods can be seen as heuristic procedures intended to minimize a certain objective function $J(M,V)$, which is usually set to represent a kind of within-cluster variance [8] through squared distances between observations and centroids. Thus, although the aim of any clustering procedure is to find the global minimum of such an objective function, iterative heuristic methods can only guarantee convergence to a local minimum [29], which is usually dependent on the selected initial centroids or seeds. Only the Euclidean distance or metric $d(X_i, X_j) = [(X_i - X_j)^T (X_i - X_j)]^{1/2}$ for any $X_i, X_j \in \mathbb{R}^n$ is employed in this paper (although we also implicitly consider other metrics through the usage of kernel functions, see Section 2.3). For simplicity, the distance between an observation $X_i$ and a centroid $V_k^t$ is denoted by $d_{ik}^t = d(X_i, V_k^t)$ and its square by $(d_{ik}^t)^2 = d(X_i, V_k^t)^2$.
The iterative process of the methods discussed in this paper typically considers a number of steps at each iteration, which at least include: (i) updating the centroids from the membership degrees obtained in the previous iteration; and (ii) updating the membership degree of each observation to each cluster taking into account the distance from the observation to the corresponding (updated) cluster centroid. In this regard, without loss of generality (it is a matter of convention where an iteration starts), we always assume that, at each iteration, centroids are updated first, and thus that the initial degrees $\mu_{ik}^0$ for all i and k also have to be obtained from the initial seeds $\{V_k^0\}_{k=1}^K$.
Different stopping criteria can be used in order to detect convergence of the iterative process or to avoid it extending for too long: a maximum number of iterations $t_{max}$ can be specified, and a tolerance parameter $\varepsilon > 0$ can be used so that the method stops when the difference between centroids of iterations t and t + 1 is below it, $\max_k |V_k^{(t+1)} - V_k^t| < \varepsilon$, or similarly when the difference between membership functions of iterations t and t + 1 is small enough, $\max_{i,k} |\mu_{ik}^{(t+1)} - \mu_{ik}^t| < \varepsilon$.
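Both stopping rules can be checked jointly, as in this small illustrative snippet (names are ours; V_old, V_new, mu_old and mu_new denote centroids and membership matrices of two consecutive iterations):

```python
import numpy as np

def has_converged(V_old, V_new, mu_old, mu_new, eps=1e-5):
    """Stop when either the centroids or the membership degrees change less than eps."""
    return (np.max(np.abs(V_new - V_old)) < eps or
            np.max(np.abs(mu_new - mu_old)) < eps)
```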
Finally, as mentioned above, some difficulties may arise when applying the FCM method on non-balanced datasets, i.e., when clusters present quite different sizes. Such difficulties can be addressed by expanding the optimization problem through the introduction of a set of new variables $\phi_k^t \in [0,1]$, k = 1,…,K, called observations ratios per cluster [21,40], as the constraint
\sum_{k=1}^{K} \phi_k^t = 1 \quad \text{for any } t \in T \qquad (3)
is also imposed. These variables allow weighting the contribution of each cluster to the objective function as inversely proportional to the cluster's size, facilitating in practice the formation of clusters with different sizes. Notation Φ is used to represent a generic vector $(\phi_1,\dots,\phi_K)$ of observations ratios. In those methods making use of observations ratios, we assume that these are updated at the end of each iteration, from updated centroids and membership degrees. This entails that initial ratios $\phi_k^0$, k = 1,…,K, have to be somehow obtained from the initial centroids and membership degrees. Moreover, the iterative formulae for membership degrees and observations ratios provided by some methods show an interdependence between both kinds of variables. As a consequence, specific expressions to initialize both membership degrees and observations ratios for t = 0 have to be provided in these cases. In this sense, for the computational experiments described in Section 4, we applied Equations (44) and (45) to this aim for all methods requiring such specific initialization.

2.1.1. K-Means

In terms of the notation just introduced and assuming a fixed number of clusters K, the K-means method consists in a two-step iterative process that tries to find the partition matrix M and the centroids V that solve the optimization problem
\min_{M,V} J(M,V) = \min_{M,V} \sum_{k=1}^{K} \sum_{i=1}^{N} \mu_{ik}^t (d_{ik}^t)^2, \quad t \in T \qquad (4)
subject to $\mu_{ik}^t \in \{0,1\}\ \forall i,k,t$, as well as to Equations (1) and (2).
Given a set of initial seeds $\{V_k^0\}_{k=1}^K$, as explained above, we assume that initial membership degrees $\mu_{ik}^0 \in \{0,1\}$ are obtained by assigning each observation to the cluster with the nearest centroid, in a crisp way, i.e., $\mu_{ik}^0 = 1$ when $d(X_i, V_k^0) = \min_{l=1,\dots,K} d(X_i, V_l^0)$ and $\mu_{ik}^0 = 0$ otherwise. Then, at each iteration t > 0, the method first updates the centroids through the formula
V_k^t = \frac{1}{N_k^{(t-1)}} \sum_{i=1}^{N} \mu_{ik}^{(t-1)} X_i \qquad (5)
where $N_k^{(t-1)} = \sum_{i=1}^{N} \mu_{ik}^{(t-1)}$ denotes the number of observations belonging to cluster $C_k^{(t-1)}$, i.e., its cluster size. Next, observations' membership degrees are updated, again assigning each observation to the cluster with the nearest (updated) centroid, i.e., $\mu_{ik}^t = 1$ when $d(X_i, V_k^t) = \min_{l=1,\dots,K} d(X_i, V_l^t)$ and $\mu_{ik}^t = 0$ otherwise. These two steps are repeated until a stopping criterion is met.
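As a minimal illustration of this two-step iteration (not the implementation used in the computational study), the following NumPy sketch assigns each observation to its nearest centroid and then recomputes centroids by Equation (5):

```python
import numpy as np

def k_means(X, seeds, t_max=100, eps=1e-5):
    """Plain K-means: crisp assignment to the nearest centroid, then mean update (Eq. (5))."""
    V = np.asarray(seeds, dtype=float).copy()
    labels = np.zeros(X.shape[0], dtype=int)
    for _ in range(t_max):
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        labels = d2.argmin(axis=1)                                # nearest-centroid assignment
        V_new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else V[k]
                          for k in range(V.shape[0])])            # Equation (5)
        if np.max(np.abs(V_new - V)) < eps:                       # stopping criterion
            V = V_new
            break
        V = V_new
    return V, labels
```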

2.1.2. Fuzzy C-Means

This method was proposed by Dunn [32] as a generalization of the K-means that allows considering partial association degrees between observations and clusters. Later, Bezdek [33] improved and generalized Dunn's proposal by introducing the fuzzifier parameter m > 1, which allows controlling the fuzziness of the clusters. Therefore, although apparently similar to the K-means, the standard FCM algorithm allows the elements $\mu_{ik}^t$ of the partition matrices $M^t$ to be continuous degrees in the unit interval [0,1] instead of binary indexes in {0,1}, and thus tries to solve the optimization problem
\min_{M,V} J_m(M,V) = \min_{M,V} \sum_{i=1}^{N} \sum_{k=1}^{K} (\mu_{ik}^t)^m (d_{ik}^t)^2, \quad t \in T \qquad (6)
subject to $\mu_{ik}^t \in [0,1]\ \forall i,k,t$, as well as to Equations (1) and (2). Departing from a set of initial seeds $\{V_k^0\}_{k=1}^K$, at each iteration, it is possible to apply the method of Lagrange multipliers in order to find a local minimum of Equation (6) verifying the imposed constraints. This leads to updating the centroids following the expression
V_k^t = \frac{\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m X_i}{\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m}, \quad t > 0 \qquad (7)
as well as to obtaining the membership degrees as
\mu_{ik}^t = \frac{\left[(d_{ik}^t)^2\right]^{-\frac{1}{m-1}}}{\sum_{g=1}^{K} \left[(d_{ig}^t)^2\right]^{-\frac{1}{m-1}}}, \quad t \geq 0 \qquad (8)
The limit behavior of the FCM method in terms of the fuzzifier parameter m is studied for instance in [50]: when m tends to infinity, the centroids converge to the global average of data, and, when m tends to 1, the method tends to be equivalent to the K-means. In addition, in [50], different results on the convergence of the FCM are proved, which guarantee its convergence to a local or global minimum or to a saddle point of Equation (6). Similar results for alternative FCM methods and generalizations are demonstrated in [51,52]. In [53], it is shown that Equation (6) is convex only when Μ or V is fixed. The FCM has basically the same advantages and disadvantages discussed in Section 1 for the K-means (see also [24] for a detailed comparison between K-means and FCM).
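For illustration, one FCM iteration following Equations (7) and (8) can be sketched as below (schematic NumPy code; X is the N × n data matrix, mu the current N × K partition matrix, and a small eps guards against zero distances):

```python
import numpy as np

def fcm_iteration(X, mu, m=2.0, eps=1e-12):
    """One FCM step: centroids by Equation (7), then membership degrees by Equation (8)."""
    w = mu ** m                                                   # N x K weights
    V = (w.T @ X) / w.sum(axis=0)[:, None]                        # Equation (7)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    u = d2 ** (-1.0 / (m - 1.0))                                  # Equation (8), unnormalized
    return V, u / u.sum(axis=1, keepdims=True)
```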

2.1.3. Fuzzy C-Means with Cluster Observations Ratios

A modification of the FCM method is proposed in [21,40] with the idea of adding cluster size variables Φ , facilitating the formation of clusters of different sizes. The optimization problem the modified method tries to solve is
\min_{M,\Phi,V} J_m(M,\Phi,V) = \min_{M,\Phi,V} \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m (d_{ik}^t)^2, \quad t \in T \qquad (9)
subject to $\mu_{ik}^t \in [0,1]\ \forall i,k,t$, as well as to Equations (1)–(3). The application of Lagrange multipliers to Equation (9) leads, as before, to Equation (7) for centroid updating, as well as to obtaining the new membership degrees by following the expression
\mu_{ik}^t = \frac{\left[(\phi_k^{(t-1)})^{1-m} (d_{ik}^t)^2\right]^{-\frac{1}{m-1}}}{\sum_{g=1}^{K} \left[(\phi_g^{(t-1)})^{1-m} (d_{ig}^t)^2\right]^{-\frac{1}{m-1}}}, \quad t > 0 \qquad (10)
Finally, the ratios of observations per cluster are to be computed as
\phi_k^t = \frac{\left(\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2\right)^{\frac{1}{m}}}{\sum_{g=1}^{K} \left(\sum_{i=1}^{N} (\mu_{ig}^t)^m (d_{ig}^t)^2\right)^{\frac{1}{m}}}, \quad t > 0 \qquad (11)
Notice that, due to the interdependence between membership degrees and observations ratios, an initialization method has to be provided to obtain these when t = 0. As mentioned above, Equations (44) and (45) are applied to this aim for this and the following methods.
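One iteration of this variant, following Equations (7), (10) and (11), could be sketched as follows (again an illustrative rendering under the same conventions as above, with phi holding the K observations ratios):

```python
import numpy as np

def fcma_iteration(X, mu, phi, m=2.0, eps=1e-12):
    """FCM with cluster sizes: centroids (Eq. (7)), memberships (Eq. (10)), ratios (Eq. (11))."""
    w = mu ** m
    V = (w.T @ X) / w.sum(axis=0)[:, None]                        # Equation (7)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    u = (phi[None, :] ** (1.0 - m) * d2) ** (-1.0 / (m - 1.0))    # Equation (10)
    mu_new = u / u.sum(axis=1, keepdims=True)
    num = ((mu_new ** m) * d2).sum(axis=0) ** (1.0 / m)           # Equation (11), unnormalized
    return V, mu_new, num / num.sum()
```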

2.2. Entropy and Relative Entropy in Cluster Analysis

This section aims to recall the definitions of some entropy and relative entropy measures and describes a couple of fuzzy clustering methods that employ divergence measures with a regularization functionality.

2.2.1. Entropy Measures

The notion of information entropy was a cornerstone in the proposal of a general information and communication theory by Shannon in the mid-20th century [39]. Given a random variable A, the information entropy of A can be understood as the mathematical expectation of the information content of A, thus measuring the average amount of information (or, equivalently, uncertainty) associated with the realization of the random variable. When applied to fuzzy distributions verifying probabilistic constraints, such as membership degrees to clusters for which the Ruspini condition [49] holds, entropy can be understood as a measure of the fuzziness of such a distribution [45]. In this context, entropy is largest when the overlap between the different classes or clusters is greatest. This has motivated the application of entropy measures in fuzzy clustering as a penalization term aiming to regularize the solution of clustering problems by avoiding too overlapping partitions. Furthermore, entropy measures have also found a relevant application in recent developments in non-fuzzy clustering ensembles [54,55]. Next, we recall the definitions of the entropy measures proposed by Shannon, Tsallis, and Rényi.
Definition 1
[39]. Let A be a discrete random variable with finite support $\chi = \{x_1,\dots,x_s\}$ and probability mass function P(A). Then, the Shannon entropy of A is defined as
S(A) = -\sum_{x_i \in \chi} P(x_i) \ln P(x_i) \qquad (12)
Definition 2
[43]. Let A be a discrete random variable with finite support $\chi = \{x_1,\dots,x_s\}$ and probability mass function P(A). Then, the Tsallis entropy of A is defined as
T(A) = \left(1 - \sum_{x_i \in \chi} P(x_i)^q\right) \Big/ (q-1) \qquad (13)
where the parameter q is interpreted as the degree of non-extensiveness or pseudo-additivity.
Tsallis entropy was introduced for non-extensive statistics in the context of statistical physics and thermodynamics [43]. Notice that Equation (13) generalizes Equation (12), since, when $q \to 1$, Equation (12) is obtained.
Definition 3
[46]. Let A be a discrete random variable with finite support $\chi = \{x_1,\dots,x_s\}$ and probability mass function P(A). Then, the Rényi entropy of A is defined as
R(A) = \ln\left(\sum_{x_i \in \chi} P(x_i)^{\alpha}\right) \Big/ (1-\alpha) \qquad (14)
where the parameter α > 0 indicates the order of the entropy measure.
Notice that Equation (14) generalizes several other entropy measures, such as Shannon entropy (when $\alpha \to 1$, see [46]), Hartley entropy (when $\alpha \to 0$, see [56]), min-entropy (when $\alpha \to \infty$, see again [46]), and collision entropy (when $\alpha = 2$, see [57]). Notice also that Equation (14) is a generalized mean in the Nagumo–Kolmogorov sense [46].

2.2.2. Relative Entropy

Following the authors of [58,59], given two discrete random variables A and B with the same finite support $\chi = \{x_1,\dots,x_s\}$ and probability mass functions P(A) and Q(B), the relative entropy in the context of statistics would be understood as the loss of information that would result from using B instead of A. If this loss of information were null or small, the value of the divergence $D(A \| B)$ would be equal to zero or close to zero, while, if the difference is large, the divergence would take a large positive value. In addition, the divergence is always positive and convex, although, from the metric point of view, it is not always a symmetrical measure, that is, $D(A \| B) \neq D(B \| A)$.
Next, we recall the definitions of the Kullback–Leibler, Tsallis, and Rényi relative entropy measures.
Definition 4
[42]. In the above conditions, the Kullback–Leibler relative entropy is given by
D_{KL}(A \| B) = \sum_{x_i \in \chi} P(x_i) \ln\left(\frac{P(x_i)}{Q(x_i)}\right) \qquad (15)
Definition 5
[43]. In the above conditions, the Tsallis relative entropy is given by
D_T(A \| B) = \left(\sum_{x_i \in \chi} P(x_i)\left[\left(\frac{P(x_i)}{Q(x_i)}\right)^{q-1} - 1\right]\right) \Big/ (q-1) \qquad (16)
where the parameter q has a similar interpretation as in Definition 2.
Definition 6
[46]. In the above conditions, the Rényi relative entropy is given by
D_R(A \| B) = \frac{1}{\alpha-1} \ln\left(\sum_{x_i \in \chi} Q(x_i)^{1-\alpha} P(x_i)^{\alpha}\right) \qquad (17)
where the parameter α > 0 provides the order of the divergence measure as in Definition 3.
In [60], it is shown that Equation (17) generalizes Equation (15) when $\alpha \to 1$. In addition, when $\alpha = 1/2$, twice the Bhattacharyya distance [61] is obtained.

2.2.3. Fuzzy C-Means with Kullback–Leibler Relative Entropy and Cluster Size

The application of the Kullback–Leibler relative entropy in fuzzy clustering is proposed in [41], using Equation (15) as a regularization term to penalize solutions with a high level of overlapping between clusters. In that work, Equation (15) is added to the FCM objective function presented in Section 2.1.2 (without the fuzzifier parameter m), replacing the probability distribution P(A) in Equation (15) with the fuzzy membership degrees in the partition matrix M, as well as Q(B) with the observations ratios per cluster Φ. A parameter ζ > 0 is introduced to weight the influence of the regularization term. In this way, the method proposed in [41] aims to solve the optimization problem
\min_{M,\Phi,V} J_{\zeta}(M,\Phi,V) = \min_{M,\Phi,V} \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}^t (d_{ik}^t)^2 + \zeta \left( \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}^t \ln\left(\frac{\mu_{ik}^t}{\phi_k^t}\right) \right), \quad t \in T \qquad (18)
subject to $\mu_{ik}^t \in [0,1]\ \forall i,k,t$, as well as to Equations (1)–(3).
Minimizing Equation (18) by the method of Lagrange multipliers again leads to updating centroids through Equation (7), as well as computing membership degrees by following the expression
\mu_{ik}^t = \frac{\phi_k^{(t-1)} \exp\left(-(d_{ik}^{(t-1)})^2 / \zeta\right)}{\sum_{g=1}^{K} \phi_g^{(t-1)} \exp\left(-(d_{ig}^{(t-1)})^2 / \zeta\right)}, \quad t > 0 \qquad (19)
Finally, the obtained formula for the ratios of observations per cluster is
\phi_k^t = \sum_{i=1}^{N} \mu_{ik}^t \Big/ N, \quad t > 0 \qquad (20)
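An illustrative NumPy sketch of one iteration of this method follows; the centroid update is written here as a membership-weighted mean (Equation (7) with the degrees themselves as weights, since the objective in Equation (18) carries no fuzzifier), and Equations (19) and (20) give the remaining updates.

```python
import numpy as np

def kl_fcm_iteration(X, mu, phi, zeta=1.0):
    """KL-regularized FCM step: centroids, memberships (Eq. (19)), ratios (Eq. (20))."""
    V = (mu.T @ X) / mu.sum(axis=0)[:, None]                      # membership-weighted means
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
    u = phi[None, :] * np.exp(-d2 / zeta)                         # Equation (19), unnormalized
    mu_new = u / u.sum(axis=1, keepdims=True)
    phi_new = mu_new.sum(axis=0) / X.shape[0]                     # Equation (20)
    return V, mu_new, phi_new
```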

2.2.4. Fuzzy C-Means with Tsallis Relative Entropy and Cluster Size

A first proposal for the application of Tsallis entropy in fuzzy clustering is given in [62], using Equation (13) without considering cluster size variables. Later, the Tsallis relative entropy, Equation (16), is applied in [45] as a penalty function within a regularization term, which leads to a method that attempts to solve the optimization problem
\min_{M,\Phi,V} J_{m,\zeta}(M,\Phi,V) = \min_{M,\Phi,V} \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta}{m-1} \left( \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m - \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}^t \right), \quad t \in T \qquad (21)
subject to $\mu_{ik}^t \in [0,1]\ \forall i,k,t$, as well as to Equations (1)–(3). Notice that this objective function builds on the objective function of the FCM with cluster observations ratios presented in Section 2.1.3, therefore considering the fuzzifier parameter m in the intra-cluster variance term, along with the regularization parameter ζ > 0 weighting the Tsallis relative entropy penalization term, in which the parameter m plays the role of the non-extensiveness parameter q.
Using the Lagrange multipliers method on Equation (21) and its constraints again leads to Equation (7) for centroid updating and calculating the partition matrix elements at each iteration as
\mu_{ik}^t = \left[(\phi_k^{(t-1)})^{1-m} \left((d_{ik}^t)^2 + \frac{\zeta}{m-1}\right)\right]^{-\frac{1}{m-1}} \Bigg/ \sum_{g=1}^{K} \left[(\phi_g^{(t-1)})^{1-m} \left((d_{ig}^t)^2 + \frac{\zeta}{m-1}\right)\right]^{-\frac{1}{m-1}}, \quad t > 0 \qquad (22)
while the obtained expression for the ratio of observations per cluster is
\phi_k^t = \left[\sum_{i=1}^{N} (\mu_{ik}^t)^m \left((d_{ik}^t)^2 + \frac{\zeta}{m-1}\right)\right]^{\frac{1}{m}} \Bigg/ \sum_{g=1}^{K} \left[\sum_{i=1}^{N} (\mu_{ig}^t)^m \left((d_{ig}^t)^2 + \frac{\zeta}{m-1}\right)\right]^{\frac{1}{m}}, \quad t > 0 \qquad (23)
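Analogously, one iteration of the Tsallis divergence-based method, following Equations (7), (22) and (23), can be sketched as below (illustrative code; note the common additive term ζ/(m − 1) on the squared distances):

```python
import numpy as np

def tsallis_fcm_iteration(X, mu, phi, m=2.0, zeta=1.0, eps=1e-12):
    """Tsallis-regularized FCM step: centroids (Eq. (7)), memberships (Eq. (22)), ratios (Eq. (23))."""
    w = mu ** m
    V = (w.T @ X) / w.sum(axis=0)[:, None]                        # Equation (7)
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + eps
    shifted = d2 + zeta / (m - 1.0)                               # distance plus fixed addend
    u = (phi[None, :] ** (1.0 - m) * shifted) ** (-1.0 / (m - 1.0))   # Equation (22)
    mu_new = u / u.sum(axis=1, keepdims=True)
    num = ((mu_new ** m) * shifted).sum(axis=0) ** (1.0 / m)      # Equation (23), unnormalized
    return V, mu_new, num / num.sum()
```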

2.3. Kernel Metrics

When the observations X present separability issues in the metric space $\mathbb{R}^n$ in which they are contained, a fruitful strategy may be that of mapping them into a higher-dimensional space $\mathbb{R}^{n'}$, with $n' > n$, through an adequate transformation $\varphi: \mathbb{R}^n \to \mathbb{R}^{n'}$ such that the data become separable in this new space. This can allow the formation of more separated clusters than in the original space, potentially leading to obtain a better partition of the observations X. This idea, which has been successfully implemented in diverse data analysis proposals, e.g., support vector machines [63] or clustering analysis [64], is typically referred to as the kernel trick, since it is possible to apply it through the so-called kernel functions without explicitly providing the transformation $\varphi$. The reason behind this is that, when only distances between mapped observations need to be obtained, these can be calculated through a kernel function $K: \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}$ such that $K(X_i, X_j) = \langle \varphi(X_i), \varphi(X_j) \rangle$ for any $X_i, X_j \in \mathbb{R}^n$, where $\langle \cdot, \cdot \rangle$ denotes the inner product of the metric space $\mathbb{R}^{n'}$ [65].
In this way, squared distances between mapped observations and centroids in $\mathbb{R}^{n'}$ can be easily obtained through a kernel function, since it holds that
d_{ik}^2 = d(X_i, V_k)^2 = \| \varphi(X_i) - \varphi(V_k) \|^2 =
= \langle \varphi(X_i), \varphi(X_i) \rangle + \langle \varphi(V_k), \varphi(V_k) \rangle - 2 \langle \varphi(X_i), \varphi(V_k) \rangle =
= K(X_i, X_i) + K(V_k, V_k) - 2 K(X_i, V_k) \qquad (26)
In this work, we apply the Gaussian or radial basis kernel given by the function
K(X_i, V_k) = \exp\left(-\gamma \| X_i - V_k \|^2\right) \qquad (27)
where the parameter γ > 0 controls the proximity between the transformed observations. Indeed, since substituting Equation (27) into Equation (26) leads to
d_{ik}^2 = 2\left(1 - \exp\left(-\gamma \| X_i - V_k \|^2\right)\right) \qquad (28)
it is easy to see that, when γ increases, $d_{ik}^2$ approaches 2 more quickly, so each observation tends to become isolated; however, when γ tends to zero, all observations tend to be part of the same neighborhood. Furthermore, as shown for instance in [64], for a Gaussian kernel, centroid updating can be performed by simply introducing the kernel function in the expression for centroid calculation, in such a way that Equation (7) is just translated as
V_k^t = \frac{\sum_{i=1}^{N} (\mu_{ik}^t)^m K(X_i, V_k^{(t-1)}) X_i}{\sum_{i=1}^{N} (\mu_{ik}^t)^m K(X_i, V_k^{(t-1)})} \qquad (29)
A Gaussian kernel version of the FCM method was first proposed by the authors of [64], although an earlier version is introduced in [66] using an exponential metric without explicit reference to the usage of kernel functions.
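The kernel computations of Equations (27)–(29) can be illustrated with the following sketch (names are ours; gamma is the kernel parameter and m the fuzzifier):

```python
import numpy as np

def gaussian_kernel(X, V, gamma):
    """Gaussian kernel values K(X_i, V_k) as an N x K matrix (Equation (27))."""
    return np.exp(-gamma * ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2))

def kernel_sq_distances(X, V, gamma):
    """Squared distances in the transformed space, d_ik^2 = 2(1 - K(X_i, V_k)) (Equation (28))."""
    return 2.0 * (1.0 - gaussian_kernel(X, V, gamma))

def kernel_centroid_update(X, mu, V_prev, gamma, m=2.0):
    """Kernel-weighted centroid update of Equation (29)."""
    w = (mu ** m) * gaussian_kernel(X, V_prev, gamma)             # N x K weights
    return (w.T @ X) / w.sum(axis=0)[:, None]
```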

3. Fuzzy Clustering with Rényi Relative Entropy

This section presents the main proposal of this work, consisting of the application of the Rényi relative entropy between membership degrees and observations ratios per cluster as a regularization term in the context of fuzzy clustering. Thus, our proposal builds on the line of previous works studying the usage of relative entropy measures for improving the performance of fuzzy clustering methods, and particularly that of the FCM method, as reviewed in the last section.
The specific motivation for our proposal is to study whether the implementation of the Rényi relative entropy as a regularization term of membership degrees along with cluster sizes can lead to better iterative formulas for finding optimal partitions than those provided by other methods, such as the FCM itself or its extensions based on the Kullback–Leibler or Tsallis relative entropy measures.
Let us point out that, to the extent of our knowledge, there are only two previous proposals making use of Rényi entropy or relative entropy in the context of fuzzy clustering: in [47], Equation (17) is used as a dissimilarity metric without a regularization functionality, while, in [48], Equation (14) is applied with a regularization aim in the context of the application of FCM to time data arrays, but without taking into account observations ratios per cluster. Furthermore, neither the work in [47] nor that in [48] makes use of kernel metrics.
Therefore, in this work, we propose a first method using Equation (17) as a penalization function in a regularization term taking into account both membership degrees and observations ratios per cluster, which is expected to provide better results than just using a regularization term based on Equation (14) without cluster sizes, as in [48]. Moreover, our proposal adds Rényi relative entropy regularization to the FCM objective function already taking into account cluster sizes, contrary to using an objective function based only on Equation (17) without considering intra-cluster variance, as in [47]. The second proposed method extends the first one by introducing a Gaussian kernel metric, thus enabling more flexibility in the formation of clusters than that provided by the first method.
To describe our proposal, this section is divided into two subsections, one for each of the proposed methods. In these subsections, we first present the proposed objective function and constraints of the corresponding method, and then state as a theorem the expressions obtained for the iterative formulas through the application of Lagrange multipliers to the objective function. Next, the proof of the theorem is given. Finally, the steps of the algorithm associated with each method are detailed.

3.1. Fuzzy C-Means with Rényi Relative Entropy and Cluster Size

As mentioned, the objective function of the first proposed method follows the idea of Equations (18) and (21), in this case adding a regularization term based on Rényi relative entropy to the usual FCM objective function when considering observations ratios per cluster. Therefore, this method seeks to solve the optimization problem
\min_{M,\Phi,V} J_{m,\zeta}(M,\Phi,V) = \min_{M,\Phi,V} \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta}{m-1} \ln\left( \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m \right) \qquad (30)
subject to Equations (1) and (3).
We do not explicitly impose Equation (2) avoiding empty clusters since, as can be seen in Equation (31) below, the obtained iterative expression for membership degrees trivially guarantees that these are always greater than 0. Besides, let us remark that the objective function in Equation (30) adds a penalization function based on Rényi relative entropy, Equation (17), to the objective function of the FCM when considering observations ratios per cluster, Equation (9). As in Equations (18) and (21), the parameter ζ > 0 is introduced to weight the influence of this regularization term. Notice also that, similarly to Equation (21), the fuzzifier parameter m > 1 plays the role of the order parameter α in the Rényi-based term.
Next, the Lagrange multipliers method is applied to the previous optimization problem with objective function Equation (30) in order to obtain iterative expressions for the membership degrees $\mu_{ik}^t$ and the observations ratios per cluster $\phi_k^t$, as well as for updating the centroids $V_k^t$. It is important to stress that, since the constraints in Equations (1) and (3) are orthogonal for any $i = 1,\dots,N$, $t \in T$, the resolution of the Lagrange multipliers method for several constraints can be handled separately for each constraint. Moreover, as the centroids $V_k^t$ are unconstrained, the Lagrangian function defined for their optimization is equal to Equation (30).
Theorem 1.
The application of the Lagrange multipliers method to the objective function Equation (30), constrained by Equations (1) and (3), provides the following solutions $\forall t > 0$.
For updating the centroids, Equation (7):
V_k^t = \left(\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m X_i\right) \Big/ \left(\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m\right), \quad k = 1,\dots,K
For membership degrees:
\mu_{ik}^t = \frac{\left[(\phi_k^{(t-1)})^{1-m} \left((d_{ik}^t)^2 + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^{(t-1)})^{1-m} (\mu_{ik}^{(t-1)})^m}\right)\right]^{-\frac{1}{m-1}}}{\sum_{g=1}^{K} \left[(\phi_g^{(t-1)})^{1-m} \left((d_{ig}^t)^2 + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^{(t-1)})^m}\right)\right]^{-\frac{1}{m-1}}}, \quad i = 1,\dots,N,\ k = 1,\dots,K \qquad (31)
For observations ratios per clusters:
\phi_k^t = \frac{\left[\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \dfrac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right]^{\frac{1}{m}}}{\sum_{g=1}^{K} \left[\sum_{i=1}^{N} (\mu_{ig}^t)^m (d_{ig}^t)^2 + \dfrac{\zeta \sum_{i=1}^{N} (\mu_{ig}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right]^{\frac{1}{m}}}, \quad k = 1,\dots,K \qquad (32)
Proof. 
First, in order to obtain the centroid updating formula, and taking into account that $(d_{ik}^t)^2 = (X_i - V_k^t)^T (X_i - V_k^t)$ and that the centroids are unconstrained, the derivative of Equation (30) with respect to $V_k^t$ is set equal to zero, leading to
\sum_{i=1}^{N} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m \left[-2 (X_i - V_k^t)\right] = 0, \quad k = 1,\dots,K \qquad (33)
Equation (7) easily follows by solving for $V_k^t$ in Equation (33) and replacing each unknown $\mu_{ik}^t$ with the corresponding known $\mu_{ik}^{(t-1)}$ from the previous iteration. Equation (7) provides a local minimum as the second derivative of Equation (30) is positive.
Then, addressing the optimization of the membership degrees $\mu_{ik}^t$, we build the Lagrangian function associated with Equation (30) restricted to the conditions imposed by Equation (1):
L_{m,\zeta}(M,\Phi,V) = \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta}{m-1} \ln\left(\sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m\right) - \sum_{i=1}^{N} \lambda_i^t \left(\sum_{k=1}^{K} \mu_{ik}^t - 1\right) \qquad (34)
By taking the derivative of Equation (34) with respect to $\mu_{ik}^t$ and setting it equal to zero, the following is obtained:
\mu_{ik}^t = \left(\frac{\lambda_i^t (\phi_k^t)^{m-1} / m}{(d_{ik}^t)^2 + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^t)^{1-m} (\mu_{ig}^t)^m}}\right)^{\frac{1}{m-1}}, \quad i = 1,\dots,N,\ k = 1,\dots,K \qquad (35)
Since at iteration t both $\mu_{ik}^t$ and $\phi_k^t$ on the right-hand side of Equation (35) have to be considered unknown, we approximate them, respectively, by $\mu_{ik}^{(t-1)}$ and $\phi_k^{(t-1)}$. This leads to
\mu_{ik}^t = \left(\frac{\lambda_i^t (\phi_k^{(t-1)})^{m-1} / m}{(d_{ik}^t)^2 + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^{(t-1)})^m}}\right)^{\frac{1}{m-1}}, \quad i = 1,\dots,N,\ k = 1,\dots,K \qquad (36)
Now, for each $i = 1,\dots,N$, we impose that the $\mu_{ik}^t$ as given in Equation (36), $k = 1,\dots,K$, have to fulfill the corresponding constraint in Equation (1), that is
1 = \sum_{k=1}^{K} \left(\frac{\lambda_i^t (\phi_k^{(t-1)})^{m-1} / m}{(d_{ik}^t)^2 + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^{(t-1)})^m}}\right)^{\frac{1}{m-1}}, \quad i = 1,\dots,N \qquad (37)
Solving for $\lambda_i^t$ in Equation (37), we get
\lambda_i^t = m \Bigg/ \left[\sum_{k=1}^{K} \left((\phi_k^{(t-1)})^{1-m} (d_{ik}^t)^2 + \frac{(\phi_k^{(t-1)})^{1-m} \zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^{(t-1)})^m}\right)^{-\frac{1}{m-1}}\right]^{m-1}, \quad i = 1,\dots,N \qquad (38)
Then, Equation (31) is obtained by replacing $\lambda_i^t$ in Equation (36) with Equation (38). It is straightforward to check that Equation (36) is a local minimum as the second derivative of Equation (34) is positive.
Now, addressing the optimization of the observations ratios per cluster $\phi_k^t$, the Lagrangian function associated with Equation (30) restricted to Equation (3) is
L_{m,\zeta}(M,\Phi,V) = \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta}{m-1} \ln\left(\sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m\right) - \lambda^t \left(\sum_{k=1}^{K} \phi_k^t - 1\right) \qquad (39)
Taking the derivative of Equation (39) with respect to $\phi_k^t$ and setting it equal to zero results in
\phi_k^t = \left[\frac{1-m}{\lambda^t} \left(\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^t)^{1-m} (\mu_{ig}^t)^m}\right)\right]^{\frac{1}{m}}, \quad k = 1,\dots,K \qquad (40)
Now, at this point of iteration t, it is possible to consider that the membership degrees $\mu_{ik}^t$, $i = 1,\dots,N$, $k = 1,\dots,K$, are known. However, the ratio variables $\phi_g^t$, $g = 1,\dots,K$, are still unknown, and thus have to be approximated by the corresponding $\phi_g^{(t-1)}$. Then, we get
\phi_k^t = \left[\frac{1-m}{\lambda^t} \left(\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right)\right]^{\frac{1}{m}}, \quad k = 1,\dots,K \qquad (41)
By imposing that the $\phi_k^t$, $k = 1,\dots,K$, as given by Equation (41), fulfill the constraint in Equation (3), it follows that
1 = \sum_{k=1}^{K} \left[\frac{1-m}{\lambda^t} \left(\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right)\right]^{\frac{1}{m}} \qquad (42)
and, by solving Equation (42) for the Lagrange multiplier $\lambda^t$, we obtain
\lambda^t = (1-m) \left[\sum_{k=1}^{K} \left[\sum_{i=1}^{N} (\mu_{ik}^t)^m (d_{ik}^t)^2 + \frac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right]^{\frac{1}{m}}\right]^{m} \qquad (43)
Equation (32) is then obtained by replacing Equation (43) in Equation (41). Equation (32) is also a local minimum as it is not difficult to check that the second derivative of Equation (39) is greater than zero. □
A first remark regarding the updating formulae provided by Theorem 1 has to refer to the approximation of the membership degrees and observations ratios of a given iteration t by the corresponding degrees and ratios of the previous iteration t−1. As shown in the previous proof, for a given k, Equations (31) and (32) for $\mu_{ik}^t$ and $\phi_k^t$, respectively, depend on $\mu_{ik}^t$ and $\phi_k^t$ themselves. We simply avoid this difficulty by relying on the corresponding values from the previous iteration to substitute these unknown quantities, leading to Equations (36) and (41). This is a common practice when deriving iterative formulae for fuzzy clustering analysis, although one does not typically find the same unknown quantity at both sides of an expression. However, as shown by the computational study described in the next section, in this case, this strategy seems to work well in practice.
A second remark concerns the initialization of the membership degrees and observations ratios per cluster. In a similar way to the methods reviewed in Section 2, the mentioned interdependence between these quantities, as reflected in Equations (31) and (32), makes it necessary to provide alternative expressions for computing them at the beginning of the process, i.e., for t = 0. To this aim, we propose Equation (44) to initialize membership degrees:
\mu_{ik}^0 = \left(\frac{1}{K-1}\right) \left(1 - (d_{ik}^0)^2 \Big/ \sum_{g=1}^{K} (d_{ig}^0)^2\right), \quad i = 1,\dots,N,\ k = 1,\dots,K \qquad (44)
as well as Equation (45) for the observations ratios:
\phi_k^0 = \frac{1}{N} \sum_{i=1}^{N} \left[1 + \mathrm{sign}\left(\mu_{ik}^0 - \max_g \mu_{ig}^0\right)\right], \quad k = 1,\dots,K \qquad (45)
The motivation behind Equation (44) is that it provides a normalized membership degree, which is inversely proportional to the squared distance of observation i to the kth initial seed. The factor 1/(K–1) enforces the degrees of a given observation for the different clusters to sum up to 1. Then, from these initial membership degrees, Equation (45) computes the proportion of observations that would be (crisply) assigned to each cluster by following the maximum-rule. These proportions obviously add up to 1 too. Therefore, membership degrees need to be initialized prior to observations ratios, and in turn the former need initial seeds to have been previously drawn. In the computational study presented in next section, we employed Equations (44) and (45) for the initialization of all those methods considering observations ratios per cluster.
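A direct NumPy rendering of Equations (44) and (45) is given below for illustration; sign(0) is taken as 0 (NumPy's convention), so each observation contributes exactly one unit, to the cluster attaining its maximum initial degree.

```python
import numpy as np

def init_memberships(d2_0):
    """Equation (44): initial degrees from the N x K squared distances to the seeds."""
    K = d2_0.shape[1]
    return (1.0 / (K - 1)) * (1.0 - d2_0 / d2_0.sum(axis=1, keepdims=True))

def init_ratios(mu0):
    """Equation (45): proportion of observations whose maximum degree is attained at each cluster."""
    N = mu0.shape[0]
    return (1.0 + np.sign(mu0 - mu0.max(axis=1, keepdims=True))).sum(axis=0) / N
```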
A third remark refers to the amount of information gathered by the proposed method in the updating formulae for both membership degrees and observations ratios. This point gets clearer by comparing Equations (31) and (32) with the corresponding updating expressions of the Tsallis divergence-based method [45] described above, Equations (22) and (23). Notice that, in these last formulae, the Tsallis method modifies squared distances $(d_{ik}^t)^2$ by a fixed addend $\zeta/(m-1)$. In contrast, in Equations (31) and (32), this fixed addend is, respectively, multiplied by $1 \big/ \left(\sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^{(t-1)})^{1-m} (\mu_{ik}^{(t-1)})^m\right)$ and $\left(\sum_{i=1}^{N} (\mu_{ik}^t)^m\right) \big/ \left(\sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^{(t-1)})^{1-m} (\mu_{ik}^{(t-1)})^m\right)$, thus taking into account membership degrees and observations ratios in the additive modification of the distance effect. Thus, the proposed method seems to somehow take advantage of a greater amount of information in the updating steps than the Tsallis divergence-based method. This interesting feature provides a further motivation for the proposed methods based on Rényi divergence.
Finally, we summarize the algorithmic steps of the proposed fuzzy clustering method with Rényi relative entropy and cluster size; see Algorithm 1 below. Initial seeds $V_k^0$ can be selected by means of any of the several initialization methods available (see, e.g., [25]). The stopping criteria are determined through a convergence threshold $\varepsilon \in (0,1)$ and a maximum number of iterations $t_{max}$. Let us recall that, besides these last and the number K of clusters to be built, the algorithm also needs the input of the fuzzifier parameter m > 1 and the regularization parameter ζ > 0.
Algorithm 1. Fuzzy C-Means with Rényi Relative Entropy and Cluster Size
Inputs: Dataset $X = (X_i)_{i=1,\dots,N}$, number of clusters K, stopping parameters $\varepsilon$ and $t_{max}$, fuzzifier parameter m and regularization parameter ζ.
Step 1: Draw initial seeds $V_k^0$, $k = 1,\dots,K$.
Step 2: Compute distances $d^2(X_i, V_k^0)$, $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 3: Initialize $\mu_{ik}^0$ by Equation (44), $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 4: Initialize $\phi_k^0$ by Equation (45), $k = 1,\dots,K$.
Step 5: Assign t = t + 1, and update centroids $V_k^t$ by Equation (7), $k = 1,\dots,K$.
Step 6: Compute distances $(d_{ik}^t)^2$, $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 7: Membership degrees $\mu_{ik}^t$ are updated by Equation (31), $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 8: Observations ratios per cluster are updated by Equation (32), $k = 1,\dots,K$.
Step 9: IF $\max_{i,k}(|\mu_{ik}^t - \mu_{ik}^{(t-1)}|) < \varepsilon$ or $t + 1 > t_{max}$ THEN stop; ELSE return to Step 5.
Output: Final centroid matrix $V^t$ and partition matrix $M^t$.
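As an illustration of Algorithm 1, the following self-contained NumPy sketch implements Equations (7), (31), (32), (44) and (45) under the conventions stated above; it is a schematic rendering written for this presentation, not the implementation used in the computational study of Section 4.

```python
import numpy as np

def sq_distances(X, V):
    """Squared Euclidean distances between observations (N x n) and centroids (K x n)."""
    return ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)

def renyi_fcm(X, seeds, m=2.0, zeta=1.0, eps=1e-5, t_max=100):
    """Fuzzy C-means with Renyi relative entropy and cluster sizes (Algorithm 1)."""
    N, K = X.shape[0], seeds.shape[0]
    V = np.asarray(seeds, dtype=float).copy()
    d2 = sq_distances(X, V)                                            # Step 2
    mu = (1.0 / (K - 1)) * (1.0 - d2 / d2.sum(axis=1, keepdims=True))  # Eq. (44), Step 3
    phi = (1.0 + np.sign(mu - mu.max(axis=1, keepdims=True))).sum(axis=0) / N  # Eq. (45), Step 4
    phi = np.maximum(phi, 1e-12)                                       # guard against empty initial clusters
    for _ in range(t_max):                                             # Steps 5-9
        mu_old = mu
        w = mu ** m
        V = (w.T @ X) / w.sum(axis=0)[:, None]                         # Eq. (7), Step 5
        d2 = sq_distances(X, V)                                        # Step 6
        S_old = (phi[None, :] ** (1.0 - m) * mu_old ** m).sum()        # sum appearing in Eq. (31)
        u = (phi[None, :] ** (1.0 - m) *
             (d2 + zeta / ((m - 1.0) * S_old))) ** (-1.0 / (m - 1.0))  # Eq. (31), Step 7
        mu = u / u.sum(axis=1, keepdims=True)
        S_new = (phi[None, :] ** (1.0 - m) * mu ** m).sum()            # sum appearing in Eq. (32)
        num = (((mu ** m) * d2).sum(axis=0) +
               zeta * (mu ** m).sum(axis=0) / ((m - 1.0) * S_new)) ** (1.0 / m)  # Eq. (32), Step 8
        phi = num / num.sum()
        if np.max(np.abs(mu - mu_old)) < eps:                          # Step 9
            break
    return V, mu, phi
```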

3.2. Fuzzy C-Means with Rényi Divergence, Cluster Sizes and Gaussian Kernel Metric

An extension of the previous method is attained by substituting the calculation of Euclidean distances between observations and centroids in Equation (29) with a more flexible Gaussian kernel metric, defined through the kernel function $K(X_i, V_k) = \exp(-\gamma \|X_i - V_k\|^2)$, $\gamma > 0$. As explained in Section 2.3, by using this Gaussian kernel function, the calculation of squared distances $d_{ik}^2$ between transformed observations and centroids in $\mathbb{R}^{n'}$ (a higher-dimensional space than the native space $\mathbb{R}^n$ originally containing the observations, $n' > n$) can be easily carried out through the expression $d_{ik}^2 = 2(1 - K(X_i, V_k))$, without the need of explicitly applying an embedding transformation $\varphi: \mathbb{R}^n \to \mathbb{R}^{n'}$. By working in such a higher-dimensional setting, improved separability between clusters may be achieved, potentially enabling the formation of better partitions.
Then, this Gaussian kernel-based extension of the proposed method using Rényi divergence-based regularization with cluster sizes seeks to minimize the objective function
\min_{M,\Phi,V} J_{m,\zeta,\gamma}(M,\Phi,V) = \min_{M,\Phi,V} \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m \, 2\left(1 - K(X_i, V_k^t)\right) + \frac{\zeta}{m-1} \ln\left(\sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m\right) \qquad (46)
subject to Equations (1) and (3). Let us remark that, due to the introduction of the kernel function $K(X_i, V_k) = \exp(-\gamma \|X_i - V_k\|^2)$, a third parameter $\gamma > 0$ is added in Equation (46) to the two already considered in the objective function of Equation (30), i.e., the fuzzifier parameter m > 1 and the regularization parameter ζ > 0.
As above, the Lagrange multipliers method can be applied to the objective function, Equation (46), and the mentioned constraints in order to derive iterative formulae for the membership degrees $\mu_{ik}^t$ and the observations ratios per cluster $\phi_k^t$, as well as for updating the centroids $V_k^t$. The same considerations made above regarding the orthogonality of the constraints and the lack of restrictions on the transformed centroids still hold.
Theorem 2.
The application of the Lagrange multipliers method to the objective function Equation (46), constrained by Equations (1) and (3), provides the following solutions $\forall t > 0$.
For updating the centroids, Equation (29):
V_k^t = \frac{\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m K(X_i, V_k^{(t-1)}) X_i}{\sum_{i=1}^{N} (\mu_{ik}^{(t-1)})^m K(X_i, V_k^{(t-1)})}, \quad k = 1,\dots,K
For membership degrees:
\mu_{ik}^t = \frac{\left[(\phi_k^{(t-1)})^{1-m} \left(2\left(1 - K(X_i, V_k^t)\right) + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^{(t-1)})^{1-m} (\mu_{ik}^{(t-1)})^m}\right)\right]^{-\frac{1}{m-1}}}{\sum_{g=1}^{K} \left[(\phi_g^{(t-1)})^{1-m} \left(2\left(1 - K(X_i, V_g^t)\right) + \dfrac{\zeta}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^{(t-1)})^m}\right)\right]^{-\frac{1}{m-1}}}, \quad \forall i, k \qquad (47)
For observations ratios per cluster:
\phi_k^t = \frac{\left[\sum_{i=1}^{N} (\mu_{ik}^t)^m \, 2\left(1 - K(X_i, V_k^t)\right) + \dfrac{\zeta \sum_{i=1}^{N} (\mu_{ik}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{k=1}^{K} (\phi_k^{(t-1)})^{1-m} (\mu_{ik}^t)^m}\right]^{\frac{1}{m}}}{\sum_{g=1}^{K} \left[\sum_{i=1}^{N} (\mu_{ig}^t)^m \, 2\left(1 - K(X_i, V_g^t)\right) + \dfrac{\zeta \sum_{i=1}^{N} (\mu_{ig}^t)^m}{(m-1) \sum_{i=1}^{N} \sum_{g=1}^{K} (\phi_g^{(t-1)})^{1-m} (\mu_{ig}^t)^m}\right]^{\frac{1}{m}}}, \quad k = 1,\dots,K \qquad (48)
Proof. 
Only the derivation of the expression for centroid updating is now addressed, as the expressions for membership degrees and observations ratios per cluster are derived in an analogous way to Theorem 1, simply substituting $(d_{ik}^t)^2 = (X_i - V_k^t)^T (X_i - V_k^t)$ with $(d_{ik}^t)^2 = 2(1 - K(X_i, V_k^t))$.
Thus, taking into account that $(d_{ik}^t)^2 = 2(1 - K(X_i, V_k^t)) = 2(1 - \exp(-\gamma \|X_i - V_k^t\|^2))$ and that the centroids are unconstrained, the derivative of Equation (46) with respect to $V_k^t$ is set equal to zero, leading to
\sum_{i=1}^{N} (\phi_k^t)^{1-m} (\mu_{ik}^t)^m \left[-2\gamma (X_i - V_k^t) K(X_i, V_k^t)\right] = 0, \quad k = 1,\dots,K \qquad (49)
Equation (29) easily follows by solving for $V_k^t$ in Equation (49) and replacing $\mu_{ik}^t$ with $\mu_{ik}^{(t-1)}$ from the previous iteration, as well as $V_k^t$ with $V_k^{(t-1)}$ in the calculation of the kernel function. It is easy to check that Equation (29) provides a local minimum as the second derivative of Equation (46) is positive. □
Similar remarks to those exposed above for the previous method also apply for this extension. Particularly, now previous-iteration centroids intervene in centroid updating, due to $V_k^t$ being replaced with $V_k^{(t-1)}$ for the computation of Equation (29). Since the kernel function values $K(X_i, V_k^t)$ have to be computed at each iteration $t \geq 0$ to initialize and update both membership degrees and observations ratios per cluster, centroid updating can be carried out more efficiently by keeping in memory the $K(X_i, V_k^t)$ values until centroid updating at iteration t + 1 is performed. Likewise, both membership degrees and observations ratios need to be initialized through alternative expressions to Equations (44) and (45). To this aim, we simply substitute $(d_{ik}^0)^2 = (X_i - V_k^0)^T (X_i - V_k^0)$ in Equation (44) with $K(X_i, V_k^0)$ in order to initialize membership degrees, leading to Equation (50),
\mu_{ik}^0 = \left(\frac{1}{K-1}\right) \left(1 - K(X_i, V_k^0) \Big/ \sum_{g=1}^{K} K(X_i, V_g^0)\right), \quad i = 1,\dots,N,\ k = 1,\dots,K \qquad (50)
Equation (45) is applied without modifications for initializing observations ratios per cluster.
Algorithm 2, corresponding to this extension and presented below, follows basically the same lines as the non-extended method (Algorithm 1), simply substituting the computation of distances $d^2(X_i, V_k^t)$ with that of kernel values $K(X_i, V_k^t)$, which entails considering a kernel parameter γ > 0 in addition to the parameters needed by the previous method.
Algorithm 2. Fuzzy C-Means with Rényi Divergence, Cluster Sizes and Gaussian Kernel Metric
Inputs: Dataset $X = (X_i)_{i=1,\dots,N}$, number of clusters K, stopping parameters $\varepsilon$ and $t_{max}$, fuzzifier parameter m, regularization parameter ζ, and kernel parameter γ.
Step 1: Draw initial seeds $V_k^0$, $k = 1,\dots,K$.
Step 2: Compute kernel values $K(X_i, V_k^0)$, $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 3: Initialize $\mu_{ik}^0$ by Equation (50), $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 4: Initialize $\phi_k^0$ by Equation (45), $k = 1,\dots,K$.
Step 5: Assign t = t + 1, and update centroids $V_k^t$ by Equation (29), $k = 1,\dots,K$.
Step 6: Compute kernel values $K(X_i, V_k^t)$, $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 7: Membership degrees $\mu_{ik}^t$ are updated by Equation (47), $i = 1,\dots,N$, $k = 1,\dots,K$.
Step 8: Observations ratios per cluster are updated by Equation (48), $k = 1,\dots,K$.
Step 9: IF $\max_{i,k}(|\mu_{ik}^t - \mu_{ik}^{(t-1)}|) < \varepsilon$ or $t + 1 > t_{max}$ THEN stop; ELSE return to Step 5.
Output: Final centroid matrix $V^t$ and partition matrix $M^t$.
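Since Algorithm 2 only replaces the Euclidean quantities of Algorithm 1 with their kernel counterparts, the sketch below (illustrative, with names of our choosing) shows just the pieces that change: the kernel matrix of Equation (27), the kernelized distance and centroid steps (Equations (28) and (29)), and the initializer of Equation (50).

```python
import numpy as np

def kernel_matrix(X, V, gamma):
    """Gaussian kernel values K(X_i, V_k) as an N x K matrix (Equation (27))."""
    return np.exp(-gamma * ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2))

def kernel_steps(X, mu_prev, V_prev, gamma, m=2.0):
    """Kernelized Steps 5-6 of the algorithm: Equation (29), then d_ik^2 = 2(1 - K) (Eq. (28))."""
    w = (mu_prev ** m) * kernel_matrix(X, V_prev, gamma)
    V = (w.T @ X) / w.sum(axis=0)[:, None]                        # Equation (29), Step 5
    d2 = 2.0 * (1.0 - kernel_matrix(X, V, gamma))                 # Equation (28), Step 6
    return V, d2

def init_memberships_kernel(X, seeds, gamma):
    """Equation (50): initial degrees built from the kernel values to the seeds (Step 3)."""
    Kmat = kernel_matrix(X, seeds, gamma)
    K = Kmat.shape[1]
    return (1.0 / (K - 1)) * (1.0 - Kmat / Kmat.sum(axis=1, keepdims=True))
```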

4. Computational Study

This section describes the setup (Section 4.1) and results (Section 4.2) of the computational study carried out to analyze the performance of the proposed fuzzy clustering methods.

4.1. Experimental Configuration

The objective of this study was to analyze the performance on real data of the proposed fuzzy clustering method and its kernel extension in comparison to that of methods with a similar approach, particularly those employing relative entropy measures other than Rényi divergence. To this aim, we conducted a computational experiment consisting of the application of 10 non-hierarchical, prototype-based clustering methods on 20 well-known supervised classification datasets. All clustering methods were given as input the known, correct number K of classes or clusters to build on each dataset. A differential evolution algorithm (a genetic-type evolutionary technique) was employed to search for the optimal parameters of each method (except the K-means, which has no parameters) on each dataset. For the best parametric configuration of each method returned by the evolutionary algorithm, a usual supervised accuracy metric was computed comparing the method's output with the actual class labels of each dataset. This process was replicated 10 times for each method and dataset, each time using different initial seeds drawn by the k-means++ method [30] (all methods using the same seeds). The mean accuracy of each method on each dataset was finally obtained.
To provide a variety of references, besides the methods proposed here, we included in this study the standard K-means, Fuzzy C-means (FCM), and FCM with cluster sizes (FCMA) methods, along with the methods using Kullback–Leibler and Tsallis divergences described in Section 2.2, as well as their extensions through a Gaussian kernel metric. The included methods, their nomenclature, and the parameter ranges considered in the search for optimal parameters are presented in Table 1.
The 20 datasets selected for this experiment constitute usual benchmarks for cluster analysis, and were downloaded from three online repositories:
UCI machine learning website [67]. Datasets: Breast Cancer Coimbra (BCC), Blood Transfusion Service Center (Blood), Lenses, Vertebral column (Vertebral-column-2 and Vertebral-column-3), and Wholesale customers (WCD-channel and WCD-region).
Website of Speech and Image Processing Unit, School of Computing at University of Eastern Finland [68]. Datasets: Flame, Jain, Compound, and Aggregation.
Keel website at University of Granada [69,70]. Datasets: Appendicitis, Bupa, Haberman, Hayes–Roth, Heart, Iris, Sonar, and Spectfheart.
The main characteristics of these 20 datasets are summarized in Table 2. Let us remark that all these classification datasets are of a supervised nature; that is, they provide the actual class label for each pattern/instance along with the available explanatory variables or features. However, these class labels were only used in this experiment once an optimal parametric configuration had been selected by the differential evolution algorithm for each method, replication, and dataset. That is, each method was then fitted to the corresponding dataset with those optimal parameters, and the cluster assignments obtained (through the maximum rule) from the output partition matrix were only then compared to the actual class labels. This allowed us to obtain the proportion of correctly classified instances, or classification accuracy, for each combination of method, dataset, and replication. As usual, a maximum-matching procedure was applied at this point in order to select the matching between output cluster numbers and actual classes providing the best accuracy. Finally, the mean accuracy along the 10 replications was computed for each method and dataset.
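A sketch of this accuracy computation is given below. Using the Hungarian algorithm (via SciPy's linear_sum_assignment) is one common way to realize the maximum-matching step, not necessarily the exact procedure used by the authors; it assumes the true labels are coded as integers 0, …, K−1 and that the number of clusters equals the number of classes, as in this experiment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_accuracy(membership, true_labels):
    """Hard-assign by the maximum rule, then match clusters to classes so that
    the number of correctly classified instances is maximized."""
    clusters = membership.argmax(axis=1)              # maximum rule: N hard assignments
    K = membership.shape[1]
    contingency = np.zeros((K, K), dtype=int)
    for c, y in zip(clusters, true_labels):
        contingency[c, y] += 1
    rows, cols = linear_sum_assignment(-contingency)  # matching that maximizes total matches
    return contingency[rows, cols].sum() / len(true_labels)
```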

Differential Evolution Algorithm

As just mentioned, a differential evolution algorithm was applied in this experiment to search for the optimal parameters of each clustering method, for each combination of dataset and replication. Let us briefly recall that differential evolution [71] algorithms are a kind of evolutionary computation technique, quite similar to genetic algorithms but exhibiting a real-valued (instead of binary-valued) codification of the genotype or elements to be optimized, which also leads to some differences in how the mutation and crossover operators are designed and implemented. The specific differential evolution algorithm we applied features self-adaptation of the algorithm's scale factor and crossover parameters [72,73].
Next, we describe the main details of the application of the differential evolution (DE) algorithm in the context of the present computational study. Firstly, let D denote the number of parameters to be optimized for a given clustering method, as shown in Table 1 (for instance, D = 1 for the FCM, D = 2 for the proposed Rényi divergence-based method, and D = 3 for its kernel extension). The DE algorithm considers a set or population of NP D-dimensional vectors $p^l = (p_1^l, \ldots, p_D^l)$, $l = 1, \ldots, NP$, each providing a feasible solution of the optimization problem associated with the parameters of the clustering method. Here, we set the population size as NP = 15D, providing a larger set of solution vectors as the number of parameters to be searched for increases.
Along the execution of the DE algorithm, the initial population of solution vectors evolves by certain mechanisms, trying to find better solutions that translate into a better performance of the clustering method while avoiding getting stuck in a local minimum. Particularly, the applied DE algorithm consists of the following four steps, where the last three are repeated until a preset maximum number of iterations (Gmax = 2000 in our experiment) is reached:
At the initialization step, NP D-dimensional vectors $p^l$ are randomly generated, each coordinate $p_d^l$ being uniformly drawn in the range designated for the dth parameter to be optimized, d = 1,…,D (see Table 1).
At the mutation step, we applied the so-called rand-to-rand/1 strategy, which for each $l = 1, \ldots, NP$ creates a mutant solution vector $v^l$ from three randomly selected vectors of the current population, following the expression

$$v^l = p^{r_1} + F^l \left( p^{r_2} - p^{r_3} \right),$$

where $r_1$, $r_2$, and $r_3$ are randomly selected integers in the interval [1, NP] and $F^l \in (0,1)$ is a scale parameter that controls the amount of diversity being introduced in the population of solutions through mutation (see Appendix A).
At the crossover step, for each $l = 1, \ldots, NP$, a candidate solution vector $q^l$ is generated in such a way that, for each d = 1,…,D, it is assigned $q_d^l = v_d^l$ with probability $CR^l$, and $q_d^l = p_d^l$ otherwise. The crossover parameter $CR^l \in (0,1)$ thus determines the frequency with which current population solutions and mutations mix to create diverse candidate solutions.
Finally, at the selection step, for each $l = 1, \ldots, NP$, the parametric configurations of the considered clustering method represented by the current solution $p^l$ and the candidate solution $q^l$ are compared through their fitness, that is, their performance at providing an adequate partition on the considered dataset and replication. Such fitness was assessed by the Xie–Beni validation measure [74] (see Appendix B)

$$XB(p) = \frac{\sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{ik}^{m} d_{ik}^{2}}{N \left( \min_{l \neq k} \| V_l - V_k \|^2 \right)},$$

where $\mu_{ik}$, $V_k$, and $d_{ik}^2$, respectively, denote the final membership degrees, centroids, and distances from observations to centroids returned by the considered clustering method after being run with the parametric configuration p (see Appendix C). Therefore, at this selection step, the corresponding clustering method has to be run twice for each l, once with parametric configuration $p^l$ and once with $q^l$, respectively producing fitness assessments $XB(p^l)$ and $XB(q^l)$. Then, $p^l$ is included in the population of the DE algorithm's next iteration if $XB(p^l) < XB(q^l)$; otherwise, $q^l$ is used instead.
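The mutation, crossover, and selection steps for a single population member can be sketched as follows. Function and variable names are illustrative; the three donor indices are drawn without repetition (standard practice in differential evolution, although not stated explicitly above), squared Euclidean distances are assumed for $d_{ik}^2$ in the Xie–Beni index, run_clustering stands for whichever clustering method is being tuned, and the scale and crossover values come from the self-adaptive scheme described next.

```python
import numpy as np

rng = np.random.default_rng()

def xie_beni(X, memberships, centroids, m):
    """XB = sum_i sum_k mu_ik^m * d_ik^2 / (N * min_{l != k} ||V_l - V_k||^2); lower is better."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
    numerator = ((memberships ** m) * d2).sum()
    centroid_d2 = ((centroids[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(centroid_d2, np.inf)                             # exclude l == k
    return numerator / (X.shape[0] * centroid_d2.min())

def evolve_member(X, population, l, F_l, CR_l, run_clustering):
    """One DE generation for member l: rand/1-style mutation, binomial crossover,
    and selection by the Xie-Beni fitness of the resulting clustering runs."""
    NP, D = population.shape
    r1, r2, r3 = rng.choice(NP, size=3, replace=False)               # mutation donors
    v = population[r1] + F_l * (population[r2] - population[r3])     # mutant vector
    q = np.where(rng.random(D) < CR_l, v, population[l])             # binomial crossover
    # Selection: run the clustering method with both parameter vectors and keep the
    # one with the lower Xie-Beni value (run_clustering returns memberships, centroids).
    mu_p, V_p = run_clustering(X, population[l])
    mu_q, V_q = run_clustering(X, q)
    xb_p = xie_beni(X, mu_p, V_p, m=population[l][0])   # m as first coordinate (Appendix C)
    xb_q = xie_beni(X, mu_q, V_q, m=q[0])               # EFCA / K-means would need special handling
    return population[l] if xb_p < xb_q else q
```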
We used the self-adaptation scheme proposed in [75] for the DE algorithm's scale $F^l$ and crossover $CR^l$ parameters. This scheme allows these parameters to vary for each population's solution vector and along the execution of the DE algorithm, enabling a more efficient search of the solution space. Particularly, at a given iteration G = 1,…,Gmax, the DE parameters were obtained as follows:
Scale: Draw a value $F^{l,temp}$ from a N(0.5, 0.25) distribution. Then, compute the scale parameter to be used at iteration G for the lth population member as

$$F^{l,G} = \begin{cases} 1, & \text{if } F^{l,temp} > 1, \\ 0.1, & \text{if } F^{l,temp} < 0.1, \\ F^{l,temp}, & \text{otherwise.} \end{cases}$$
Crossover: Draw two U(0,1) random numbers $rand_1$ and $rand_2$, and assign $CR^{l,temp} = 0.1 + 0.8 \cdot rand_1$. Then, compute the crossover parameter to be used at iteration G for the lth population member as

$$CR^{l,G} = \begin{cases} CR^{l,temp}, & \text{if } rand_2 < 0.1, \\ CR^{l,G-1}, & \text{otherwise.} \end{cases}$$
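In code, the two self-adaptive draws above can be sketched as follows. Here N(0.5, 0.25) is read as a normal distribution with mean 0.5 and variance 0.25; the text does not say whether 0.25 denotes the variance or the standard deviation, so this is an assumption.

```python
import numpy as np

rng = np.random.default_rng()

def adapt_scale():
    """F^{l,G}: draw from N(0.5, 0.25) and clip to [0.1, 1]."""
    f = rng.normal(loc=0.5, scale=np.sqrt(0.25))   # 0.25 interpreted here as the variance
    return float(np.clip(f, 0.1, 1.0))

def adapt_crossover(cr_prev):
    """CR^{l,G}: redraw uniformly in [0.1, 0.9] with probability 0.1, else keep the previous value."""
    rand1, rand2 = rng.random(2)
    cr_temp = 0.1 + 0.8 * rand1
    return cr_temp if rand2 < 0.1 else cr_prev
```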

4.2. Results

Table 3 presents the mean accuracy, along the 10 replications, attained by each clustering method on each dataset considered in this experiment. The best performance for each dataset is highlighted in bold. A first impression of these results is that two methods seem to stand out among the others, each achieving the best performance on four datasets: the kernel extension of the proposed Rényi divergence-based method (kRenyi) and the method based on Tsallis divergence (Tsallis).
In order to descriptively compare the aggregated performance of the methods across all datasets in terms of central tendency measures, Table 4 shows the mean and median accuracy attained by each method across all datasets, as well as the related standard deviation. Notice that the kRenyi method achieves the best mean and median performance, while the proposed Renyi method ranks second in terms of median accuracy. Indeed, there seems to be a clear gap in median accuracy between the proposed methods (Renyi and kRenyi) and the rest. Table 4 also shows that only the proposed Renyi method seems to exhibit a positive synergy with the use of a Gaussian kernel extension, as it is the only method that improves on both its mean and median accuracy when such an extension is applied. Let us also note that the proposed Renyi method is the one with the lowest accuracy variability across the different datasets, as measured by the standard deviations shown in Table 4. In this sense, the combination of a relatively high median accuracy with a relatively small variability suggests that the proposed Renyi method presents a quite robust performance.
In order to analyze possible statistically significant differences among the methods, we applied two non-parametric multiple-comparison statistical tests, the Friedman ranks test [76,77] and the Friedman aligned ranks test [78]. Both tests constitute non-parametric alternatives to the parametric ANOVA test for multiple comparisons. The main difference between the two Friedman tests is that the former (Friedman ranks test) treats each dataset as an experimental block, thus considering that the different samples provided by the results of each method are dependent, while the latter (Friedman aligned ranks test) removes this assumption and considers all samples to be independent, allowing for a certain comparability between the performances of the different methods on different datasets (see Appendix D). That is, the former procedure tests for differences in mean ranks controlling for differences due to datasets, while the latter tests for differences in mean (aligned) ranks, without taking dataset differences into account.
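For reference, the classical Friedman ranks test can be reproduced with SciPy from a datasets-by-methods accuracy matrix such as Table 3; the file name below is a placeholder, and SciPy uses the chi-square approximation, so the resulting p-value may differ slightly from implementations based on other Friedman statistics.

```python
import numpy as np
from scipy.stats import friedmanchisquare

acc = np.loadtxt("table3_accuracies.csv", delimiter=",")   # hypothetical 20 x 10 matrix (datasets x methods)
stat, p_value = friedmanchisquare(*[acc[:, j] for j in range(acc.shape[1])])
print(f"Friedman ranks test: chi2 = {stat:.3f}, p = {p_value:.4f}")
```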
Table 5 shows the mean ranks attained by each method. Notice that the kernel extension of the proposed method, kRenyi, obtains the lowest rank, consistently with the best central tendency measures already observed in Table 4. However, the related Friedman ranks test provides a p-value of 0.5089 and thus is not significant at a significance level of α = 0.05. Therefore, this test does not allow concluding that significant differences exist between the 10 analyzed clustering methods when controlling for dataset differences.
In turn, Table 6 presents the mean aligned rank computed for each method. Again, the kRenyi method obtains the lowest rank. However, now the related Friedman aligned ranks test provides a p-value of 0.0304, being therefore significant at a significance level of α = 0.05. This leads to concluding that statistically significant differences exist among the 10 methods when performance comparability between datasets is allowed.
Due to the significance of the previous aligned ranks multiple-comparison test, and given that the proposed kRenyi method obtains the best aligned rank, we next applied a post hoc test on the results of the aligned rank test to check for statistically significant differences between the kRenyi method and the rest. That is, we take kRenyi as a control method and study the significance of the aligned rank differences between this control and the other methods. Table 7 presents the p-values provided by this test for each comparison versus the kRenyi method. Only the comparison with EFCA, the Kullback–Leibler divergence-based method, is significant at a significance level of α = 0.05, although the p-value obtained for the comparison with K-means is also small and would be significant at a significance level of α = 0.1. Other comparisons' p-values are also relatively small, pointing to a potential superiority of the kRenyi method over the FCM and its kernel extension (kFCM), as well as over the proposed non-kernel Renyi method and the kernel extension of the Tsallis divergence-based method (kTsallis).
To compare the computational efficiency of the proposed methods, Table 8 presents the mean times spent by each considered method on the benchmark datasets along the 10 experimental replications. Let us remark that the mean computational costs shown in this table were calculated by applying each method on each dataset with the corresponding parametric configuration returned by the DE algorithm at each replication. That is, these computational costs do not include execution times of the DE algorithm; instead, they represent the mean execution times of each single method on each dataset. In this sense, it is important to notice that actual execution times of a method on a given dataset may vary considerably from one replication to another depending on the interaction between the initial seeds and the parametric configuration selected for the method at each replication: in some replications, convergence of a method may occur much more quickly than in others. This helps explain the relatively large standard deviations presented in Table 8, as well as the fact that lower-complexity methods, e.g., the K-means, do not obtain systematically lower computational costs than higher-complexity methods, e.g., FCM or the proposed Renyi and kRenyi methods. This variability also makes it difficult to draw strong conclusions from the data presented in Table 8. Although the Renyi method exhibits the largest mean cost across all datasets (64.2 ms), its kernel extension kRenyi (which obtained the best results in terms of classification accuracy) almost halves its execution times (39.3 ms), even presenting the best cost performance on several datasets. In turn, this global mean execution time of the kRenyi method only doubles that of the K-means (18 ms) and represents a 50% increment with respect to that of the FCM (26.3 ms). Furthermore, the kRenyi method improves on the mean computational cost of other divergence-based methods, such as Tsallis and kTsallis. Overall, the observed differences in computational cost among the considered methods are rather small on the benchmark datasets of this experiment, rarely exceeding an order of magnitude.
To sum up, the results of the computational study consistently point to a relatively good performance of the proposed Rényi divergence-based methods, especially in the case of the Gaussian kernel-extended kRenyi method. Although the evidence extracted from this experiment is not fully conclusive from a statistical point of view, particularly on a dataset-by-dataset basis, there seems to be enough support to at least conclude that kRenyi performs significantly better than the Kullback–Leibler divergence-based method on an all-datasets basis. This conclusion was somewhat to be expected, since Rényi divergence is indeed a generalization of Kullback–Leibler divergence. Moreover, a close-to-significance superiority of kRenyi is also suggested with respect to K-means, FCM (with and without kernel extension), and the kernel-extended Tsallis divergence-based method. The small increment (if any) in the computational cost of the proposed methods seems to be compensated by the improvement in classification performance, especially in the case of the kRenyi method.

5. Conclusions

This paper delves into the usage of entropy measures as regularization functions penalizing the formation of excessively overlapping partitions in fuzzy clustering, building on some previous proposals applying relative entropy measures, or divergences, to that regularization aim. In this sense, this work particularly focuses on the application of Rényi divergence between fuzzy membership degrees of observations into clusters, on the one hand, and observations ratios per cluster, on the other hand, in the context of fuzzy clustering problems considering cluster size variables. Since Rényi divergence (as also happens with Tsallis divergence) provides a generalization of several other, more specific divergence measures, particularly of Kullback–Leibler divergence, its application in fuzzy clustering seems interesting in order to devise more general and potentially more effective methods. This led us to the proposal of two fuzzy clustering methods exhibiting a Rényi divergence-based regularization term, the second method extending the first one through the consideration of a Gaussian kernel metric instead of the standard Euclidean distance.
An extensive computational study was also carried out to illustrate the feasibility of our approach, as well as to analyze the performance of the proposed clustering methods in comparison with that of several other methods, particularly some methods also applying divergence-based regularization and Gaussian kernel metrics. The results of this study, although not fully conclusive from a statistical point of view, clearly point to a comparatively good performance of the proposed method, and particularly of its Gaussian kernel extension, which significantly improves on the performance of Kullback–Leibler divergence-based clustering methods.
Future research by the authors following this work is ongoing in the direction of studying the application, within the methods proposed here, of distances or metrics other than the Euclidean and the Gaussian kernel metrics. We are particularly interested in the effect of the Mahalanobis distance on our methods when dealing with non-spherical clusters, and in its potential synergy with Rényi-based regularization.

Author Contributions

Conceptualization, J.B. and J.M.; methodology, J.B.; software, J.B.; validation, J.M., D.V. and J.T.R.; formal analysis, J.B.; investigation, J.B.; resources, J.B.; data curation, J.B.; writing—original draft preparation, J.B. and D.V.; writing—review and editing, J.T.R., D.V. and J.B.; visualization, J.B.; supervision, J.M., J.T.R. and D.V.; project administration, J.M.; and funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Government of Spain (grant PGC2018-096509-B-100), and Complutense University of Madrid (research group 910149).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Such diversity is convenient in order to adequately explore the solution space and avoid getting stuck in a local minimum. When $F^l$ is close to 0, the population will tend to converge too quickly to a non-optimal solution with low diversity, and, when $F^l$ is close to 1, the population will tend to converge too slowly to a non-optimal solution with low diversity as well. Therefore, $F^l$ should be kept away from both 0 and 1.

Appendix B

Notice that the Xie–Beni validation measure can be interpreted as the intra-cluster variance divided by the minimum distance between centroids, and thus it only uses information from the clustering process, not employing knowledge of the actual class labels. The motivation behind employing this validation measure as the fitness criterion of the DE algorithm instead of the objective function of the clustering method being considered is that the former provides a fitness metric external to the clustering method, avoiding overfitting issues such as shrinking the entropy regularization parameter ζ to 0.

Appendix C

The parameter m in Equation (52) is obtained as the first coordinate of the solution vector p, except for the EFCA method (which implicitly considers m = 1) and the K-means (which has no parameters to optimize).

Appendix D

This is reflected in the different procedures employed by either test when computing average ranks for each method: On the one hand, the Friedman ranks test proceeds by ranking the methods’ performance on each dataset, i.e., from the 1st to the 10th for each dataset, and later computes the average rank of each method. On the other hand, the Friedman aligned ranks test first calculates the mean accuracy of all methods on each dataset, and then subtracts it from each method’s accuracy. This process is repeated for all datasets, and then a single ranking of all obtained differences is computed (i.e., from the 1st to the 200th, as there are 10 methods and 20 datasets). These ranks are known as aligned ranks. Finally, the average aligned rank of each method is calculated.
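The aligned-rank computation described above can be sketched as follows; ranks are assigned so that rank 1 corresponds to the best aligned accuracy, consistent with lower average ranks indicating better performance in Tables 5 and 6.

```python
import numpy as np
from scipy.stats import rankdata

def average_aligned_ranks(acc):
    """acc: (datasets x methods) accuracy matrix, e.g., the 20 x 10 matrix of Table 3."""
    aligned = acc - acc.mean(axis=1, keepdims=True)        # remove the dataset effect
    ranks = rankdata(-aligned.ravel()).reshape(acc.shape)  # single ranking of all differences
    return ranks.mean(axis=0)                              # average aligned rank per method
```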

References

  1. Anderberg, M.R. Cluster Analysis for Application; Academic Press: New York, NY, USA, 1972. [Google Scholar]
  2. Härdle, W.; Simar, L. Applied Multivariate Statistical Analysis, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2007. [Google Scholar]
  3. Johnson, J.W.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice Hall: Upper Saddle River, NJ, USA, 1998. [Google Scholar]
  4. Srivastava, M.S. Methods of Multivariate Statistics; John Wiley & Sons, Inc.: New York, NY, USA, 2002. [Google Scholar]
  5. Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef]
  6. Ward, J.H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
  7. Forgy, E. Clustering analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics 1965, 21, 768–769. [Google Scholar]
  8. MacQueen, J.B. Some methods of classifications and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 21 June–18 July 1965; pp. 281–297. [Google Scholar]
  9. Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Clustering Analysis; John Wiley & Sons, Inc.: New York, NY, USA, 1990. [Google Scholar]
  10. Park, H.-S.; Jun, C.-H. A simple and fast algorithm for K-medoids clustering. Expert Syst. Appl. 2009, 36, 3336–3341. [Google Scholar] [CrossRef]
  11. Cheng, Y. Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1995, 17, 790–799. [Google Scholar]
  12. Zhang, T.; Ramakrishnan, R.; Livny, M. BIRCH: An efficient data clustering method for very large databases. ACM Sigmod Rec. 1996, 25, 103–114. [Google Scholar] [CrossRef]
  13. Ester, M.; Kriegel, H.-P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 226–231. [Google Scholar]
  14. Ankerst, M.; Breuning, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmond Rec. 1999, 28, 49–60. [Google Scholar] [CrossRef]
  15. Schaeffer, S.E. Graph Clustering. Comput. Sci. Rev. 2007, 1, 27–64. [Google Scholar] [CrossRef]
  16. Ng, A.Y.; Jordan, M.I.; Weiss, Y. On spectral clustering: Analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2001, 14, 849–856. [Google Scholar]
  17. von Luxburg, U.A. Tutorial of spectral clustering. Stat. Comput. 2007, 17, 395–416. [Google Scholar] [CrossRef]
  18. Liu, J.; Han, J. Spectral clustering. In Data Clustering: Algorithms and Applications; Aggarwal, C., Reddy, C., Eds.; CRC Press Taylor and Francis Group: London, UK, 2014; pp. 177–200. [Google Scholar]
  19. Wang, W.; Yang, J.; Muntz, R. STING: A statistical information grid approach to spatial data mining. In Proceedings of the 23rd VLDB Conference, Athens, Greece, 25–29 August 1997; pp. 186–195. [Google Scholar]
  20. Sheikholeslami, G.; Chatterjee, S.; Zhang, A. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th VLDB Conference, New York, NY, USA, 24–27 August 1998; pp. 428–439. [Google Scholar]
  21. Miyamoto, S.; Ichihashi, H.; Honda, K. Algorithms for fuzzy clustering. In Methods in C-Means Clustering with Applications; Kacprzyk, J., Ed.; Springer Berlin Heidelberg: Berlin/Heidelberg, Germany, 2008; Volume 299. [Google Scholar]
  22. Kohonen, T. Self-Organizing Maps, 2nd ed.; Springer: Berlin, Germany, 1997. [Google Scholar]
  23. Bottou, L.; Bengio, Y. Convergence properties of the k-means algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 27–30 November 1995; pp. 585–592. [Google Scholar]
  24. Cebeci, Z.; Yildiz, F. Comparison of k-means and fuzzy c-means algorithms on different cluster structures. J. Agric. Inform. 2015, 6, 13–23. [Google Scholar] [CrossRef] [Green Version]
  25. Celebi, M.E.; Kingravi, H.A.; Vela, P.A. A Comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 2013, 40, 200–210. [Google Scholar] [CrossRef] [Green Version]
  26. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-means clustering algorithm. J. R. Stat. Society. Ser. C 1979, 28, 100–108. [Google Scholar] [CrossRef]
  27. Jain, A.K. Data Clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  28. Steinley, D. K-means clustering: A half-century synthesis. Br. J. Math. Stat. Psychol. 2006, 59, 1–34. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  29. Selim, S.Z.; Ismail, M.A. K-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 81–87. [Google Scholar] [CrossRef]
  30. Arthur, D.; Vassilvitskii, S. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’07), New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  31. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353. [Google Scholar]
  32. Dunn, J.C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Cybern. Syst. 1973, 3, 32–57. [Google Scholar]
  33. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, NY, USA, 1981. [Google Scholar]
  34. Amo, A.; Montero, J.; Biging, G.; Cutello, V. Fuzzy classification systems. Eur. J. Oper. Res. 2004, 156, 495–507. [Google Scholar] [CrossRef]
  35. Bustince, H.; Fernández, J.; Mesiar, R.; Montero, J.; Orduna, R. Overlap functions. Nonlinear Anal. Theory Methods Appl. 2010, 72, 1488–1499. [Google Scholar] [CrossRef]
  36. Gómez, D.; Rodríguez, J.T.; Montero, J.; Bustince, H.; Barrenechea, E. n-Dimensional overlap functions. Fuzzy Sets Syst. 2016, 287, 57–75. [Google Scholar] [CrossRef]
  37. Castiblanco, F.; Franco, C.; Rodríguez, J.T.; Montero, J. Evaluation of the quality and relevance of a fuzzy partition. J. Intell. Fuzzy Syst. 2020, 39, 4211–4226. [Google Scholar] [CrossRef]
  38. Li, R.P.; Mukaidono, M. A Maximum Entropy Approach to fuzzy clustering. In Proceedings of the 4th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE/IFES 1995), Yokohama, Japan, 20–24 March 1995; pp. 2227–2232. [Google Scholar]
  39. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 623–656. [Google Scholar] [CrossRef]
  40. Miyamoto, S.; Kurosawa, N. Controlling cluster volume sizes in fuzzy C-means clustering. In Proceedings of the SCIS & ISIS 2004, Yokohama, Japan, 21–24 September 2004; pp. 1–4. [Google Scholar]
  41. Ichihashi, H.; Honda, K.; Tani, N. Gaussian Mixture PDF Approximation and fuzzy C-means clustering with entropy regulation. In Proceedings of the Fourth Asian Fuzzy Systems Symposium, Tsukuba, Japan, 31 May–3 June 2000; pp. 217–221. [Google Scholar]
  42. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  43. Tsallis, C. Possible Generalization of Boltzmann-Gibbs Statistics. J. Stat. Phys. 1988, 52, 478–479. [Google Scholar] [CrossRef]
  44. Kanzawa, Y. On possibilistic clustering methods based on Shannon/Tsallis-entropy for spherical data and categorical multivariate data. In Lecture Notes in Computer Science; Torra, V., Narakawa, Y., Eds.; Springer: New York, NY, USA, 2015; pp. 115–128. [Google Scholar]
  45. Zarinbal, M.; Fazel, M.H.; Turksen, I.B. Relative entropy fuzzy C-means clustering. Inf. Sci. 2014, 260, 74–97. [Google Scholar] [CrossRef]
  46. Rényi, A. On measures of Entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; pp. 547–561. [Google Scholar]
  47. Jenssen, R.; Hild, K.E.; Erdogmus, D. Clustering using Renyi’s entropy. In Proceedings of the International Joint Conference on Neural Networks, Portland, OR, USA, 26 August 2003; pp. 523–528. [Google Scholar]
  48. Popescu, C.C. A Clustering model with Rényi entropy regularization. Math. Rep. 2009, 11, 59–65. [Google Scholar]
  49. Ruspini, E. A new approach to clustering. Inform. Control 1969, 15, 22–32. [Google Scholar] [CrossRef] [Green Version]
  50. Pal, N.R.; Bezdek, J.C. On cluster validity for the fuzzy C-means model. IEEE Trans. Fuzzy Syst. 1995, 3, 370–379. [Google Scholar] [CrossRef]
  51. Yu, J. General C-means clustering model. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1197–1211. [Google Scholar]
  52. Yu, J.; Yang, M.S. Optimality test for generalized FCM and its application to parameter selection. IEEE Trans. Fuzzy Syst. 2005, 13, 164–176. [Google Scholar]
  53. Jain, A.; La, M. Data clustering: A user’s dilemma. Lect. Notes Comput. Sci. 2005, 3776, 1–10. [Google Scholar]
  54. Huang, D.; Wang, C.D.; Lai, J.H. Locally weighted ensemble clustering. IEEE Trans. Cybern. 2018, 48, 1460–1473. [Google Scholar] [CrossRef] [Green Version]
  55. Huang, D.; Wang, C.D.; Lai, J.H.; Kwoh, C.K. Toward multidiversified ensemble clustering of high-dimensional data: From subspaces to metrics and beyond. IEEE Trans. Cybern. 2021. [Google Scholar] [CrossRef] [PubMed]
  56. Hartley, R.V.L. Transmission of information. Bell Syst. Tech. J. 1928, 7, 535–563. [Google Scholar] [CrossRef]
  57. Bennett, C.H.; Bessette, F.; Brassard, G.; Salvail, L.; Smolin, J. Experimental quantum cryptography. J. Cryptol. 1992, 5, 3–28. [Google Scholar] [CrossRef]
  58. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  59. Gray, R.M. Entropy and Information Theory; Springer: New York, NY, USA, 2010. [Google Scholar]
  60. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 7, 3797–3820. [Google Scholar] [CrossRef] [Green Version]
  61. Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhya Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
  62. Ménard, M.; Courboulay, V.; Dardignac, P.A. Possibilistic and probabilistic fuzzy clustering: Unification within the framework of the non-extensive thermostatistics. Pattern Recognit. 2003, 36, 1325–1342. [Google Scholar] [CrossRef]
  63. Boser, B.E.; Guyon, I.M.; Vapnik, V.N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory—COLT ‘92, Pittsburgh, PA, USA, 27–29 July 1992; pp. 144–152. [Google Scholar]
  64. Graves, D.; Pedrycz, W. Kernel-based fuzzy clustering and fuzzy clustering: A comparative experimental study. Fuzzy Sets Syst. 2010, 4, 522–543. [Google Scholar] [CrossRef]
  65. Vert, J.P. Kernel Methods in Computational Biology; The MIT Press: Cambridge, MA, USA, 2004. [Google Scholar]
  66. Wu, K.L.; Yang, M.S. Alternative C-means clustering algorithms. Pattern Recognit. 2002, 35, 2267–2278. [Google Scholar] [CrossRef]
  67. UCI Machine Learning Repository, University of California. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 26 January 2021).
  68. School of Computing University of Eastern Finland. Available online: http://cs.joensuu.fi/sipu/datasets (accessed on 26 January 2021).
  69. Keel. Available online: www.keel.es (accessed on 26 January 2021).
  70. Alcalá-Fernández, J.; Fernández, A.; Luengo, J.; Derrac, J.; García, S.; Sánchez, L. KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. J. Mult. Valued Log. Soft Comput. 2011, 17, 255–287. [Google Scholar]
  71. Price, K.; Storn, R. Differential Evolution—A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces; Technical Report TR-95-012; International Computer Science Institute: Berkeley, CA, USA, 1995. [Google Scholar]
  72. Brest, J.; Greiner, S.; Boskovic, B.; Mernik, M.; Zumer, V. Self-adapting control parameters in differential evolution: A comparative study on numerical benchmark problems. IEEE Trans. Evol. Comput. 2006, 10, 646–657. [Google Scholar] [CrossRef]
  73. Qinqin, F.; Xuefeng, Y. Self-adaptive differential evolution algorithm with zoning evolution of control parameters and adaptive mutation strategies. IEEE Trans. Cybern. 2015, 46, 2168–2267. [Google Scholar]
  74. Xie, X.L.; Beni, G. A Validity measure for fuzzy clustering. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 841–847. [Google Scholar] [CrossRef]
  75. Liu, B.; Yang, H.; Lancaster, J.M. Synthesis of coupling matrix for diplexers based on a self-adaptive differential evolution algorithm. IEEE Trans. Microw. Theory Tech. 2018, 66, 813–821. [Google Scholar] [CrossRef]
  76. Friedman, M. The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J. Am. Stat. Assoc. 1937, 32, 674–701. [Google Scholar] [CrossRef]
  77. Friedman, M. A comparison of alternative tests of significance for the problem of m rankings. Ann. Math. Stat. 1940, 11, 86–92. [Google Scholar] [CrossRef]
  78. Hodges, J.L.; Lehmann, E.L. Ranks methods for combination of independent experiments in analysis of variance. Ann. Math. Stat. 1962, 33, 482–497. [Google Scholar] [CrossRef]
Table 1. The range of values of the parameters of each method.

Method | Description | Fuzzifier (m) | Entropy (ζ) | Gaussian kernel (γ)
K-means | K-means | – | – | –
FCM | Fuzzy C-Means | [1.075, 6] | – | –
FCMA | Fuzzy C-Means with cluster size | [1.075, 6] | – | –
kFCM | Kernel Fuzzy C-Means | [1.075, 6] | – | [0.001, 10]
kFCMA | Kernel Fuzzy C-Means with cluster size | [1.075, 6] | – | [0.001, 10]
EFCA | Kullback–Leibler relative entropy with cluster size | – | [0.000001, 10] | –
Tsallis | Tsallis relative entropy with cluster size | [1.075, 6] | [0.000001, 10] | –
kTsallis | Kernel Tsallis relative entropy with cluster size | [1.075, 6] | [0.000001, 10] | [0.001, 10]
Renyi | Rényi relative entropy with cluster size | [1.075, 6] | [0.000001, 10] | –
kRenyi | Kernel Rényi relative entropy with cluster size | [1.075, 6] | [0.000001, 10] | [0.001, 10]
Table 2. Datasets’ characteristics.

Dataset | Instances | Features | Classes
Aggregation | 788 | 2 | 7
Appendicitis | 106 | 9 | 2
BCC | 116 | 10 | 2
Blood | 748 | 5 | 2
Bupa | 345 | 6 | 2
Compound | 399 | 2 | 6
Flame | 240 | 2 | 2
Haberman | 306 | 3 | 2
Hayes–Roth | 160 | 4 | 3
Heart | 270 | 13 | 2
Iris | 150 | 4 | 3
Jain | 373 | 2 | 2
Lenses | 24 | 4 | 3
Sonar | 208 | 60 | 2
Spectfheart | 267 | 44 | 2
Vertebral-column-3 | 310 | 6 | 3
Vertebral-column-2 | 310 | 6 | 2
WCD-channel | 440 | 8 | 2
WCD-region | 440 | 8 | 3
WDBC | 569 | 30 | 2
Table 3. Results of the computational study. Table cells present the mean accuracy of the 10 experimental replications attained by each method on each dataset.

Dataset | K-Means | FCM | kFCM | FCMA | kFCMA | EFCA | Tsallis | kTsallis | Renyi | kRenyi
Flame | 84.83 | 84.17 | 81.96 | 89.00 | 88.96 | 78.29 | 86.71 | 86.79 | 69.63 | 89.13
Jain | 88.20 | 87.13 | 89.28 | 90.19 | 90.13 | 80.62 | 90.48 | 85.74 | 83.75 | 90.13
Compound | 58.22 | 55.21 | 59.45 | 50.75 | 50.13 | 46.39 | 62.98 | 47.07 | 65.54 | 66.92
Aggregation | 77.58 | 67.77 | 53.57 | 57.54 | 57.35 | 45.01 | 42.21 | 47.92 | 63.65 | 67.16
Haberman | 50.65 | 51.96 | 52.29 | 50.72 | 50.88 | 64.38 | 51.27 | 57.35 | 62.48 | 50.59
Sonar | 54.81 | 55.34 | 54.90 | 54.09 | 53.80 | 53.37 | 52.60 | 53.51 | 54.09 | 53.94
Hayes–Roth | 41.44 | 43.94 | 43.06 | 42.44 | 42.13 | 42.19 | 44.19 | 42.38 | 40.88 | 44.94
Bupa | 54.61 | 50.72 | 55.77 | 55.65 | 55.65 | 45.57 | 57.80 | 57.39 | 56.23 | 55.65
Appendicitis | 81.04 | 74.53 | 87.74 | 78.21 | 76.51 | 83.49 | 82.64 | 80.94 | 80.00 | 77.74
Iris | 77.47 | 85.33 | 63.73 | 85.20 | 82.07 | 61.00 | 80.07 | 71.67 | 75.87 | 84.87
Lenses | 51.67 | 50.83 | 49.17 | 57.08 | 52.50 | 58.75 | 55.42 | 57.50 | 53.33 | 51.67
Heart | 71.00 | 78.89 | 81.11 | 79.85 | 79.85 | 74.96 | 79.59 | 54.44 | 80.00 | 80.00
Vertebral-column-3 | 47.29 | 58.71 | 55.84 | 57.65 | 56.19 | 48.19 | 59.65 | 51.52 | 50.39 | 58.16
Vertebral-column-2 | 65.55 | 66.77 | 64.84 | 67.94 | 67.87 | 65.19 | 54.19 | 67.39 | 60.71 | 68.13
WDBC | 92.79 | 92.79 | 85.89 | 92.32 | 91.76 | 64.90 | 91.81 | 79.47 | 74.71 | 92.14
BCC | 51.64 | 50.86 | 50.26 | 52.67 | 52.67 | 48.62 | 51.38 | 55.17 | 52.76 | 52.59
WCD-channel | 56.57 | 56.36 | 56.14 | 57.18 | 57.66 | 60.39 | 57.23 | 62.16 | 57.75 | 57.43
WCD-region | 49.55 | 43.82 | 53.64 | 45.73 | 46.50 | 57.45 | 65.95 | 52.11 | 52.00 | 44.57
Blood | 58.82 | 56.42 | 59.36 | 53.60 | 53.90 | 64.09 | 57.09 | 60.80 | 62.01 | 57.19
Spectfheart | 62.73 | 59.18 | 65.54 | 66.67 | 68.88 | 73.78 | 72.73 | 77.83 | 66.67 | 70.19
Table 4. Mean and median accuracy attained by each method on all datasets and related standard deviations.

Statistic | K-Means | FCM | kFCM | FCMA | kFCMA | EFCA | Tsallis | kTsallis | Renyi | kRenyi
Mean | 63.82 | 63.54 | 63.18 | 64.22 | 63.77 | 60.83 | 64.8 | 62.46 | 63.12 | 65.66
Median | 58.52 | 57.56 | 57.75 | 57.36 | 56.77 | 60.69 | 58.72 | 57.45 | 62.24 | 62.54
Std. Dev. | 15.09 | 15.17 | 14.16 | 15.84 | 15.69 | 12.68 | 15.47 | 13.51 | 11.48 | 15.39
Table 5. Average ranks of the compared methods in the Friedman rank test.

Method | Avg. Rank
kRenyi | 4.6
Tsallis | 4.8
FCMA | 4.85
kTsallis | 5.1
Renyi | 5.5
kFCM | 5.7
kFCMA | 5.75
FCM | 6.025
K-means | 6.225
EFCA | 6.45
Table 6. Average aligned ranks of the compared methods in the Friedman aligned rank test.

Method | Avg. Aligned Rank
kRenyi | 78.575
Tsallis | 80.9
FCMA | 95.25
kFCMA | 100.625
kTsallis | 103.2
Renyi | 105.475
FCM | 106.375
kFCM | 106.45
K-means | 110.3
EFCA | 117.85
Table 7. Post hoc p-values for the comparison of the kRenyi method versus the other methods.

Method | p-Value
EFCA | 0.0319
K-means | 0.0830
kFCM | 0.1278
FCM | 0.1288
Renyi | 0.1416
kTsallis | 0.1785
kFCMA | 0.2283
FCMA | 0.3623
Tsallis | 0.8989
Table 8. Computational cost of the different methods on the benchmark datasets. Table cells present the mean time in milliseconds (ms) spent by each method on each dataset with the parametric configuration returned by the DE algorithm for the 10 replications. Standard deviations over the 10 experimental replications are given in brackets.

Dataset | K-Means | FCM | kFCM | FCMA | kFCMA | EFCA | Tsallis | kTsallis | Renyi | kRenyi
Aggregation | 56 (37.9) | 132 (18.1) | 91.2 (15.8) | 122 (34.1) | 152 (59.1) | 17.2 (25.8) | 62 (52.9) | 126 (11.4) | 179.6 (92.2) | 3.2 (1.7)
Appendicitis | 14.8 (13.3) | 17.6 (7.4) | 10 (3.9) | 38.8 (9.4) | 42.8 (8.2) | 16.8 (16.9) | 55.6 (10.7) | 32 (23.3) | 53.2 (12.9) | 46 (9.5)
BCC | 12.4 (4.8) | 10.8 (3.3) | 25.2 (8.2) | 13.2 (9.8) | 17.6 (16.9) | 18.8 (15.9) | 39.2 (18.7) | 18.8 (19.3) | 17.2 (11.2) | 12 (2.7)
Blood | 18.8 (7.3) | 9.2 (3.3) | 12 (2.7) | 50 (4.7) | 56 (7.5) | 23.6 (12.6) | 57.2 (8.7) | 97.6 (36.7) | 71.2 (22.8) | 64 (6)
Bupa | 17.6 (5.7) | 15.6 (4.8) | 17.2 (8.2) | 29.2 (6.3) | 35.6 (8.7) | 26.8 (17.4) | 19.6 (22.7) | 44.4 (18.5) | 30.8 (7.3) | 45.2 (13.2)
Compound | 32 (12.2) | 76 (10.2) | 52.8 (12.2) | 90.8 (32.1) | 90 (27.6) | 10.8 (16.4) | 71.2 (4.9) | 114 (38) | 140.8 (46.5) | 2.4 (2.1)
Flame | 12.4 (3) | 12.4 (1.3) | 12.8 (5.9) | 38 (6.6) | 36 (2.7) | 10 (11) | 40.8 (7) | 51.2 (6.5) | 50.4 (11.3) | 43.2 (1.7)
Haberman | 14.4 (6.3) | 9.2 (1.9) | 10.4 (2.1) | 35.6 (3) | 40.8 (6.2) | 4.8 (3.7) | 41.2 (4.2) | 50.4 (9.3) | 49.6 (10.2) | 45.2 (5)
Hayes–Roth | 17.6 (6.9) | 45.6 (11) | 37.6 (13.5) | 47.6 (14.2) | 55.6 (22.8) | 24 (18.5) | 71.2 (20.4) | 18 (25.7) | 70 (12) | 0.4 (1.3)
Heart | 14 (6.9) | 9.2 (1.9) | 12 (3.3) | 20.8 (14.3) | 14.8 (3.8) | 22.4 (21.4) | 17.2 (6) | 28.8 (28) | 14.4 (4.3) | 16.8 (3.2)
Iris | 15.6 (7.2) | 13.6 (3.4) | 23.6 (14.7) | 49.2 (12.2) | 43.2 (10.6) | 15.2 (20.6) | 55.2 (17.3) | 54.4 (13.8) | 58 (16.5) | 56.4 (11.7)
Jain | 11.6 (3.5) | 10.4 (2.8) | 8 (1.9) | 40 (6.3) | 40.8 (1.7) | 8.8 (10.8) | 45.2 (8.7) | 51.2 (10.6) | 44 (8.6) | 48.4 (2.3)
Lenses | 8.4 (3) | 31.6 (5.5) | 31.2 (4.9) | 28 (13.2) | 26 (19.9) | 14.8 (16.8) | 14.8 (6.8) | 8.8 (13) | 18 (16.6) | 0.8 (1.7)
Sonar | 16.4 (4.8) | 23.2 (14.9) | 37.6 (11.8) | 35.2 (7.7) | 34 (17.7) | 23.6 (14.3) | 35.6 (8.1) | 30.8 (25.2) | 48.4 (19) | 45.2 (22.2)
Spectfheart | 13.2 (4.6) | 14.4 (2.1) | 13.6 (2.1) | 28.4 (4.4) | 28.8 (12.8) | 38.8 (14.4) | 68 (17.6) | 7.6 (12.9) | 32.8 (7.3) | 48.8 (22.8)
Vertebral-column-2 | 12.4 (4.8) | 9.6 (3.9) | 9.6 (2.1) | 40.8 (7) | 42.4 (3.9) | 8.8 (7.5) | 62 (4.7) | 44.8 (21.6) | 58 (13) | 49.2 (4.6)
Vertebral-column-3 | 20 (7.1) | 34.8 (9.4) | 33.6 (12.2) | 46.8 (5.7) | 52.4 (6.9) | 8.8 (14.2) | 60 (13.5) | 70.4 (15.6) | 79.6 (15.8) | 58 (3.4)
WCD-channel | 18.8 (26) | 8 (1.9) | 5.6 (2.1) | 50.4 (11.8) | 58.8 (15) | 7.2 (3.2) | 56.8 (12.6) | 46.8 (18.1) | 62.8 (15.1) | 66.4 (17.9)
WCD-region | 17.2 (6.3) | 20.4 (4.4) | 43.6 (5.5) | 49.2 (13.6) | 55.6 (12.1) | 2.8 (1.9) | 30.4 (28.5) | 91.2 (27.2) | 105.6 (10.4) | 56.8 (20.7)
WDBC | 16.8 (1.7) | 22.8 (6) | 23.2 (17.5) | 64 (10.3) | 68.8 (8.2) | 60 (25.1) | 70.4 (12) | 73.2 (41.6) | 98.8 (20.6) | 76.8 (14.7)
Mean | 18 | 26.3 | 25.5 | 45.9 | 49.6 | 18.2 | 48.7 | 53 | 64.2 | 39.3