Article

Using K-Means Clustering in Python with Periodic Boundary Conditions

by
Alicja Miniak-Górecka
*,
Krzysztof Podlaski
and
Tomasz Gwizdałła
Faculty of Physics and Applied Informatics, University of Lodz, Pomorska 149/153, 90-236 Lodz, Poland
*
Author to whom correspondence should be addressed.
Symmetry 2022, 14(6), 1237; https://doi.org/10.3390/sym14061237
Submission received: 26 May 2022 / Revised: 7 June 2022 / Accepted: 12 June 2022 / Published: 14 June 2022
(This article belongs to the Section Computer)

Abstract

Periodic boundary conditions are natural in many scientific problems and often lead to particular symmetries. Working with datasets that exhibit periodicity requires special approaches when analyzing these phenomena. Periodic boundary conditions often help to solve or describe a problem in a much simpler way. Angular rotational symmetry is an example of periodic boundary conditions; this symmetry implies angular momentum conservation. Clustering, on the other hand, is one of the first and most basic methods used in data analysis, and it is often a starting point when new data are acquired and explored. K-means clustering is one of the most commonly used clustering methods and can be applied to many different situations with reasonably good results. Unfortunately, the original k-means approach does not cope well with the periodic properties of data. For example, the original k-means algorithm treats a zero angle as very far from an angle of 359 degrees. Periodic boundary conditions change the natural distance measure, and ignoring them introduces an error into k-means clustering. In this paper, we discuss the problem of periodicity in a dataset and present a periodic k-means algorithm that modifies the original approach. Considering that many data scientists prefer off-the-shelf solutions, such as libraries available in Python, we show how easily they can incorporate periodicity into the existing k-means implementation in the PyClustering library. This allows anyone to integrate periodic conditions without significant additional cost. The paper evaluates the described method using three different datasets: an artificial dataset, wind direction measurements, and the New York taxi service dataset. The proposed periodic k-means provides better results when the dataset manifests periodic properties.

1. Introduction

Clustering is considered one of the fundamental classes of tasks that constitute data mining. It aims to divide a given dataset into subpopulations of objects that are more similar to one another inside a group than to objects outside of it. Certainly, this elementary description can be presented in more detail by viewing clustering as a procedure that leads to the recognition of particular patterns, or as a method that helps to find dense regions in the search/problem space. The crucial property of clustering is that it can be considered a form of unsupervised learning. The lack of a supervisor is connected to the fact that we do not have initial labels or coordinates that could help us assign a particular object to predefined groups. The importance of the seminal approaches can be gauged by the popularity of the papers introducing them; among these methods are k-means [1], DBSCAN [2], and EM-clustering [3].
In our paper, we present the application of the k-means method to datasets presenting periodic (cyclic) patterns. We expect the division into clusters to express the symmetry related to the periodicity in the data. Numerous approaches confirm the broad application areas for such a particular version of this well-known technique. In many papers, the authors either show different problems requiring solutions by periodicity-oriented clustering methods or attempt to propose new solutions, often trying to improve the computational complexity of existing techniques. Periodic clustering is used, for example, in spatial–temporal pattern analysis [4]. Although the problem itself is not described as a clustering issue, it was initially considered a form of clustering in [5]; later, different methods, such as k-means [6] or hierarchical aggregation [7], were used. Muscle activation patterns that repeat during physical activities have been studied [8,9] using the hierarchical, clustering-based CIMAP technique applied to the EMG signal. Staying with medical topics, the analysis of histological images involves artificial intelligence (AI) methods, namely rough-fuzzy circular clustering, for hue diagram normalization. Interesting results also come from spectral analysis, a technique that seems to fit periodic problems especially well; it is used for a broad class of problems, such as energy consumption [10] or plasma oscillations (detection of plasma wave modes with the STFT [11]).
Pairwise distance matrices are often used as metrics for estimating the similarity of objects. We can find such an approach in many papers related to biophysical analyses, such as the nice mathematical justification of accelerometer data clustering in [12] or the comprehensive study of many different signals (ACC, EMG, BVP) obtained from various sensors in [13]. In [14], we can find the typical angular-based definition of distance used to study X-ray images; the authors used a method based on global optimization techniques, e.g., a simulated annealing-based procedure. AI-related methods (fuzzy clustering) are also used to search for noise sources in machinery [15]. A novel approach to pattern searching, based on the famous Hungarian algorithm, originally dedicated to solving the assignment problem in polynomial time, can also be applied to the clustering problem [16]; its application to the grouping of objects on the sphere should also be mentioned here.
The paper is devoted to a modification of k-means, one of the most popular clustering methods. Over the last 40 years, there have been many works that either improved the original algorithm [17,18,19] or suggested new, similar approaches (such as c-means and k-medoids [20,21]). As for every clustering technique, the aim is to build a predefined number of clusters from similar objects. The basic assumption underlying the method is to minimize within-cluster variances by grouping elements. This procedure requires the initial definition of a metric, making it possible to determine distances between objects. It is also essential to note that the problem is, in its basic formulation, NP-hard, so a stop condition must be defined for the particular procedure. K-means can be applied to almost all data types and in many fields of study; many recently published scientific articles use k-means in different fields [22,23,24,25,26,27]. Moreover, k-means can be used with many different metrics, such as Euclidean, Manhattan, Chebyshev, and others [28], but these metrics are not connected with any boundary conditions, especially periodic ones. In this paper, we do not address the problem of determining the number of clusters or the initialization [19,29], as we are interested in the influence of periodic conditions on the clustering itself. We compared both methods, periodic and original, with the same initialization and the same number-of-clusters parameter.
There are several approaches that use k-means (with some modifications) to study cyclic phenomena. In [30], the authors used a variant of a multidimensional metric, the Mahalanobis metric, to determine the spacings between objects. Another change of metric, this time to the Mardia–Jupp formula, was used in [31], where the k-means procedure was the first part of the expectation–maximization (EM) method. A typical time-oriented approach devoted to studying periodic weather phenomena is presented in [32], where the values of many parameters, e.g., humidity or temperature, were collected with appropriate sensors and clustering was the first phase of their processing. A k-means-based algorithm, k-ear, was developed in [33] to analyze energy needs related to the seasonal access characteristics of data management systems. The supporting role of k-means in the process of data preparation for an artificial neural network was used in [34] to predict traffic flow patterns. A similar topic was discussed in [35], where a cyclic distance for time-of-day interval partition was developed to perform a traffic analysis.
The organization of the paper is as follows. In Section 2, we review how periodic boundary conditions can be incorporated into the k-means algorithm. Section 3 describes the implementation of the proposed method in Python with the use of the PyClustering library [36]. In Section 4, we present measures used in the paper for comparing different clustering methods. Section 5, Section 6 and Section 7 are devoted to measuring the effectiveness of the proposed periodic k-means approach for three different datasets: artificial, wind direction measurement in a real experiment, and New York taxi service. Section 8 concludes the paper.

2. K-Means Clustering and Periodicity

Clustering can be defined as a division of a set $X$ of $n$ data points $x_i \in X$, $1 \le i \le n$, into $k$ subsets $\{C_1, C_2, \ldots, C_k\}$, $k \le n$. Each subset $C_i$ is called a cluster. All clusters are disjoint (i.e., $C_i \cap C_j = \emptyset$ if $i \ne j$), and together they contain all data points: $\bigcup_{i=1}^{k} C_i = X$.
There are many different clustering algorithms. In this work, we used the k-means approach [1]. The k-means algorithm is one of the most recognized algorithms for data clustering. The method is designed for data points represented as d-dimensional real-valued vectors x i R d . Moreover, the algorithm’s goal is to find such divisions into clusters that minimize the within-cluster sum of squares (WCCS).
$$\mathrm{WCCS} = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - c_i \|^2, \quad \text{where } c_i = \mathrm{mean}(x_j \in C_i). \qquad (1)$$
Points $c_i$ are called the centers of the clusters. The k-means algorithm is relatively easy to implement (see Algorithm 1).
Algorithm 1 K-means algorithm
Task: divide the set of data points $X$ into $k$ clusters $\{C_1, C_2, \ldots, C_k\}$.
Inputs: data points $X$, initial centers $\{c_1, \ldots, c_k\}$.
Output: final set of centers $\{c_1, \ldots, c_k\}$.
Parameters: number of clusters $k$, stop requirements.
Begin
repeat
    for all $x \in X$ do
        Find the index $i$ of the nearest center, i.e., $i = \arg\min_j \| x - c_j \|^2$.
        Assign $x$ to cluster $C_i$.
    end for
    Calculate the new center of each cluster: $c_i = \mathrm{mean}(x \in C_i)$, $1 \le i \le k$.
until centers are stable within the given parameters.
End
The result of the algorithm strongly depends on the initial selection of centers. One of the most popular ways of selecting the initial centers is k-means++ [37].
We can already see that two main factors in the k-means algorithm are:
  • The distance measure $\|\cdot\|^2$;
  • The position of the centers, i.e., the mean position within the clusters.
Most approaches that use k-means clustering concern a finite interval with well-defined boundaries; the situation must be considered differently when boundary conditions apply to the dataset, as both of the factors above can be affected by such conditions. This paper focuses on k-means clustering when periodic boundary conditions are taken into account. Therefore, we have to define alternative forms of the distance measure and the mean. In this paper, we focus on the one-dimensional case; thus, in the rest of the paper, we assume $d = 1$.
Let us start with periodicity. A periodic boundary condition means that there exists a map $M_T: \mathbb{R} \to (0, T)$, $T \in \mathbb{R}$, such that for all points $p \in \mathbb{R}$:
$$M_T(p) = M_T(p \pm T). \qquad (2)$$
The parameter $T$ is called the period. A dataset is periodic if all data points are subject to the periodic boundary conditions,
$$x \in X \mapsto M_T(x). \qquad (3)$$
As we can see from (2), the periodic boundary conditions are a form of translational symmetry of the map $M_T$.
The distance between two data points $x, y \in (0, T)$ under the appropriate periodic boundary conditions has the following form:
$$d(x, y) = \sum_{i=1}^{d} \min\left( |x_i - y_i|,\; T_i - |x_i - y_i| \right)^2, \quad x, y \in \mathbb{R}^d, \qquad (4)$$
where $|\cdot|$ denotes the usual norm (absolute value). The distance measure $d(x, y)$ can replace the traditional one-dimensional squared norm $\|x - y\|^2$ in the k-means algorithm.
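To make this concrete, the following is a minimal sketch of the one-dimensional periodic separation; the function name periodic_distance and the default period of 360 are illustrative choices, and squaring the returned value gives the 1-D form of Equation (4).

```python
def periodic_distance(x, y, period=360.0):
    """Separation of two points on a circle of circumference `period`.

    Returns the smaller of the direct separation and the separation
    measured across the boundary; its square is the 1-D case of Eq. (4).
    """
    diff = abs(x - y) % period
    return min(diff, period - diff)


# 5 degrees and 355 degrees are only 10 degrees apart periodically,
# whereas the ordinary distance |5 - 355| would be 350.
assert periodic_distance(5, 355) == 10
assert periodic_distance(10, 30) == 20
```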
We can identify two different situations when evaluating the mean value of two data points. One is when the elements are near each other (in the sense of the default $\|\cdot\|^2$ metric); in this case $\|x - y\|^2 = d(x, y)$, and the original mean is the same as the periodic mean. On the other hand, if for two points $\|x - y\|^2 \ne d(x, y)$, we need to shift one of the points by the period $T$, i.e., $x \to x + T$, before computing the mean.
Let us consider an angle-based example. Let $\alpha$ and $\beta$ be two angles, and let the period $T$ equal $360^{\circ}$. The periodic map $M_T$ is then the same as the modulo-$T$ operation. In such a case, the mean point is:
$$\mathrm{mean}(\alpha, \beta) = \begin{cases} M_T\!\left( \dfrac{\alpha + \beta}{2} \right), & \text{when } |\alpha - \beta| < \dfrac{T}{2}, \\[1.5ex] M_T\!\left( \dfrac{\alpha + \beta + T}{2} \right), & \text{when } |\alpha - \beta| \ge \dfrac{T}{2}. \end{cases} \qquad (5)$$
As we can see, if the norm-based distance between two points is smaller than $T/2$, the traditional mean and the periodic mean are identical. This observation suggests how we can proceed with a set $S$ of points. First, we divide the set $S$ into two subsets: $S_L$ on the left and $S_R$ on the right side of the point $T/2$. We compute the means of these subsets independently, $\mu_L$ and $\mu_R$. In the end, we check whether $\|\mu_R - \mu_L\|^2$ equals $d(\mu_R, \mu_L)$, and compute the weighted mean of these two values, with the weights being the numbers of points in $S_L$ and $S_R$, respectively. The final value of $\mu$ needs to be mapped into $(0, T)$ in order to preserve the periodic conditions; usually, we apply the modulo operator and use the remainder as the final mean value. The procedure is presented in detail in Algorithm 2.
Algorithm 2 Periodic mean
Task: find the mean position of the points in $S$ according to the periodic boundary condition.
Inputs: data points $S$.
Output: mean position $\mu$ of the set $S$.
Parameters: period $T$.
Begin
Create empty sets $S_L$, $S_R$.
for all $x \in S$ do
    if $x < T/2$ then
        Add $x$ to set $S_L$.
    else
        Add $x$ to set $S_R$.
    end if
end for
Count $\mu_L$ – mean position of points in $S_L$.
Count $N_L$ – number of elements in $S_L$.
Count $\mu_R$ – mean position of points in $S_R$.
Count $N_R$ – number of elements in $S_R$.
if $|\mu_R - \mu_L| \le T/2$ then
    $\mu = \dfrac{N_L \mu_L + N_R \mu_R}{N_L + N_R}$
else
    $\mu = \dfrac{N_L \mu_L + N_R \mu_R + N_L T}{N_L + N_R} \pmod{T}$
end if
End
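A minimal Python sketch of Algorithm 2 might look as follows; it assumes a plain list of one-dimensional values, the function name periodic_mean is ours, and the empty-subset guards are a small addition beyond the pseudocode. It is not the exact code from the repository [39].

```python
def periodic_mean(points, period=360.0):
    """Mean of 1-D points under a periodic boundary condition (Algorithm 2)."""
    left = [p for p in points if p < period / 2]    # S_L
    right = [p for p in points if p >= period / 2]  # S_R
    if not left:
        return sum(right) / len(right)
    if not right:
        return sum(left) / len(left)
    mu_l, n_l = sum(left) / len(left), len(left)
    mu_r, n_r = sum(right) / len(right), len(right)
    if abs(mu_r - mu_l) <= period / 2:
        # Sub-means are close in the ordinary sense: plain weighted mean.
        return (n_l * mu_l + n_r * mu_r) / (n_l + n_r)
    # Otherwise the cluster wraps around the boundary: shift S_L by one period
    # and map the result back into (0, T).
    return ((n_l * (mu_l + period) + n_r * mu_r) / (n_l + n_r)) % period


# Example: angles clustered around the 0/360-degree boundary.
print(periodic_mean([350, 355, 5, 10]))  # 0.0, not the naive mean of 180
```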

3. Modifying Out-of-the-Box Python Solution

Nowadays, many data scientists use existing tools and do not implement all of the original algorithms themselves. Python is a popular choice for such work, with a broad selection of libraries that allow users to achieve their goals. If one wants to use k-means clustering, at least two libraries provide a well-developed implementation of the algorithm: PyClustering [36] and scikit-learn [38]. The former gives the user more freedom and more ways to customize the computation. Nevertheless, neither of these out-of-the-box solutions allows clusterization with periodic boundary conditions. In this paper, we chose to present how the PyClustering library can be used to achieve the required clustering with periodic boundary conditions. As mentioned before, we need to change two elements of the k-means method: the distance metric and the way the mean position within each cluster is computed. An exemplary implementation of the proposed approach can be found in the GitHub repository [39].

Using Periodic Measure in the PyClustering Library

The PyClustering library allows users to introduce their own distance measure instead of the default squared Euclidean one. For this, we can create a class extending pyclustering.utils.metric.distance_metric; in our implementation, the periodic distance measure is implemented in the PeriodicMeasure class. In order to obtain k-means clustering that respects periodic boundary conditions, one also needs to change the method of computing the position of the center of each cluster. For this, we need to override the method _kmeans__update_centers of the k-means object. For simplicity, in our implementation, we created the PeriodicMeans class. The implementation stored in the GitHub repository also contains examples of usage; the results obtained in these examples are discussed in the following sections.
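To give a sense of the overall wiring, the snippet below is a minimal sketch of how a user-defined periodic metric and a periodic center update could be plugged into PyClustering. It is not the code stored in the repository [39]: the class name PeriodicKMeans, the constant PERIOD, and the use of type_metric.USER_DEFINED are our illustrative choices, a compact periodic_mean is repeated so the snippet is self-contained, and the name-mangled internals (_kmeans__clusters, _kmeans__pointer_data) are assumptions about the library's private attributes that may differ between PyClustering versions.

```python
import numpy as np
from pyclustering.cluster.kmeans import kmeans
from pyclustering.utils.metric import distance_metric, type_metric

PERIOD = 360.0  # assumed period, e.g., angles in degrees


def periodic_distance(x, y, period=PERIOD):
    """Squared periodic distance between two 1-D points (cf. Equation (4))."""
    diff = abs(x[0] - y[0]) % period
    return min(diff, period - diff) ** 2


def periodic_mean(values, period=PERIOD):
    """Periodic mean of 1-D values (compact version of Algorithm 2)."""
    left = [v for v in values if v < period / 2]
    right = [v for v in values if v >= period / 2]
    if not left or not right:
        return sum(values) / len(values)
    mu_l, mu_r = sum(left) / len(left), sum(right) / len(right)
    if abs(mu_r - mu_l) <= period / 2:
        return (len(left) * mu_l + len(right) * mu_r) / len(values)
    return ((len(left) * (mu_l + period) + len(right) * mu_r) / len(values)) % period


class PeriodicKMeans(kmeans):
    """k-means whose centers are recomputed with the periodic mean."""

    def _kmeans__update_centers(self):
        # Assumption: the parent class keeps clusters and data in these
        # name-mangled attributes; adjust if the library version differs.
        centers = []
        for cluster in self._kmeans__clusters:
            values = [self._kmeans__pointer_data[index][0] for index in cluster]
            centers.append([periodic_mean(values)])
        return np.array(centers)


# Usage sketch: ccore=False keeps the pure-Python implementation, so the
# overridden center update and the user-defined metric are actually used.
data = [[v] for v in (5, 10, 350, 355, 170, 180, 190)]
metric = distance_metric(type_metric.USER_DEFINED, func=periodic_distance)
instance = PeriodicKMeans(data, [[0.0], [175.0]], metric=metric, ccore=False)
instance.process()
print(instance.get_clusters(), instance.get_centers())
```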

4. Comparing Different Clusterizations

Clusterization is, by definition, an unsupervised machine learning task; thus, we usually do not know what the proper division is, and for the same reason it is hard to measure which division is better. Therefore, we needed to approach the problem differently. We used methods derived for comparing statistical samples to measure their similarities. Well-known measures are: Rand, Adjusted Rand, Hubert, Jaccard, and Arabie–Boorman [40,41,42,43]. The first four measures quantify the similarity of two divisions: the closer the value is to one, the more similar the compared sets are. The last one (Arabie–Boorman) measures dissimilarity; therefore, the more similar the sets are, the closer its value is to zero. This paper compares the results of two clustering approaches and expects them to be as similar as possible. The measures can be used to compare two cluster distributions pairwise.
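As an illustration of how such an index can be evaluated, below is a small self-contained sketch of the Rand index computed by pair counting; the function name rand_index and the flat label-list representation of a clustering are our own choices.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index between two clusterings given as flat label lists."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = 0
    for i, j in pairs:
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        # A pair agrees if both clusterings keep it together or both split it.
        agreements += together_a == together_b
    return agreements / len(pairs)


# Identical partitions (up to renaming of cluster labels) give 1.0.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```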
On the other hand, since we are discussing k-means clustering, whose goal is the minimization of the WCCS (1), the WCCS itself can also be used for comparing two k-means-based approaches. Therefore, for a given dataset and a selected number of clusters k, the best k-means clustering is the one with the lowest value of WCCS.
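The WCCS of a finished clustering is straightforward to evaluate. The sketch below (our naming, 1-D data) optionally uses the periodic distance of Equation (4), so that the original and periodic results can be scored on the same footing.

```python
def wccs(clusters, centers, period=None):
    """Within-cluster sum of squares for 1-D data.

    `clusters` is a list of lists of values and `centers` the matching list
    of cluster centers.  If `period` is given, the periodic distance of
    Equation (4) is used instead of the plain difference.
    """
    total = 0.0
    for points, center in zip(clusters, centers):
        for x in points:
            diff = abs(x - center)
            if period is not None:
                diff = min(diff % period, period - diff % period)
            total += diff ** 2
    return total


# Example: two clusters of angles, scored with period 360.
print(wccs([[5, 355], [170, 190]], [0.0, 180.0], period=360.0))  # 250.0
```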

5. Artificial Dataset

We created an artificial dataset randomized with the use of a function $g(x)$. The function is not normalized, so it is not a probability distribution in the classical sense; nevertheless, it can be used to generate random points. The function $g(x)$ is a superposition of four Gaussian-like functions:
$$g(x) = \sum_{i=1}^{4} s_i \, e^{-\left( \frac{x - \mu_i}{\sigma_i} \right)^2}, \qquad (6)$$
where $s_i$, $\mu_i$, and $\sigma_i$ are the scale factor, mean, and width (standard deviation) of each Gaussian part, respectively. Using function (6), we generated 12,000 points. The function and a histogram of the generated data are presented in Figure 1.
We can observe that the data are spread between 0 and 360, which can be understood as an angle measured in degrees. We have four peaks, at 72, 144, 216, and 288 degrees, which differ in height and width. The data represent angle distributions in an imaginary experiment. This initial dataset ($D_0$) is well distributed and can easily be divided into four clusters using the k-means method; this clustering will be used as the reference. We can see that each cluster corresponds to a single peak in the histogram. We believe that if we shifted the orientation of the equipment in an experiment, the clusters should look exactly the same. Simply speaking, no matter where we place "angle 0", the division into four clusters should be connected with the peaks in the data histogram.
To check how k-means responds to a change of equipment orientation, we created two additional datasets ($D_{90}$, $D_{280}$) from the initial one. Each dataset was generated by shifting all data points by 90 and 280 degrees, respectively, and applying the periodic angular conditions. All three datasets are presented in Figure 2.
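For reproducibility, the sketch below shows one way such a dataset could be generated by rejection sampling from $g(x)$ and then shifted. The peak positions follow the description above, but the scale factors, widths, and random seed are illustrative choices, since the exact values used for the paper's dataset are not stated here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Peak positions follow the paper; scales s_i and widths sigma_i are illustrative.
MU = np.array([72.0, 144.0, 216.0, 288.0])
S = np.array([1.0, 0.6, 0.8, 1.2])
SIGMA = np.array([15.0, 10.0, 12.0, 20.0])


def g(x):
    """Superposition of four Gaussian-like functions, Equation (6)."""
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(S * np.exp(-((x - MU) / SIGMA) ** 2), axis=-1)


def sample(n, period=360.0):
    """Rejection sampling of n points from the (unnormalized) function g."""
    out = []
    g_max = g(np.linspace(0, period, 3601)).max()
    while len(out) < n:
        x = rng.uniform(0, period, size=n)
        keep = rng.uniform(0, g_max, size=n) < g(x)
        out.extend(x[keep].tolist())
    return np.array(out[:n])


d0 = sample(12_000)          # initial dataset D_0
d90 = (d0 + 90) % 360        # shifted dataset D_90
d280 = (d0 + 280) % 360      # shifted dataset D_280
```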
The shift values were selected intentionally to cover both the situation where the new "0-degree point" lies close to an expected boundary between clusters and the situation where a cluster is divided almost exactly in half. All three artificial datasets had identical shapes but were shifted by the given angles. When clusterization was applied, we expected to obtain the same, or at least very similar, clusters; that is, we expected the centers of the clusters to shift by the given angle. Unfortunately, when we applied the k-means clustering method included in the PyClustering library, we obtained three different sets of clusters (see Figure 3). Each cluster is denoted on the histogram by a different color; a black triangle indicates the position of the center of a cluster.
Looking carefully at the clusterization results (Figure 3), we can observe that the original k-means had to end a cluster at the edges of the data domain. Therefore, the two border cluster centers moved toward the middle point (angle 180). This behavior is easily visible for dataset $D_{280}$ (Figure 3c). Here, near 360 degrees, the originally widest peak was split in two. As a result, the left-most cluster took over part of the smaller cluster to its right (the second peak from the left). Moreover, we can see that this strongly influenced the positions of the centers of the other clusters; for example, it affected the position of the center of the second (orange) cluster. The center point, in this case, moved to a position near 100 degrees and was placed in the gap between two histogram peaks. This was not the expected behavior, as the data were generated using four Gaussian-like models, and we expected the centers to be located near the histogram peaks. We observed fewer visual disturbances in the clustering results for dataset $D_{90}$ (Figure 3b). In this case, the slope of one of the smaller peaks was cut at the edge of the data domain $(0, 360)$, and during original k-means clustering, the right-most cluster absorbed this peak slope. However, the changes in the positions of the cluster centers were not as significant in this case.
Let us compare the similarity of the clusters obtained for these three artificial datasets using the original and the periodic k-means. We used well-known metrics to compare the obtained sets of clusters; the values of all pairwise calculated indexes are presented in Table 1. We can see that both clustering methods give exactly the same clusters for the initial dataset: the evaluated Rand, Adjusted Rand, Hubert, and Jaccard similarity indexes are equal to one, and Arabie–Boorman is equal to zero, which proves that both methods give the same clusters. Unfortunately, when we compared the clusters obtained for the initial data with those obtained for the data shifted by 90 and 280 degrees, the clusters obtained by the original k-means were not the same. This proves that original k-means clustering does not behave well for datasets with periodic boundary conditions. On the other hand, we can see that when we applied our modification of k-means (periodic k-means), we obtained the same clusters as for the initial data, so the shift angle did not have an impact on the clustering result. The clustering is connected strictly with the peaks in the data histogram and takes the periodicity of the dataset into account. Moreover, we can see that the periodic approach gives results with a minimal value of WCCS for all datasets $\{D_0, D_{90}, D_{280}\}$ (see Table 2). The WCCS measured for the clusters obtained with the original k-means for datasets $\{D_{90}, D_{280}\}$ was higher than the one obtained by periodic k-means.
The visualizations of the clusterization results (Figure 3) and the comparison of the obtained sets of clusters (Table 1) show that the original k-means clustering does not cope well with our artificial angular-based dataset. For the analyzed shifts, the clusters were different; as a result, some elements that belonged to the same cluster in the initial set ended up in different clusters in the shifted datasets. On the other hand, the proposed periodic k-means gave exactly the same clusters, independent of the shift applied to the data. The artificial dataset allowed us to have complete control over the data and to understand how the data should be split into clusters (i.e., how many clusters should be used). We can now apply the proposed method to real periodic datasets.

6. Clustering a Real Angular Dataset

The proposed modified k-means worked nicely for the artificial dataset. Let us now check how it works for real data obtained in an experiment. The data considered were wind direction measurements conducted at the Biebrza National Park wetlands in northeastern Poland in May 2016 [44,45]. The wind direction is an angular deviation from the north direction and must be treated as periodic data. The dataset contained wind direction measurements taken every hour for one month; the data histogram is presented in Figure 4. It should be noted that the histogram has two bigger peaks, at about 60 and 340 degrees, and one smaller peak at 190 degrees. Moreover, we can observe a gap between the two highest peaks at around 20 degrees, and the bars near angle zero belong to the slope of the peak at 340 degrees. In our analysis, we used k-means clustering with the number of clusters set to three and to two. As one can see, the original k-means treats angles around 0 degrees as very distant from angles around 360 degrees; as a result, these two regions cannot be placed in one cluster. The clustering obtained using the original k-means is different from the one obtained with the proposed periodic k-means approach (see Figure 5 and Table 3). Comparing the results of the two approaches, we can observe that the periodic clusters are nearer to the expected ones. Moreover, periodic k-means gives a lower WCCS than the original k-means. For measuring the effectiveness, we can use the ratio:
$$\frac{\mathrm{WCCS}_{\text{periodic}}}{\mathrm{WCCS}_{\text{original}}} = \begin{cases} 0.95 & \text{for } k = 2, \\ 0.94 & \text{for } k = 3. \end{cases} \qquad (7)$$
This shows that periodic k-means results in a lower within-cluster sum of squares for the wind direction data. It is worth remembering that minimizing the WCCS is the primary goal of the k-means algorithm.

7. Clustering Seasonal Data

Another type of periodic behavior is seasonal. Periodicity is often implicit in the data when someone tries to analyze seasonal behavior; for example, we may be interested in the day of the year while ignoring the exact year, and we can go further down, taking into account the dependence on the day of the week or the hour of the day. The last dataset used for the evaluation of the proposed periodic k-means was the New York taxi dataset [46]. The dataset is based on taxi services registered in New York and contains information such as the taxi company, pickup/dropoff date, time, and location, as well as the number of passengers. We used data about events that happened during the first half of 2016; the data contained more than 46,000 instances. We only used the information on the date and time of a passenger pickup, in two different configurations: in the first one, we were interested in the occurrences of pickups over an average day; in the other, the week was the period under consideration. The results obtained using the original and periodic k-means are shown in Figure 6 and Table 4. For example, in the case of the daily perspective and four clusters, the WCCS ratio $\mathrm{WCCS}_{\text{periodic}} / \mathrm{WCCS}_{\text{original}}$ was equal to 0.87. In all cases, periodic clustering gave lower WCCS values when the same number of clusters was selected. We also noticed that the more clusters are used, the smaller the impact of the periodic conditions. This is understandable: as the number of clusters increases, the size of each cluster decreases, and fewer elements are exchanged between the edge clusters. The results prove that periodic k-means can discover the seasonal properties of the data. Midnight is not a crucial moment for a taxi service, and in real life, we do not split the night into two separate days for a taxi company in New York City; the periodic k-means approach reflects this property of the pickup times. We notice that, by dividing a day into four clusters, the periodic k-means distinguishes a night period that starts around 21:00 and ends before 5:00 the next day (see Figure 6a). This cannot be observed when the original k-means is used.
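To make the two configurations concrete, the following sketch shows how pickup timestamps could be mapped to values with periods of one day and one week before clustering; the file name and column name are illustrative, as the actual NYC TLC trip-record schema is not reproduced here.

```python
import pandas as pd

# Illustrative file/column names; the real TLC trip records use their own schema.
trips = pd.read_csv("taxi_trips.csv", parse_dates=["pickup_datetime"])
pickup = trips["pickup_datetime"]

# Hour of the day: values in [0, 24), to be clustered with period T = 24.
hour_of_day = pickup.dt.hour + pickup.dt.minute / 60.0

# Hour of the week: values in [0, 168), to be clustered with period T = 168.
hour_of_week = pickup.dt.dayofweek * 24.0 + hour_of_day

# Either series can then be fed to the periodic k-means with the matching period.
```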

8. Conclusions

We considered the problem of clusterization when a dataset manifests periodic boundary properties. If one needs to consider such boundary conditions during clustering, a dedicated method has to be implemented, as the standard tools do not provide one. The k-means algorithm was chosen for a more detailed analysis, as this method is likely the most used in data science. We showed that classical k-means has problems with clusterization when the data are periodic; in such a case, the classical approach does not give optimal clustering results. We have shown that the k-means algorithm can be modified to take periodic boundary conditions into account. The proposed modification of k-means was tested on three different datasets. The first, an artificial dataset, was used to show how the method works. Later, we showed that the created solution can be applied to real-life problems, using real datasets such as the wind direction and New York taxi data. For all three datasets, the modified k-means method gave better results than the original one: in each case, the periodic k-means resulted in a lower WCCS than the original method. Figure 3, Figure 5, and Figure 6 show that the periodic conditions influence the boundary clusters the most. In the original 1D k-means, the minimal and maximal points were always the edge points of these two clusters. Periodic k-means depends less on the selection of the equipment orientation or the time zone and can discover natural properties of the dataset, such as the nighttime for the taxi service.
When periodicity is considered, the border between these two clusters can move, shifting some data points from one cluster to the other; as a result, some of the other centers also move. Moreover, we can see that the clusters better represent the data, generalizing over arbitrary choices of data representation, such as the selection of the zero angle or the zero hour.
Many researchers require off-the-shelf, well-tested solutions that can easily be used in their research. As Python is one of the most respected tools in data science, we showed how the PyClustering library can be used for k-means clusterization respecting periodic boundary conditions. The reference implementation is freely published in a public GitHub repository and available for anyone to use.
For simplicity of presentation, the solution is reduced to the 1D case. However, it can easily be generalized to higher-dimensional cases, where only some of the variables are subject to periodic boundary conditions. Moreover, we showed that the periodic properties of data impact clustering procedures. In this paper, we only discussed the k-means method; other similar methods, such as c-means and k-medoids, will be analyzed in the near future.

Author Contributions

Conceptualization, A.M.-G., K.P. and T.G.; Methodology, A.M.-G., K.P. and T.G.; Software, K.P.; Writing—Original Draft Preparation, A.M.-G., K.P. and T.G.; Writing—Review & Editing, A.M.-G., K.P. and T.G.; Funding Acquisition, A.M.-G. All authors have read and agreed to the published version of the manuscript.

Funding

University of Lodz, IDUB B2111501000004.07.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hartigan, J.A.; Wong, M.A. Algorithm AS 136: A K-Means Clustering Algorithm. Appl. Stat. 1979, 28, 100. [Google Scholar] [CrossRef]
  2. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231. [Google Scholar]
  3. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22. [Google Scholar]
  4. Agrawal, R.; Srikant, R. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, Taipei, Taiwan, 6–10 March 1995. [Google Scholar] [CrossRef]
  5. Cao, H.; Mamoulis, N.; Cheung, D.W. Discovery of Periodic Patterns in Spatiotemporal Sequences. IEEE Trans. Knowl. Data Eng. 2007, 19, 453–467. [Google Scholar] [CrossRef] [Green Version]
  6. Chan, S.; Leong, K. An application of Cyclic Signature (CS) clustering for spatial-temporal pattern analysis to support public safety work. In Proceedings of the 2010 IEEE International Conference on Systems, Man and Cybernetics, Istanbul, Turkey, 10–13 October 2010. [Google Scholar] [CrossRef]
  7. Zhang, D.; Lee, K.; Lee, I. Hierarchical trajectory clustering for spatio-temporal periodic pattern mining. Expert Syst. Appl. 2018, 92, 1–11. [Google Scholar] [CrossRef]
  8. Rosati, S.; Agostini, V.; Knaflitz, M.; Balestra, G. Muscle activation patterns during gait: A hierarchical clustering analysis. Biomed. Signal Process. Control. 2017, 31, 463–469. [Google Scholar] [CrossRef] [Green Version]
  9. Agostini, V.; Rosati, S.; Castagneri, C.; Balestra, G.; Knaflitz, M. Clustering analysis of EMG cyclic patterns: A validation study across multiple locomotion pathologies. In Proceedings of the 2017 IEEE International Instrumentation and Measurement Technology Conference (I2MTC), Torino, Italy, 22–25 May 2017. [Google Scholar] [CrossRef] [Green Version]
  10. Giordano, F.; Rocca, M.L.; Parrella, M.L. Clustering complex time-series databases by using periodic components. Stat. Anal. Data Min. ASA Data Sci. 2017, 10, 89–106. [Google Scholar] [CrossRef]
  11. Haskey, S.; Blackwell, B.; Pretty, D. Clustering of periodic multichannel timeseries data with application to plasma fluctuations. Comput. Phys. Commun. 2014, 185, 1669–1680. [Google Scholar] [CrossRef]
  12. Grabovoy, A.V.; Strijov, V.V. Quasi-Periodic Time Series Clustering for Human Activity Recognition. Lobachevskii J. Math. 2020, 41, 333–339. [Google Scholar] [CrossRef]
  13. Nunes, N.; Araújo, T.; Gamboa, H. Time Series Clustering Algorithm for Two-Modes Cyclic Biosignals. In Biomedical Engineering Systems and Technologies; Springer: Berlin/Heidelberg, Germany, 2013; pp. 233–245. [Google Scholar] [CrossRef]
  14. Abraham, C.; Molinari, N.; Servien, R. Unsupervised clustering of multivariate circular data. Stat. Med. 2012, 32, 1376–1382. [Google Scholar] [CrossRef] [Green Version]
  15. Tóth, B.; Vad, J. A fuzzy clustering method for periodic data, applied for processing turbomachinery beamforming maps. J. Sound Vib. 2018, 434, 298–313. [Google Scholar] [CrossRef]
  16. Kume, A.; Walker, S.G. The utility of clusters and a Hungarian clustering algorithm. PLoS ONE 2021, 16, e0255174. [Google Scholar] [CrossRef]
  17. Lu, H.; He, T.; Wang, S.; Liu, C.; Mahdavi, M.; Narayanan, V.; Chan, K.S.; Pasteris, S. Communication-efficient k-Means for Edge-based Machine Learning. IEEE Trans. Parallel Distrib. Syst. 2022, 33, 2509–2523. [Google Scholar] [CrossRef]
  18. Fang, C.; Liu, H. Research and Application of Improved Clustering Algorithm in Retail Customer Classification. Symmetry 2021, 13, 1789. [Google Scholar] [CrossRef]
  19. Fränti, P.; Sieranoja, S. How much can k-means be improved by using better initialization and repeats? Pattern Recognit. 2019, 93, 95–112. [Google Scholar] [CrossRef]
  20. Kaufman, L.; Rousseeuw, P.J. Partitioning Around Medoids (Program PAM). In Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 1990; pp. 68–125. [Google Scholar]
  21. Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybern. 1973, 3, 32–57. [Google Scholar] [CrossRef]
  22. Hany, O.; Abu-Elkheir, M. Detecting Vulnerabilities in Source Code Using Machine Learning. In Lecture Notes in Networks and Systems; Springer: Berlin/Heidelberg, Germany, 2022; pp. 35–41. [Google Scholar] [CrossRef]
  23. Inan, M.S.K.; Alam, F.I.; Hasan, R. Deep integrated pipeline of segmentation guided classification of breast cancer from ultrasound images. Biomed. Signal Process. Control. 2022, 75, 103553. [Google Scholar] [CrossRef]
  24. Chen, M.; Zhang, Z.; Wu, H.; Xie, S.; Wang, H. Otsu-Kmeans gravity-based multi-spots center extraction method for microlens array imaging system. Opt. Lasers Eng. 2022, 152, 106968. [Google Scholar] [CrossRef]
  25. Balsor, J.L.; Arbabi, K.; Singh, D.; Kwan, R.; Zaslavsky, J.; Jeyanesan, E.; Murphy, K.M. Corrigendum: A Practical Guide to Sparse k-Means Clustering for Studying Molecular Development of the Human Brain. Front. Neurosci. 2022, 16. [Google Scholar] [CrossRef]
  26. Zhao, M.; Wang, Y.; Wang, X.; Chang, J.; Zhou, Y.; Liu, T. Modeling and Simulation of Large-Scale Wind Power Base Output Considering the Clustering Characteristics and Correlation of Wind Farms. Front. Energy Res. 2022, 10. [Google Scholar] [CrossRef]
  27. Wu, X.; Zhang, J.; Lau, A.P.T.; Lu, C. Low-complexity absolute-term based nonlinear equalizer with weight sharing for C-band 85-GBaud OOK transmission over a 100-km SSMF. Opt. Lett. 2022, 47, 1565. [Google Scholar] [CrossRef]
  28. Bora, M.D.J.; Gupta, D.A.K. Effect of Different Distance Measures on the Performance of K-Means Algorithm: An Experimental Study in Matlab. arXiv 2014, arXiv:1405.7471. [Google Scholar] [CrossRef]
  29. Jain, A.K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 2010, 31, 651–666. [Google Scholar] [CrossRef]
  30. Charalampidis, D. A modified k-means algorithm for circular invariant clustering. IEEE Trans. Pattern Anal. Mach. 2005, 27, 1856–1865. [Google Scholar] [CrossRef]
  31. Vejmelka, M.; Muslek, P.; Paluš, M.; Pelikán, E. K-means Clustering for Problems with Periodic Attributes. Int. J. Pattern Recognit. Artif. 2009, 23, 721–743. [Google Scholar] [CrossRef]
  32. Harb, H.; Makhoul, A.; Laiymani, D.; Jaber, A.; Tawil, R. K-means based clustering approach for data aggregation in periodic sensor networks. In Proceedings of the 2014 IEEE 10th International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob), Larnaca, Cyprus, 8–10 October 2014. [Google Scholar] [CrossRef]
  33. You, X.; Sun, T.; Sun, D.; Liu, X.; Lv, X.; Buyya, R. K-ear: Extracting data access periodic characteristics for energy-aware data clustering and storing in cloud storage systems. Concurr. Comput. Pract. Exp. 2021, 33, e6096. [Google Scholar] [CrossRef]
  34. Doğan, E. Short-term Traffic Flow Prediction Using Artificial Intelligence with Periodic Clustering and Elected Set. Promet-Traffic Transp. 2020, 32, 65–78. [Google Scholar] [CrossRef]
  35. Wang, G.; Qin, W.; Wang, Y. Cyclic Weighted k-means Method with Application to Time-of-Day Interval Partition. Sustainability 2021, 13, 4796. [Google Scholar] [CrossRef]
  36. Novikov, A. PyClustering: Data Mining Library. J. Open Source Softw. 2019, 4, 1230. [Google Scholar] [CrossRef]
  37. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, LA, USA, 7–9 January 2007; pp. 1027–1035. [Google Scholar]
  38. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  39. Podlaski, K. Periodic K-Means Exemplary Implementation. Available online: https://github.com/kpodlaski/periodic-kmeans (accessed on 25 May 2022).
  40. Rand, W. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 1971, 66, 846–850. [Google Scholar] [CrossRef]
  41. Fowlkes, E.; Mallows, C. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 1983, 78, 553–556. [Google Scholar] [CrossRef]
  42. Hubert, L.; Arabie, P. Comparing partitions. J. Classif. 1985, 2, 193–218. [Google Scholar] [CrossRef]
  43. Warrens, M. On the equivalence of Cohen’s kappa and the Hubert-Arabie adjusted Rand index. J. Classif. 2008, 25, 177–183. [Google Scholar] [CrossRef] [Green Version]
  44. Fortuniak, K.; Pawlak, W.; Bednorz, L.; Grygoruk, M.; Siedlecki, M.; Zielinski, M. Methane and carbon dioxide fluxes of a temperate mire in Central Europe. Agric. For. Meteorol. 2017, 232, 306–318. [Google Scholar] [CrossRef]
  45. Podlaski, K.; Durka, M.; Gwizdałła, T.; Miniak-Górecka, A.; Fortuniak, K.; Pawlak, W. LSTM Processing of Experimental Time Series with Varied Quality. In Computational Science—ICCS 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 581–593. [Google Scholar]
  46. NYC Taxi and Limousine Commission (TLC). Available online: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (accessed on 16 May 2022).
Figure 1. Artificial dataset. (a) Data distribution; (b) data histogram.
Figure 2. Three artificial datasets used for k-means clustering experiments. (a) Histogram for the $D_0$ dataset. (b) Histogram for the $D_{90}$ dataset. (c) Histogram for the $D_{280}$ dataset.
Figure 3. Results of clustering using the k-means approach. The graphs above represent the original k-means approach; below, we present the results obtained by the proposed periodic modification of the k-means method. The clusters are distinguished by the colors of the appropriate bars in the histogram. Black arrows on the x-axis annotate the centers found by each method. (a) Clusters for dataset $D_0$. (b) Clusters for dataset $D_{90}$. (c) Clusters for dataset $D_{280}$.
Figure 4. Histogram of wind direction measured at Biebrza National Park’s wetlands in May 2016.
Figure 5. Results of clustering into two and three clusters of wind direction data using k-means approaches. The graph above represents the original k-means approach; below we have the results obtained by the proposed periodic modification of the k-means method. The clusters are distinguished by the colors of appropriate bars in the histogram. Black arrows on the x-axis annotate the centers found by each method. (a) Clustering into two clusters; (b) clustering into three clusters.
Figure 6. Results of clusterization of the New York taxi dataset with original and periodic k-means for the selected number of clusters and period set to one day and one week, respectively (see Table 4). The clusters are distinguished by the colors of the appropriate bars in the histogram. Black arrows on the x-axis annotate the centers found by each method. (a) Day, four clusters. (b) Day, seven clusters. (c) Day, twelve clusters. (d) Week, three clusters. (e) Week, five clusters. (f) Week, seven clusters.
Table 1. Comparison of clusters obtained in both k-means approaches.
Clustering 1 (dataset, method) | Clustering 2 (dataset, method) | Rand | Adjusted Rand | Arabie | Hubert | Fowlkes | Jaccard
D0, original | D0, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D0, original | D90, original | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955
D0, original | D90, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D0, original | D280, original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559
D0, original | D280, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D0, periodic | D90, original | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955
D0, periodic | D90, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D0, periodic | D280, original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559
D0, periodic | D280, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D90, original | D90, periodic | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955
D90, original | D280, original | 0.789 | 0.540 | 0.211 | 0.578 | 0.705 | 0.530
D90, original | D280, periodic | 0.981 | 0.961 | 0.019 | 0.962 | 0.977 | 0.955
D90, periodic | D280, original | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559
D90, periodic | D280, periodic | 1.000 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000
D280, original | D280, periodic | 0.804 | 0.575 | 0.196 | 0.607 | 0.731 | 0.559
Table 2. WCCS of the clusters obtained with both k-means procedures for the artificial datasets $D_0$, $D_{90}$, $D_{280}$. The last column presents the ratio $\mathrm{WCCS}_{\text{periodic}} / \mathrm{WCCS}_{\text{original}}$ for the given dataset.
Dataset | Method | WCCS | Ratio
D0 | original | 3,385,241 | 1.00
D0 | periodic | 3,385,241 |
D90 | original | 3,385,241 | 0.92
D90 | periodic | 3,653,802 |
D280 | original | 3,385,241 | 0.91
D280 | periodic | 3,695,809 |
Table 3. Results for clustering of the wind direction dataset into two and three clusters with original k-means and periodic k-means.
Algorithm | Centers | WCCS | Centers | WCCS
periodic | 2 | 1,851,331 | 3 | 662,869
original | 2 | 1,941,053 | 3 | 706,689
Table 4. Results for the clustering of the taxi dataset with data arranged for one day and one week periods, respectively (see Figure 6).
Day Data
Algorithm | Centers | WCCS | Centers | WCCS | Centers | WCCS
periodic | 4 | 3,619,456 | 7 | 1,306,400 | 12 | 452,409
original | 4 | 4,137,528 | 7 | 1,314,643 | 12 | 475,518
Week Data
Algorithm | Centers | WCCS | Centers | WCCS | Centers | WCCS
periodic | 3 | 641,157 | 5 | 232,192 | 7 | 82,427
original | 3 | 644,370 | 5 | 232,805 | 7 | 84,376