2.1. Idea of Hybrid Clustering
Clustering methods are widely used to deal with large-scale data mining problems. Aiming to group data into homogeneous clusters based on their similarities, existing clustering methods are categorized into density-based, partition-based and hierarchical-based methods, among others. Among the various clustering methods, k-means has been widely applied as a simple and representative partition-based clustering algorithm [20]. However, its drawbacks are that it is only suitable for distinguishing clusters with hyperspherical data distributions and that its clustering results are very sensitive to the initial parameters. Therefore, several extensions of the original k-means algorithm have been presented. For example, Ref. [
21] presented an improved initialization scheme of k-means that makes the initial center points as far apart from each other as possible. Ref. [
22] designed the K-MODES algorithm to solve the disadvantage that k-means can only deal with numerical data. Density-based methods such as the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [
23] can automatically determine the number of cluster centers and identify arbitrarily shaped clusters. However, the clustering results of DBSCAN depend strongly on its global parameter settings, which makes it unable to handle clusters with distinct densities. In addition, algorithms like DBSCAN perform only a single pass to calculate cluster centers and assign a cluster label to each data sample, without any iteration or optimization. Subsequently, a method called Clustering by Fast Search and Find of Density Peaks (CFSFDP) was devised by [24], which assumes that cluster centers are local density maxima located far away from any points of higher density. Since CFSFDP selects as centers only high-density points that are relatively far from each other, it tends to split a cluster containing multiple high-density points into several clusters. To address this drawback, a density peak clustering algorithm using global and local consistency-adjustable manifold distances (McDPC) was proposed [
25]. Hierarchical-based clustering methods can be classified as being either bottom-up [
26] or top-down [
27] based on how the hierarchical decomposition is formed. In these methods, some parameters need to be specified in advance, and the clustering results are highly sensitive to them, which limits the application scope of these methods.
Table 1 lists the advantages and disadvantages of several different kinds of artificial intelligence methods for pulsar candidate selection. As can be seen in
Table 1, the idea of hybrid clustering is to effectively combine the advantages of the aforementioned clustering algorithms, which makes it more suitable for data with multiple distribution shapes. Therefore, it can further ensure the depth and stability of data mining for pulsar candidate signals compared with other kinds of methods. Similarly, the goal of the hybrid clustering scheme of PHCAL is to combine density hierarchy-based and partition-based clustering ideas, which brings the following benefits: (i) the adaptability for automatically determining the cluster number and the initial cluster centers on a dataset with an irregularly shaped distribution is improved; (ii) multi-center clusters can be identified, and clusters with lower local density can be detected to a certain extent; (iii) more stable clusters and outliers can be obtained through the similarity metric and iterative calculation. The detailed steps of this hybrid clustering scheme are described in the next section.
2.2. The Hybrid Clustering Scheme
The detailed steps of the hybrid clustering scheme of PHCAL are summarized as follows:
Step 1: Data preprocessing. Feature selection and dimensionality reduction of the pulsar candidate data are carried out by Principal Component Analysis (PCA) so as to obtain input datasets in a new, lower-dimensional feature space. Notice that the optional physical features of a pulsar candidate mainly consist of the pulse radiation (e.g., single peak, double peak and multi-peak), period, DM, SNR, noise signal, signal ramp, and the sum of incoherent power and coherent power.
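As an illustration of this preprocessing step, a minimal Python sketch is given below; the random feature matrix, the standardization step and the 95% variance threshold are assumptions made for demonstration only and are not the settings prescribed by PHCAL.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per pulsar candidate, one column per
# physical feature (e.g., period, DM, SNR, ...); values are random stand-ins.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(1000, 8))

# Standardize the features so that no single feature dominates the variance.
X_scaled = StandardScaler().fit_transform(X_raw)

# Keep enough principal components to explain 95% of the variance
# (the actual number of retained components is not specified here).
pca = PCA(n_components=0.95)
X = pca.fit_transform(X_scaled)

print(X.shape, pca.explained_variance_ratio_)
```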
Step 2: To better reflect the different data structures of clusters in the clustering results, a k-nearest neighbor-based local polynomial kernel function is adopted for the density computation on the input dataset. The Mahalanobis distance between any two data points x_i and x_j is defined as Equation (1), where S is the covariance matrix of the multi-dimensional random variables. Then ρ_i, the local density of data point x_i, can be calculated by the k-nearest neighbor-based polynomial kernel function shown in Equation (2), since the global characteristics of the polynomial function give it strong generalization performance.
where c is the offset coefficient and d is the order of the polynomial. To eliminate the effect of data variation and differences in numerical scale, both ρ_i and δ_i are processed by dispersion standardization according to Equations (3) and (4), where ρ_min and δ_min denote the minimum values of ρ and δ, respectively, while ρ_max and δ_max denote their maximum values.
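The following sketch illustrates this part of Step 2 up to the normalization. The kNN polynomial-kernel density is written in one plausible form with a placeholder offset c and order d, so it should be read as an illustration of the idea rather than the exact Equations (1)-(4).

```python
import numpy as np
from scipy.spatial.distance import cdist

def knn_polynomial_density(X, k=10, c=1.0, d=2):
    """Local density from the k nearest neighbors under the Mahalanobis metric.

    One plausible reading of Equations (1)-(2): a point's density grows when its
    k nearest neighbors are close to it. The kernel form, offset c and order d
    are illustrative assumptions, not the paper's exact formula.
    """
    # Equation (1): Mahalanobis distance with S the covariance matrix of the data.
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    D = cdist(X, X, metric="mahalanobis", VI=S_inv)

    # k nearest neighbors of each point (column 0 is the point itself).
    knn_dist = np.sort(D, axis=1)[:, 1:k + 1]

    # Polynomial-kernel style aggregation over the k neighbors (assumed form):
    # closer neighbors contribute larger terms.
    rho = np.sum((1.0 / (knn_dist + c)) ** d, axis=1)
    return rho, D

def min_max(v):
    """Dispersion standardization (Equations (3)-(4)): rescale to [0, 1]."""
    return (v - v.min()) / (v.max() - v.min())

# Example usage on the PCA-reduced candidate matrix X from Step 1:
# rho, D = knn_polynomial_density(X)
# rho_norm = min_max(rho)
```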
Then, outliers are removed from all data points by Equation (5), which is helpful for the selection of cluster center points. Note that the outliers screened out in this way have very small density values, are very few in number and are marginalized in the data distribution; moreover, they could be special pulsars that need to be examined further. In addition, δ_i denotes the minimum distance between data point x_i and any other data point of higher density, and it can be calculated according to Equation (6).
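Continuing the sketch above, δ_i can be computed with the usual density-peak rule (the distance to the nearest point of higher density), while the outlier screening of Equation (5) is represented here by a simple threshold on the normalized density; the threshold is only a stand-in for the actual criterion.

```python
import numpy as np

def delta_from_density(rho, D):
    """Equation (6): for each point, the minimum distance to any point of
    higher density; the globally densest point gets the maximum distance."""
    n = len(rho)
    delta = np.zeros(n)
    for i in range(n):
        higher = np.where(rho > rho[i])[0]
        if higher.size == 0:
            delta[i] = D[i].max()          # densest point overall
        else:
            delta[i] = D[i, higher].min()
    return delta

def screen_outliers(rho_norm, eps=0.02):
    """Stand-in for Equation (5): flag points with very small normalized density
    as outliers (possible special pulsars to be inspected separately)."""
    return rho_norm < eps

# Example usage (continuing from the previous sketch):
# delta = delta_from_density(rho, D)
# delta_norm = min_max(delta)
# outlier_mask = screen_outliers(rho_norm)
```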
Step 3: A combination of the density peak and hierarchy ideas is used to divide the multi-density clusters into levels so as to determine the initial cluster center points. All the values of ρ and δ can be plotted in the two-dimensional decision graph, as shown in Figure 1. An interval length is defined to equally divide the vertical axis, which denotes δ_i of data point x_i; this vertical division process is named the δ-cut. Similarly, another interval length is defined to equally divide the horizontal axis, which denotes ρ_i of data point x_i; this horizontal division process is named the ρ-cut. In the derived decision graph, the algorithm can automatically distinguish different density levels by using the ρ-cut and then identify the representative data points in these density levels, which are further applied to obtain all micro-clusters. Finally, the micro-cluster grouping is carried out in two ways: (i) at a low density level (not the maximum density level), the corresponding micro-clusters are combined into one cluster; (ii) at the maximum density level, all the identified representative data points are selected as cluster centers if they are at the same δ level; on the contrary, they are merged into one single cluster if they are spread across different δ levels in the same region.
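A strongly simplified sketch of the level-division idea is shown below: the normalized density axis is cut into equal-width intervals, and points with large δ inside each level are taken as representative points of micro-clusters. The number of levels and the δ bound are placeholders, and the merging rules (i) and (ii) are not implemented here.

```python
import numpy as np

def density_levels(rho_norm, n_levels=5):
    """Cut the normalized density axis into equal-width intervals (the
    horizontal cut of the decision graph); returns a level index per point."""
    edges = np.linspace(0.0, 1.0, n_levels + 1)
    return np.clip(np.digitize(rho_norm, edges) - 1, 0, n_levels - 1)

def representative_points(rho_norm, delta_norm, n_levels=5, delta_bound=0.3):
    """Within each density level, pick points with large delta as the
    representatives of micro-clusters (placeholder threshold)."""
    levels = density_levels(rho_norm, n_levels)
    reps = {}
    for lv in range(n_levels):
        idx = np.where((levels == lv) & (delta_norm > delta_bound))[0]
        if idx.size > 0:
            reps[lv] = idx
    return reps

# Example usage (continuing from the previous sketches):
# reps = representative_points(rho_norm, delta_norm)
# k = sum(len(v) for v in reps.values())   # before applying the merging rules
```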
Step 4: After the number of clusters k and the initial cluster centers have been determined, a k-means iteration whose distance computation is based on the Gaussian radial basis function (RBF) kernel is introduced to allocate all data points and optimize the cluster centers. According to the principle of proximity, each data point x_i is assigned to the nearest cluster C_j, i.e., the cluster whose center point μ_j is closest to x_i. The distance computation based on the RBF kernel is defined as Equation (7) to improve the similarity measure between two points.
where σ denotes the width of the kernel function. The local characteristics and strong learning ability of the RBF kernel help map the measured distances into a high-dimensional space. Subsequently, each cluster C_j is updated to C_j', and the corresponding center μ_j is refreshed to μ_j' according to Equation (8).
At each iteration, the sum of squared errors (SSE) over all data points is calculated according to Equation (9). The program ends once the SSE value no longer changes compared with the previous round; otherwise, it proceeds to the next refreshing cycle. The detailed flow of the hybrid clustering scheme is summarized in
Figure 2.
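The iteration of Step 4 can be sketched as follows, under two assumptions: the RBF-based distance of Equation (7) is written in the common kernel-distance form d^2(x, mu) = 2 - 2*exp(-||x - mu||^2 / (2*sigma^2)), and the center refresh of Equation (8) is reduced to the plain cluster mean; sigma and the initial centers are placeholders.

```python
import numpy as np

def rbf_distance(X, centers, sigma=1.0):
    """Assumed form of Equation (7): kernel-induced squared distance
    d^2(x, mu) = 2 - 2 * exp(-||x - mu||^2 / (2 * sigma^2))."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return 2.0 - 2.0 * np.exp(-sq / (2.0 * sigma ** 2))

def kernel_kmeans_iterate(X, init_centers, sigma=1.0, max_iter=100):
    """Assign each point to its nearest center under the RBF distance, refresh
    the centers (here: plain cluster means), and stop when the SSE stops changing."""
    centers = np.asarray(init_centers, dtype=float).copy()
    prev_sse = None
    for _ in range(max_iter):
        labels = rbf_distance(X, centers, sigma).argmin(axis=1)
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)    # assumed form of Equation (8)
        sse = ((X - centers[labels]) ** 2).sum()     # Equation (9): sum of squared errors
        if prev_sse is not None and np.isclose(sse, prev_sse):
            break
        prev_sse = sse
    return labels, centers, sse

# Example usage with the k initial centers determined in Step 3 (indices assumed):
# labels, centers, sse = kernel_kmeans_iterate(X, X[initial_center_idx])
```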
2.3. Data Partition Strategy for Parallelization
The statistics indicate that a drift scan observation with the FAST telescope can provide more than 30,000 pulsar candidates from one periodicity search (10 h or more). Furthermore, these candidates can be stored and accumulated for offline analysis and grow to a large volume over time. Hence, it is necessary to study a parallel implementation of the hybrid clustering scheme based on parallelization models such as MapReduce/Spark. On the one hand, a reasonable data partition strategy further improves the Recall and Accuracy of the clustering results; on the other hand, the time performance can also be improved by reducing the number of comparisons among samples.
To delineate a more comprehensive range of pulsar identification and maximize the accuracy of candidate selection based on the data structure, the sliding window concept [
28] is introduced for data partitioning of candidate data streams. As shown in
Figure 3, a window with Batchsize =
w is delineated for each round. In addition, a relatively complete set selected from the real pulsar samples of each type is prepared and added to the data group to be detected (shadow areas) in each round in a specific ratio (
v:
w). A basic assumption of clustering is that samples in the same cluster are more likely to share the same label. Therefore, the decision boundaries are set according to the dense or sparse regions of the distribution of each type of data, so that the pulsar areas and non-pulsar interference signal areas can be determined and distinguished in the derived distribution graph. The proportion of pulsar samples in each cluster is calculated, and all clusters are sorted by this proportion in descending order. Then, the leading clusters whose proportion is greater than a certain threshold can be selected for a pulsar candidate list for further identification. Moreover, the outliers excluded by Step 2 in Section 2.2 are likely to be special pulsars, which should be investigated further. The proposed data partition strategy enables one data sample to appear in two or more blocks with different data distributions, which makes it possible for samples incorrectly divided in one block to be correctly identified in another block, thereby further improving the overall Recall and Accuracy of the clustering results.
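A minimal sketch of this partition strategy is given below; the window size w, the injection ratio v:w, the 50% overlap between consecutive windows and the random sampling of reference pulsars are placeholder choices made for illustration.

```python
import numpy as np

def sliding_window_blocks(candidates, known_pulsars, w=2000, v=200, step=1000, seed=0):
    """Yield overlapping blocks of size w from the candidate stream; each block
    is augmented with v known pulsar samples (ratio v:w) so that every block
    contains reference points for locating the pulsar-like regions."""
    rng = np.random.default_rng(seed)
    n = len(candidates)
    for start in range(0, n, step):            # step < w gives overlapping blocks
        block = candidates[start:start + w]
        if len(block) == 0:
            break
        ref = known_pulsars[rng.choice(len(known_pulsars), size=v, replace=True)]
        yield np.vstack([block, ref])

# Example usage:
# for block in sliding_window_blocks(all_candidates, pulsar_reference_set):
#     ...  # run the hybrid clustering of Section 2.2 on `block`
```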
2.4. A Spark-Based Parallelization Model
Spark is considered a general computing engine for large-scale data processing. Therefore, a preliminary Spark-based model of PHCAL is presented, whose computing process is shown in Figure 4. The specific steps are summarized as follows.
Step 1: Carry out the parallel data partition processing, and then calculate the density value of each sample point and the Mahalanobis distances between sample points within each block. Notice that a task is generated for each block during this calculation.
Step 2: For each block, merge the corresponding micro-clusters according to step 3 in
Section 2.2, and then determine the initial cluster centers.
Step 3: Traverse the initial cluster center points of each block in turn; they will be transmitted to each worker node as broadcast variables in each cycle for parallel calculation.
Step 4: For the current block, read all the data points into memory if they have not been loaded yet, and then find the nearest cluster center for each point; all the corresponding relationships are collected and stored in the form of key-value pairs. Notice that M denotes the number of partitions when the data are read into memory, and M is set by the system according to the hardware configuration.
Step 5: Use the map operator and aggregation operators to handle the key-value pair variables so as to update the cluster centers of the block according to step 4 in
Section 2.2.
Step 6: Judge whether the distance between each new center and its corresponding old center is less than the set threshold. If so, the clustering of this block is complete and the process moves to the next block; otherwise, repeat Steps 4–6.
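Steps 3–6 of this model can be sketched with PySpark as follows; the RDD layout, the Euclidean nearest_center helper, the convergence threshold and the mean-based center update are illustrative assumptions rather than the exact implementation (Section 2.2 uses an RBF-based distance).

```python
import numpy as np
from pyspark import SparkContext

def nearest_center(point, centers):
    """Return the index of the nearest center (Euclidean here for brevity)."""
    dists = [float(np.sum((point - c) ** 2)) for c in centers]
    return int(np.argmin(dists))

def cluster_block(sc, block, init_centers, threshold=1e-4, max_iter=50):
    """One block of the partitioned data: broadcast the centers, assign points
    in parallel, and refresh the centers until they move less than `threshold`."""
    rdd = sc.parallelize(block.tolist()).map(np.array).cache()
    centers = [np.array(c, dtype=float) for c in init_centers]
    for _ in range(max_iter):
        bc = sc.broadcast(centers)                    # Step 3: broadcast variables
        # Step 4: key-value pairs (cluster index, (point, 1))
        pairs = rdd.map(lambda p: (nearest_center(p, bc.value), (p, 1)))
        # Step 5: aggregate per cluster and compute the new centers
        sums = pairs.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
        new_centers = list(centers)
        for j, (s, cnt) in sums:
            new_centers[j] = s / cnt
        # Step 6: stop when all centers have converged
        shift = max(float(np.linalg.norm(nc - oc)) for nc, oc in zip(new_centers, centers))
        centers = new_centers
        if shift < threshold:
            break
    return centers

# Example usage:
# sc = SparkContext(appName="PHCAL-sketch")
# centers = cluster_block(sc, block, init_centers)
```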
2.5. Time Complexity Analysis
To improve the clustering effect, the clustering scheme of PHCAL combines the ideas of density hierarchy-based and partition-based clustering algorithms. Therefore, the serial clustering process of PHCAL without applying the parallelization scheme in
Section 2.3 and
Section 2.4 is more complex than the other three common serial classifiers, i.e., k-means++ [
21], McDPC [
25] and KNN [
29]. However, the correct parallelization method can make up for this deficiency as much as possible. Let the number of samples in the experimental dataset be
n; then
Table 2 shows the time complexity of the serial mode of PHCAL compared to these three serial algorithms. (i) The time complexity of k-means++ depends on n, k, T and M; since k, T and M are generally considered constants, it can be simplified to O(n). (ii) For McDPC, the time complexity of computing the parameters ρ and δ is O(n²), and the time complexity of the clustering based on different density levels is lower, so the whole algorithm is O(n²). (iii) The complexity of KNN listed in Table 2 is taken from its worst-case calculation. (iv) The complexity of the serial mode of PHCAL also depends on n, k, T and M; since k, T and M are constants, it is simplified to O(n²), which is close to McDPC but higher than k-means++ and KNN. However, when running on the parallel model based on Spark, the complexity of PHCAL becomes a function of the block size according to the Sun-Ni theorem [30], where G(p) is the corresponding factor, m is the number of samples of a block and p is the number of parallel nodes. Notice that when p is sufficient (the value of p tends to reach a certain threshold related to the number of divided blocks) and the communication overhead is negligible, the complexity is close to O(m²).
It can be seen that, in theory, the time complexity of the parallel mode of PHCAL is obviously lower than that of its serial mode if the communication delay is low enough to be ignored, which is verified in practice later. Furthermore, the speedup (S_p) and parallel efficiency (E_p) are introduced to evaluate the parallel performance of PHCAL, defined as S_p = T_1/T_p and E_p = S_p/p, where T_1 denotes the running time of the serial mode and T_p denotes the running time of the parallel computation using p nodes. Generally, E_p ≤ 1. Theoretically, the parallel performance of PHCAL will stay the same under different data sizes and hardware resources as long as the following two conditions are met: (i) the value of p is large enough (tends to reach a certain threshold); (ii) the proportion of communication delay in the running time is very small or even negligible.
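As a worked example with hypothetical timings: if the serial mode takes T_1 = 1200 s and the parallel run on p = 8 nodes takes T_p = 200 s, then S_p = T_1/T_p = 6 and E_p = S_p/p = 0.75. The small helper below computes the same quantities.

```python
def speedup_and_efficiency(t_serial, t_parallel, p):
    """S_p = T_1 / T_p and E_p = S_p / p (standard definitions)."""
    s_p = t_serial / t_parallel
    return s_p, s_p / p

# Hypothetical timings, for illustration only:
print(speedup_and_efficiency(1200.0, 200.0, 8))   # -> (6.0, 0.75)
```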