Article

Analysis of University Students’ Behavior Based on a Fusion K-Means Clustering Algorithm

School of Reliability and Systems Engineering, Beihang University, Beijing 100191, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2020, 10(18), 6566; https://doi.org/10.3390/app10186566
Submission received: 18 August 2020 / Revised: 16 September 2020 / Accepted: 17 September 2020 / Published: 20 September 2020
(This article belongs to the Special Issue Data Analytics and Machine Learning in Education)

Abstract

With the development of big data technology, building the ‘Digital Campus’ has become a topic of wide interest. Traditional data mining algorithms are not suited to the ever-increasing volume of data. Clustering is becoming more and more important in the field of data mining, but traditional clustering algorithms do not consider clustering efficiency and clustering effect together. In this paper, an algorithm based on K-Means and clustering by fast search and find of density peaks (K-CFSFDP) is proposed, which improves on how the distance and density of data points are handled. The method is used to cluster students from four universities. The experiments show that the K-CFSFDP algorithm achieves better clustering results and higher running efficiency than the traditional K-Means clustering algorithm, and it performs well on large-scale campus data. Additionally, the cluster analysis shows that students in different categories at the four universities performed differently in living habits and learning performance, so a university can learn about the behavior of each category of students and provide corresponding personalized services, which has practical significance.

1. Introduction

With the continuous development of information technology, the data accumulated in the campus information environment is expanding steadily, and a complete campus big data environment has formed. Traditional campus management concepts and data analysis methods have been unable to meet the growing data processing needs. How to effectively manage and share campus data, use big data mining ideas to optimize student management, and provide clearer, more detailed data services for students is a problem faced by today’s campus service systems.
At present, there is much research on university students’ behavior, covering multiple aspects of behavior. Some researchers analyzed college students’ physical activities (PA) to find PA patterns and their determinants, which helps college students notice their health conditions [1]. Belingheri examined the prevalence of smoking, binge drinking, physical inactivity, and excessive bodyweight in a population of healthcare students attending an Italian university [2]. S.Y. Park studied university students’ behavioral intention to use mobile learning [3]. J. Kormos investigated the influence of motivational factors and self-regulatory strategies on autonomous learning behavior [4]. Most of the above studies used traditional mathematical statistical methods to obtain data information but did not further explore the laws behind the data. Moreover, it can be seen from these studies that the learning behavior and life behavior of university students are the school’s main concerns. Therefore, we mainly focus on the learning performance and living habits of university students and classify them accordingly.
Educational data mining (EDM) can discover hidden patterns and knowledge from a large amount of data. Many scholars have applied data mining to the analysis of university students’ behaviors to find the laws behind them. For example, Man Wai Lee used data mining techniques to analyze learner preferences; the study found that field-independent learners frequently used the backward/forward buttons and spent less time on navigation, while field-dependent learners often used the main menu and made more repeated visits [5]. Jingyi Luo used the free comments written by students at the end of each class to continuously track students’ learning and improve the prediction of students’ final grades [6]. Arat G. analyzed the relationship between risk behaviors and resilience among South Asian minority youth and discussed the resilience-based research, practical, and social policy implications [7]. Zullig K.J. examined the association between prescription opioid misuse and suicidal ideation, suicide plans, and suicide attempts among adolescents; the research demonstrated the harmful effects of prescription opioid misuse and its association with suicidal behaviors [8]. Natek analyzed the course scores of 106 undergraduates and used a decision tree algorithm to discover the factors that affect undergraduates’ course grades [9]. S.K. Yadav applied decision tree algorithms to students’ past performance data to generate a model for predicting student performance, which helps in identifying dropouts and students who need special attention earlier [10]. Some of the above studies used supervised machine learning methods. In supervised classification, the “label” of each university student is determined, that is, the types of university students are defined in advance, and the work done is only to “classify” students by analyzing their behavior data. However, in most cases, the number of student types in a university cannot be known in advance, so unsupervised machine learning methods are needed for clustering.
The clustering algorithm is a very important technology in data mining. Clustering is the process of assigning samples to different classes: in the clustering results, samples in the same category tend to have greater similarities, while samples in different classes have greater dissimilarities. The goal of clustering is to divide the data based on a specific similarity measure. Clustering algorithms are suitable for describing students’ behaviors. Based on a cluster analysis of 320,000 students, Saenz V.B. found a huge difference in the utilization rate of support services among different student groups [11]. Rapp K. performed a cluster analysis on smoking cessation among German student nurses [12]. O.R. Battaglia investigated students’ behavior by using cluster analysis as an unsupervised methodology in the field of education [13]. Other researchers have also conducted student behavior cluster studies [14,15,16].
The above literature analysis shows that there are many papers that use data mining techniques to analyze student behavior. However, there are not many papers that analyze college students’ behavior using clustering methods, and some of the papers that did use clustering algorithms did not take the clustering effect and efficiency into account when analyzing students’ behaviors. Currently, commonly used clustering methods include partition-based [17], hierarchy-based [18], model-based [19], and density-based [20] clustering. The K-Means clustering algorithm is a traditional clustering algorithm proposed by Macqueen, which is simple and efficient [21]. At the same time, it has the advantages of scalability and high efficiency for processing large data sets, and it has a wide range of applications [22,23,24,25,26]. Some scholars have also applied K-Means to the behavior analysis of university students. For example, P.D. Antonenko illustrated the use of K-Means clustering to analyze characteristics of learning behavior while learners engage in a problem-solving activity in an online learning environment [27]. C.Y. Yang analyzed the characteristics of big data on university campuses and adopted the K-Means algorithm to propose an early warning system for college students’ behavior based on the Internet of Things and a big data environment [28]. S.E. Sorour proposed a new approach based on text mining techniques for predicting student performance using latent semantic analysis and K-Means clustering [29].
Although the K-Means algorithm has many advantages, it still has serious flaws that greatly limit its application. It has two major drawbacks. (1) The value of k needs to be specified in advance. Manually assigning the number of clusters is inaccurate: each data set has its own characteristics, and arbitrarily specifying k for unknown data does not reflect the clustering structure of the data itself, so the clustering result involves a great deal of randomness. (2) The K-Means algorithm starts by randomly assigning cluster centers. Different initial centers often lead to different final clustering results; the algorithm depends heavily on the initial cluster centers, and the final iterated centers are not necessarily the globally optimal ones. There are many improvements to the K-Means algorithm, but these two problems have not been solved effectively. Some algorithms optimize the value of k; the main approach is to assign a number of candidate values to k in advance and determine the best one by measuring the sum of squared clustering errors. This approach is a great advance over earlier practice, but it is inefficient: under certain k values, the initial clustering may have no solution, so multiple runs are required to obtain a clustering result. Although better k values can be obtained this way, a lot of time is spent in the early stage, which is unrealistic when dealing with large amounts of data. The common algorithm in this class is K-Means++. There are also optimization algorithms for the cluster centers; Wang W. proposed a way to optimize cluster centers for detecting images [30]. The clustering by fast search and find of density peaks (CFSFDP) algorithm was proposed in 2014 [31]. It takes the characteristics of cluster centers and data into account; its logic is simple, but its clustering results are much better than those of traditional clustering algorithms. However, the efficiency of the algorithm is low, and it cannot meet the time requirements of clustering large amounts of data. There are also some other clustering algorithms [32,33,34,35,36,37]. Minaei-Bidgoli B. considered the effects of resampling and adaptive methods on the clustering effect [38]. Alizadeh H. selected categories based on a new cluster stability measure [39]. Parvin H. combined the ant colony algorithm with a clustering algorithm [40]. Some studies of clustering algorithms considered weighting [41,42,43]. The application of fuzzy clustering is also becoming more and more widespread [44,45,46,47].
The algorithms above can achieve better clustering results; however, the way they select cluster centers makes them very inefficient. Aiming to overcome the disadvantages of both K-Means and CFSFDP in application, we propose a new algorithm to describe students’ behavior.
The innovations and contributions of this research are as follows:
  • We applied the relevant theories and knowledge of data mining and machine learning to the analysis of university students’ behavior. This is an application innovation in the field of machine learning and education.
  • Compared with the traditional behavior analysis model of college students, this study used data for analysis, which reduces the subjectivity of human judgment and avoids prejudice caused by preconceptions. Therefore, the analysis results of this study are more objective.
  • Most current research used the K-Means algorithm, but the number of student categories and the cluster centers are difficult to determine. The K-Means and clustering by fast search and find of density peaks (K-CFSFDP) algorithm proposed in this study can automatically determine the number of student behavior types and their typical representatives based on the data. Therefore, K-CFSFDP has high flexibility and wide applicability, and it also avoids human intervention in the clustering process.
  • The K-CFSFDP algorithm proposed in this research did not rely completely on the CFSFDP framework; instead, it improved CFSFDP in several aspects. Therefore, the running time is shorter and the running efficiency higher, which is an application advantage in the campus big data environment.

2. Materials and Methods

2.1. Students’ Behavior Data from Four Universities

2.1.1. Behavior Analysis Indicators

The data were obtained from 4 universities in China. For convenience, S1, S2, S3, and S4 represent the 4 different universities. We mainly described university students’ behavior in two aspects: living habits and learning performance.
The evaluation indexes of living habits are shown in Table 1.
The evaluation indexes of learning performance are shown in Table 2.

2.1.2. Data Normalization

As the dimensions of the various behavior indicators are not uniform, this study used min-max standardization to normalize the data. This is a linear transformation of the original data, which maps the data values onto the [0, 10] interval. The transformation function is as follows:
$$x^* = \frac{10(x - \min)}{\max - \min}$$
where max is the maximum value of the data and min is the minimum value of the data.
After normalization, the average values of the data were calculated. Then the evaluation indexes of living habits and learning performance were obtained, with values ranging from 0 to 10; the higher the value, the better the performance.
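As a minimal sketch of this preprocessing step, the transformation in formula (1) can be written in a few lines of Python; the raw indicator values below are hypothetical and only illustrate the mapping onto [0, 10]:

```python
import numpy as np

def min_max_scale(x, new_max=10.0):
    """Linearly map raw indicator values onto [0, new_max], as in formula (1)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    return new_max * (x - x_min) / (x_max - x_min)

# Hypothetical raw 'average daily study hours' for five students.
print(min_max_scale([2.5, 4.0, 1.0, 6.5, 3.0]))
# -> [2.7273 5.4545 0.     10.    3.6364] (approximately)
```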
The data from the 4 different universities are shown in Table 3:

2.1.3. Data Visualization

In order to provide more details of the data, this study used violin plots, box plots, scatter plots and scatter plot matrices to visualize the data. From the data visualization chart, we could get the distribution law and aggregation degree of the data.
The violin plot can display the distribution status and probability density of multiple sets of data. The violin chart of the data is shown in Figure 1. Green represents the living performance score, blue represents the learning performance score, and the abscissa corresponds to 4 different universities. It can be seen from the violin chart that the distribution of student behavior data in different schools is different. The living data and learning data of students in the same school are not the same. In addition, it can be intuitively seen from the figure that the student behavior data does not meet the common normal distribution, and the data is concentrated near certain values, so there is a possibility of clustering.
A box plot is a statistical chart that can display data dispersion. It can display the maximum, minimum, median, outlier, and upper and lower quartiles of a set of data. It can provide key information about the location and dispersion of data, especially when comparing different university student behavior data. The box diagram of the data is shown in Figure 2. Green represents the living performance score, blue represents the learning performance score, and the abscissa corresponds to 4 different universities. The horizontal line in the middle of the box is the median of the data, and the upper and lower ranges of the box correspond to the upper and lower quartiles of the data. The horizontal line above the box is the maximum value, and the horizontal line below the box is the minimum value. The solid points are outliers. It can be seen from the figure that, overall, the learning scores of college students in most schools were lower than living scores. In addition, the center points of most data were between 4 and 7.
The data distribution scatter diagram is shown in Figure 3. It can be seen intuitively from the figure that the data points show the phenomenon of aggregation. Some data points are very densely distributed, while others are relatively sparsely distributed. We can see very intuitively from the data distribution map of the first university that these data points were clustered into 7 categories. This shows that many students had similar living habits and learning performance and could be divided into a certain number of categories.
Further, we drew scatter plot matrices to comprehensively show the distribution of the data and the shape of the aggregation. The scatter plot matrix can reflect the correlation between learning scores and living scores. The matrix includes three types of charts: the upper triangle contains the scatter plots of the data, the diagonal contains the probability density distributions, and the lower triangle contains the contour maps (the denser the data, the more concentrated the lines and the brighter the color).
The scatter plot matrices of the four universities are shown in Figure 4, Figure 5, Figure 6 and Figure 7 respectively. It can be seen from these scatter plot matrices that some data points were very densely distributed. This proves that some university students had similarities in living and study, and they gathered into a certain number of clusters. In addition, it can be seen from the figure that the number of distribution groups, the characteristics of study and life, and the degree of aggregation of student behavior in different universities were different. Therefore, the task of this research was to divide these categories using cluster analysis, so as to make a scientific and targeted classification of university student behavior. This helps school administrators to provide corresponding guidance for university students’ improvement in living and learning according to different categories.

2.1.4. Data Analysis and Algorithm Tools

The above data analysis and visualization show that the categories and distribution of student behaviors differed between schools, so it is impossible to use a unified baseline to classify student behaviors. That is, there is no universal student classification method to label each student’s category. In other words, the existing classification methods based on expert evaluation and educational standards are not objective and accurate enough and do not consider the diversity of student categories. Therefore, this is an unsupervised machine learning problem. Clustering is an effective method for solving unsupervised learning problems. We mainly used the K-CFSFDP algorithm to analyze the students’ behaviors and compared it with the K-Means and CFSFDP clustering algorithms.
The K-Means clustering algorithm is a classical algorithm that is widely used in fields such as data mining and knowledge discovery. Its principle is simple, its efficiency is high, and it is suited to big data sets. However, it has two major drawbacks. First, the k value needs to be set in advance; in most cases, the optimal number of categories for a data set cannot be determined beforehand, so it is difficult to choose a reasonable value. For example, in this study, the number of student groups in each school may not be the same. Second, the K-Means algorithm determines an initial partition based on the initial cluster centers and then optimizes it, and the choice of these initial centers has a great impact on the clustering result. If the initial values are chosen badly, an effective clustering result may never be obtained. For example, we cannot find a representative for each student category, especially when the number of student categories is unknown.
We proposed the K-CFSFDP algorithm to determine the k value and cluster center based on the characters of the data set.

2.2. Clustering by K-Means

The principle of the K-Means clustering algorithm is simple. Given a data set $X_i\ (i = 1, 2, 3, \dots, n)$, the $k$ value and the cluster centers are specified. The sum of squared errors (SSE) is the objective function of the K-Means clustering algorithm:
$$SSE = \sum_{i=1}^{k} \sum_{x \in C_i} |x - \bar{x}_i|^2$$
$C_i$ represents one class of the cluster result, $k$ represents the number of categories, and $\bar{x}_i$ is the mean of class $i$. When the objective function reaches its minimum value, the clustering effect is optimal.
The K-Means clustering algorithm proceeds in the following three steps:
Step 1: Assign samples to their nearest center vector and reduce the value of the objective function
$$\sum_{i=1}^{n} \min_{j \in \{1, 2, \dots, k\}} \|x_i - p_j\|^2$$
The distance between points adopts the Euclidean distance:
$$d(x_i, x_j) = \sqrt{\sum_{k=1}^{d} (x_{ik} - x_{jk})^2}$$
where $p_j$ denotes one of the $k$ cluster centers and $d$ is the number of attributes of $x$.
Step 2: Update the cluster means
$$\bar{x}_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
Step 3: Calculate the objective function
When the value of the objective function is the lowest, the cluster effect is optimal.
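The three steps can be condensed into the NumPy sketch below. It is an illustration, not the authors’ implementation: it assumes the samples are the rows of an array X and that the k initial centers are supplied externally (K-CFSFDP later supplies them automatically):

```python
import numpy as np

def k_means(X, centers, max_iter=100, tol=1e-6):
    """Plain K-Means: Step 1 assigns points to the nearest center, Step 2
    updates the cluster means, Step 3 tracks the SSE objective."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Step 1: Euclidean distance from every point to every center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: update each cluster mean (keep the old center if a cluster empties).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        # Step 3: SSE = sum of squared distances of points to their own center.
        sse = ((X - new_centers[labels]) ** 2).sum()
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return labels, centers, sse
```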

2.3. Determining the K Value and Cluster Center

To correct the two drawbacks of the K-Means clustering algorithm, this paper uses the way of determining cluster centers proposed by Alex Rodriguez and Alessandro Laio in the CFSFDP clustering algorithm, which is novel, simple, and fast, and which can find the right number of classes and the globally optimal cluster centers according to the data. The core of the algorithm is its description of the cluster centers [31].
There are two basic assumptions in the clustering algorithm:
  • Cluster centers are surrounded by neighbors with lower local density.
  • Cluster centers are at a relatively large distance from any points with a higher local density.
This clustering algorithm can be divided into four steps, which are introduced as follows.
Step 1: Calculate the local density
The data set for clustering is $S = \{x_i\}_{i=1}^{N}$, and $I_S = \{1, 2, \dots, N\}$ is the corresponding index set. $d_{ij} = d(x_i, x_j)$ represents a certain distance between points $x_i$ and $x_j$. According to the cut-off kernel, the local density $\rho_i$ of data point $i$ is defined as follows:
$$\rho_i = \sum_{j \in I_S \setminus \{i\}} \chi(d_{ij} - d_c)$$
where $\chi(x)$ is defined as follows:
$$\chi(x) = \begin{cases} 1, & x < 0; \\ 0, & x \ge 0. \end{cases}$$
Additionally, $d_c$ is a cutoff distance that needs to be specified in advance. Based on formula (6), $\rho_i$ is the number of data points whose distance from $x_i$ is less than $d_c$, not counting $x_i$ itself. To some extent, the parameter $d_c$ determines the effect of this clustering algorithm. If $d_c$ is too large, the local density value of each data point will be large, resulting in low discrimination; the extreme case is that $d_c$ is greater than the maximum distance between any two points, so the end result is that all points belong to one cluster. If $d_c$ is too small, the same group may be split into multiple clusters; the extreme case is that $d_c$ is smaller than the distance between any two points, which results in each point being its own cluster center. The reference method is to select $d_c$ so that the average number of neighbors per data point is about 1–2% of the total number of data points.
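One common way to implement this reference method, sketched here under the assumption that the full pairwise-distance list fits in memory (the name neighbor_rate is ours for the 1–2% fraction):

```python
import numpy as np
from scipy.spatial.distance import pdist

def choose_dc(X, neighbor_rate=0.02):
    """Pick d_c so that each point has, on average, about neighbor_rate * N
    neighbors: take that percentile of all pairwise Euclidean distances."""
    d = np.sort(pdist(X))  # all N*(N-1)/2 pairwise distances, ascending
    return d[int(neighbor_rate * len(d))]
```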
Step 2: Calculate the distance
A subscript sequence $\{q_i\}_{i=1}^{N}$ sorted by descending density was generated, satisfying
$$\rho_{q_1} \ge \rho_{q_2} \ge \dots \ge \rho_{q_N}$$
The distance formula is as follows:
$$\delta_{q_i} = \begin{cases} \min\limits_{j < i} \left\{ d_{q_i q_j} \right\}, & i \ge 2; \\ \max\limits_{j \ge 2} \left\{ \delta_{q_j} \right\}, & i = 1. \end{cases}$$
In the formula above, when $i = 1$ (the point with the highest local density), $\delta_{q_1}$ is the distance to the data point farthest from it in $S$. For $i \ge 2$, $\delta_{q_i}$ represents the distance between $x_{q_i}$ and the nearest data point (or points) among all data points with a local density greater than that of $x_{q_i}$.
Step 3: Select the clustering center
So far, the pair $(\rho_i, \delta_i)$, $i \in I_S$, of every data point can be obtained. Considering both quantities together, the following formula is used to select the cluster centers:
$$\gamma_i = \rho_i \delta_i, \quad i \in I_S$$
For example, Figure 8 contains 20 data points, and $\gamma_i = \rho_i \delta_i$, $i \in I_S$, can be obtained for every data point.
Next, the $\gamma$ values were computed to select the cluster centers. Figure 9 shows the $\gamma$ curve.
According to Figure 9, the curve is smooth for the non-center points, while there is a clear jump between the cluster centers and the non-center points.
Step 4: Categorize other data points
After the cluster centers were determined, the category labels of the remaining points were assigned according to the following principle: the category label of the current point equals the label of the nearest point with a density higher than that of the current point. This assignment takes considerable time.
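Steps 1–3 can be condensed into the sketch below, which assumes a precomputed symmetric distance matrix D with zero diagonal and a chosen d_c; it is an illustrative reconstruction of the published procedure, not the original code:

```python
import numpy as np

def cfsfdp_decision_values(D, dc):
    """Return (rho, delta, gamma) for every point, following Steps 1-3."""
    n = len(D)
    # Step 1: cut-off kernel density = number of points strictly within dc.
    rho = (D < dc).sum(axis=1) - 1            # subtract the zero self-distance
    # Step 2: delta = distance to the nearest point of higher density;
    # for the densest point, delta = distance to the farthest point.
    order = np.argsort(-rho)                  # indices by decreasing density
    delta = np.zeros(n)
    delta[order[0]] = D[order[0]].max()
    for pos in range(1, n):
        i = order[pos]
        delta[i] = D[i, order[:pos]].min()
    # Step 3: gamma = rho * delta; the largest values mark cluster centers.
    return rho, delta, rho * delta
```

The cluster centers are then the points with the largest γ values, their number being read off the jump in the sorted γ curve, as in Figure 9.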
The CFSFDP algorithm proposed by Alex Rodriguez and Alessandro Laio has the following defects:
1. The density calculation adopts the cut-off kernel function, so the clustering result depends heavily on $d_c$.
2. The authors did not provide a specific distance calculation formula. The distance measurement differs between problems, and the specific distance formula should be determined according to the actual problem.
3. The method of categorizing the other data points is inefficient. After the cluster centers have been found, each remaining point is assigned to the same cluster as its nearest neighbor of higher density, which causes unnecessary iterations and repeated calculations. Moreover, as the amount of data increases, the amount of computation rises sharply, resulting in a long running time, so it is difficult to meet the operational needs of campus big data. In addition, this assignment method does not take full advantage of the already determined number of clusters and cluster centers.
This study addressed the above problems in the K-CFSFDP algorithm, especially the data point assignment problem, so as to improve on the operating efficiency of the CFSFDP algorithm.

2.4. K-CFSFDP Algorithm

Based on the way of determining the k value and cluster center, the K-CFSFDP algorithm was proposed, which mainly includes the following steps:
Step 1: Data preprocessing: normalization.
We first used formula (1) to standardize the data. This step was implemented in the data preprocessing stage.
Step 2: Calculate the density of each point:
The clustering set is $S = \{x_i\}_{i=1}^{N}$. We adopted the Gaussian kernel function to calculate the density. The formula is as follows:
$$\rho_i = \sum_{j \in I_S \setminus \{i\}} e^{-\left(\frac{d_{ij}}{d_c}\right)^2}$$
where $I_S = \{1, 2, \dots, N\}$ is the index set and $d_{ij} = \mathrm{dist}(x_i, x_j)$ represents the Euclidean distance between points $x_i$ and $x_j$.
Step 3: Calculate the distance value for each point:
  • The distance between points adopts the Euclidean distance shown in formula (4). The $d_{ij}$ were calculated, with $d_{ij} = d_{ji}$, $i < j$, $i, j \in I_S$.
  • According to $\{\rho_i\}_{i=1}^{N}$, the descending-order subscript sequence $\{q_i\}_{i=1}^{N}$ was generated.
  • The distance values $\{\delta_i\}_{i=1}^{N}$ were calculated.
Step 4: Calculate the γ value:
Based on $\{\rho_i\}_{i=1}^{N}$ and $\{\delta_i\}_{i=1}^{N}$, $\{\gamma_i\}_{i=1}^{N}$ was calculated. The magnitudes of $\{\rho_i\}_{i=1}^{N}$ and $\{\delta_i\}_{i=1}^{N}$ may differ; if the difference is too large, it is necessary to perform a normalization step first.
Step 5: Determine the k value and the cluster center:
We selected the number of clusters (the $k$ value) and the cluster centers according to the decision graph and initialized the classification attribute tag $\{c_i\}_{i=1}^{N}$ of the data points as follows:
$$c_i = \begin{cases} t, & t \in [1, k], \text{ if } x_i \text{ is a cluster center and belongs to the } t\text{-th cluster}; \\ -1, & \text{otherwise}. \end{cases}$$
Step 6: Use the Euclidean distance (formula (4)) to classify the other points.
For each data point that is not a cluster center ($c_i = -1$), we calculated the Euclidean distance between the data point and each cluster center, selected the cluster center with the shortest distance, and assigned the data point to the category of that cluster center.
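Putting Steps 2–6 together, K-CFSFDP might be sketched as follows; this is our illustrative reconstruction, assuming the number of centers k has already been read off the decision graph and d_c has been chosen beforehand (e.g., with the percentile heuristic sketched in Section 2.3):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_cfsfdp(X, k, dc):
    """Sketch of K-CFSFDP: Gaussian-kernel density (Step 2), distance values
    (Step 3), gamma-based center selection (Steps 4-5), and one-shot
    nearest-center assignment (Step 6)."""
    D = squareform(pdist(X))                      # Euclidean distance matrix
    rho = np.exp(-(D / dc) ** 2).sum(axis=1) - 1  # Gaussian kernel, minus the self term
    order = np.argsort(-rho)
    n, delta = len(X), np.zeros(len(X))
    delta[order[0]] = D[order[0]].max()
    for pos in range(1, n):
        i = order[pos]
        delta[i] = D[i, order[:pos]].min()
    gamma = rho * delta
    centers = np.argsort(-gamma)[:k]              # the k largest gamma values
    # Step 6: every point joins its nearest center directly, with no iteration
    # over other non-center points.
    labels = D[:, centers].argmin(axis=1)
    return labels, centers
```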
Compared with CFSFDP, K-CFSFDP achieved improvements in the following aspects:
1. K-CFSFDP used a Gaussian kernel instead of the original cut-off kernel in CFSFDP. The cut-off kernel yields a discrete value while the Gaussian kernel yields a continuous one, so the Gaussian kernel has a smaller probability of conflict (i.e., of different data points having the same local density value). In addition, the Gaussian kernel still satisfies the property that the more data points lie within distance $d_c$ of $x_i$, the greater the value of $\rho_i$.
2. We clarified the measurement method of data point distance in K-CFSFDP.
3. Using the determined number of clusters and cluster centers, this study optimized the assignment of the other data points. Each data point only needs to compare its Euclidean distance to each cluster center to find the nearest cluster, without additional distance calculations to other non-center data points. This greatly reduces the computational complexity of the algorithm; assigning each data point to its nearest cluster center takes less time and improves efficiency.
Compared with the original K-Means algorithm, the advantage of the K-CFSFDP algorithm is that it can automatically select the appropriate number of classes and the initial cluster centers based on the characteristics of the data. This reduces human involvement in clustering.

2.5. Model Performance Metrics

In order to evaluate the performance of the clustering model, in addition to SSE and running time, we also adopted the following evaluation criteria: silhouette coefficient (SC) [48], Calinski–Harabasz index (CHI) [49], and Davies–Bouldin index (DBI) [50]. These are commonly used evaluation criteria for clustering performance measurement.

2.5.1. Silhouette Coefficient (SC)

For a good cluster, the distance between samples of the same category is very small, and the distance between samples of different categories is very large. The silhouette coefficient (SC) can evaluate both characteristics at the same time. A higher silhouette coefficient score relates to a model with better clusters.
The silhouette coefficient s for a single sample is given as:
$$s = \frac{b - a}{\max(a, b)}$$
where a is the mean distance between a sample and all other points in the same class and b is the mean distance between a sample and all other points in the next nearest cluster.
The silhouette coefficient for a set of samples is given as the mean of the silhouette coefficient for each sample [48].
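For reference, scikit-learn exposes this metric as silhouette_score; a small usage sketch on hypothetical (living habit, learning performance) scores:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Hypothetical normalized (living, learning) scores and cluster labels.
X = np.array([[8.0, 3.2], [7.8, 3.0], [4.1, 7.9], [4.3, 7.7], [6.0, 5.5], [6.1, 5.4]])
labels = np.array([0, 0, 1, 1, 2, 2])
print(silhouette_score(X, labels))  # closer to 1 = tighter, better-separated clusters
```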

2.5.2. Calinski–Harabasz Index (CHI)

A higher Calinski–Harabasz score relates to a model with better clusters. For a set of data E of size n E , which has been clustered into k clusters, the Calinski–Harabasz score s is defined as the ratio of the between-clusters dispersion mean and the within-cluster dispersion [49]:
$$s = \frac{\mathrm{tr}(B_k)}{\mathrm{tr}(W_k)} \times \frac{n_E - k}{k - 1}$$
where $\mathrm{tr}(B_k)$ is the trace of the between-group dispersion matrix and $\mathrm{tr}(W_k)$ is the trace of the within-cluster dispersion matrix, defined by:
$$W_k = \sum_{q=1}^{k} \sum_{x \in C_q} (x - c_q)(x - c_q)^T$$
$$B_k = \sum_{q=1}^{k} n_q (c_q - c_E)(c_q - c_E)^T$$
with $C_q$ the set of points in cluster $q$, $c_q$ the center of cluster $q$, $c_E$ the center of $E$, and $n_q$ the number of points in cluster $q$.

2.5.3. Davies–Bouldin Index (DBI)

A lower Davies–Bouldin index relates to a model with better separation between the clusters. The index is defined as the average similarity between each cluster $C_i$, $i = 1, \dots, k$, and its most similar cluster $C_j$ [50]. In the context of this index, similarity is defined as a measure $R_{ij}$ that trades off:
  • $s_i$, the average distance between each point of cluster $i$ and the centroid of that cluster (also known as the cluster diameter);
  • $d_{ij}$, the distance between the centroids of clusters $i$ and $j$.
A simple choice to construct $R_{ij}$ so that it is nonnegative and symmetric is:
$$R_{ij} = \frac{s_i + s_j}{d_{ij}}$$
Then the Davies–Bouldin index is defined as:
$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{i \ne j} R_{ij}$$
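Both indexes are likewise available in scikit-learn; a brief usage sketch on the same kind of hypothetical data as above:

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

X = np.array([[8.0, 3.2], [7.8, 3.0], [4.1, 7.9], [4.3, 7.7], [6.0, 5.5], [6.1, 5.4]])
labels = np.array([0, 0, 1, 1, 2, 2])
print(calinski_harabasz_score(X, labels))  # higher = denser, better-separated clusters
print(davies_bouldin_score(X, labels))     # lower = better separation
```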

3. Results

3.1. The Results of K-CFSFDP

First, we used the K-CFSFDP clustering algorithm to process the four data sets. According to this algorithm, the k value and the cluster centers can be determined. Figure 10 shows the density-distance decision graphs, where the colored points represent the selected cluster centers.
The $\gamma$ values are shown in Figure 11. In each data set, the $\gamma$ values of seven scattered points were relatively large, so each of the four data sets was divided into seven clusters. The cluster centers are shown in Table 4, and the clustering results are shown in Figure 12.

3.2. K-Means Clustering Algorithm

According to the traditional K-Means clustering algorithm, the k value needs to be specified in advance. We specified candidate values of k in turn, drew the sum of squared errors (SSE) curve, and then determined the specific value of k.
Because the cluster centers are selected randomly when the K-Means algorithm is initialized, the traditional K-Means clustering algorithm may fail to find a solution. To ensure the validity of the results, we conducted 10 experiments for each k value, took the average of each cluster center over the 10 experiments as the final cluster center for that k, and then calculated the SSE. The SSE curve is shown in Figure 13.
According to Figure 13, as the value of k increased, the SSE value decreased; when k was 7, the SSE reached its minimum, and as k continued to increase, the SSE gradually rose again, so the best number of clusters was 7. The clustering effect of the traditional K-Means algorithm is shown in Figure 14.
According to the SSE curve, the best value of k was 7, so we ran the traditional K-Means clustering algorithm with k = 7. Table 5 lists the mean values of the cluster centers over the repeated clustering runs.
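This k-selection procedure can be sketched with scikit-learn’s KMeans, averaging the SSE (exposed as inertia_) over repeated random initializations; the candidate range of k below is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def sse_curve(X, k_range=range(2, 11), n_runs=10):
    """Average the K-Means SSE over several random initializations for each
    candidate k, as done for the SSE curve in Figure 13."""
    sses = []
    for k in k_range:
        runs = [KMeans(n_clusters=k, init="random", n_init=1,
                       random_state=seed).fit(X).inertia_
                for seed in range(n_runs)]
        sses.append(np.mean(runs))
    return list(k_range), sses
```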

3.3. CFSFDP Cluster Algorithm

In addition to the traditional K-Means clustering algorithm, we also used the CFSFDP algorithm for clustering. Since K-CFSFDP in this paper determines its cluster centers in the same way as CFSFDP, the two algorithms shared the same cluster centers.
The clustering results are shown in Figure 15.

3.4. Evaluation and Comparison of Three Algorithms

The comparison results of each model under different evaluation criteria are shown in Table 6 and Table 7. The running time of the three algorithms is shown in Table 8.
In order to reflect the comparison results of the three models more clearly, we standardized the results in Table 6, Table 7 and Table 8, mapping the values onto the range [0, 5]. Then we visualized the results, as shown in Figure 16, Figure 17, Figure 18 and Figure 19. They show more intuitively the performance scores of the three models under five performance metrics at the different universities.

4. Discussion

Now we can analyze university students’ behavior based on the results shown in Table 4 and Figure 12. From Table 4, it can be found that the students’ behaviors differed among the four universities. S1 is taken as an example. The center point of the first category was (6.0389, 5.47474), where 6.0389 is the living-habit score and 5.47474 is the learning-performance score; this indicates that the living habits and learning performance of students in this category were moderate. The center point of the second category was (8.01667, 3.18542), which indicates that students in this category had good living habits but poor learning performance. The third category showed the opposite pattern: its center point was (4.15417, 7.85827), which shows that these students performed well in learning and badly in living habits.
Figure 12 shows that the students in each of these four schools could be divided into seven categories, but the distribution of the categories differed between schools. From Figure 12, most S1 students were located in the top right of the figure; compared with the other three universities, the S1 categories differed markedly from one another. Most of the S2 categories were in the lower right of the plot, which indicates better learning performance but poorer living habits. S3 and S4 students performed moderately in both living habits and learning performance.
We measured and compared the effects of the three algorithms on the behavior clustering of university students by calculating SC, CHI, DBI, and SSE. It can be seen from Table 6 and Figure 16a, Figure 17a, Figure 18a and Figure 19a that the SC and CHI of K-CFSFDP were higher than those of K-Means and CFSFDP on the four university data sets. It can be seen from Table 7 and Figure 16b, Figure 17b, Figure 18b and Figure 19b that the DBI of K-CFSFDP was lower than that of K-Means and CFSFDP on the four university data sets. According to the SSE value, it could be clearly found that the K-CFSFDP algorithm was obviously better than the traditional K-Means clustering algorithm, and the partial SSE of the K-CFSFDP algorithm was less than CFSFDP algorithm. This shows that the clustering effect of K-CFSFDP was better than that of K-Means and CFSFDP, and could better cluster the behavior of university students. The experimental results in Table 6 and Table 7 confirmed that because K-CFSFDP could determine the number of student categories (number of clusters) and student representatives (cluster centers), and used the Gaussian kernel function to calculate the point density, it has greater advantages. Therefore, compared with the other two algorithms, K-CFSFDP could better gather students with similar learning behaviors and living habits.
We compared the running time of the three algorithms on the same data. From Table 8 and Figure 16b, Figure 17b, Figure 18b and Figure 19b, it can be found that the traditional K-Means algorithm ran faster than the other two algorithms; however, the time spent in its early stage (determining k from the SSE curve) was much longer than for the other two algorithms. The running time of the K-CFSFDP algorithm was shorter than that of CFSFDP. Therefore, K-CFSFDP can perform clustering faster and reduce the computing time on large-scale campus data.
Compared with the current studies and methods on the behavior analysis of university students, the method proposed in this paper had considerable advantages in the following aspects:
(1) Traditional data mining and supervised machine learning methods often set labels in advance when classifying student behaviors. University student behavior data is often only used to analyze the relationship between behavior characteristics and labels. This does not take into account the diversity of student behavior and the knowledge contained in the data itself, and the evaluation criteria for labels are often subjective. For example, in accordance with a traditional supervised machine learning algorithm (such as decision trees, random forests, etc.), students with a score greater than 60 can be labeled as “good students”, and those with scores less than 60 can be labeled as “bad students”. Student behavior data and label data are input into a supervised machine learning model for training and analysis. After the training is completed, when the behavior data of a new student is input, the model will output which category the student belongs to. Obviously, this method has two problems. First, it is difficult to quantify the evaluation criteria of the label. People can question: “Why is the label threshold of good students 60 instead of 50 or 70?” Therefore, the judgment of labels is often vague, and the classification results of students are not objective. Second, the environmental conditions of different schools are different. It is possible that the difficulty of the exam of school B is greater than that of school A, and a student with a score of 50 may be defined as a “good student”. Therefore, the judgment of student labels is often complicated and the model is difficult to adjust flexibly. The method proposed in this paper does not need to determine in advance how many types of university students there are, nor does it need to determine in advance which type a student belongs to. This method uses unsupervised clustering to automatically classify students’ behavior data based on similarity, so the result is more objective and accurate, and it can reflect the impact of the university’s own characteristics on students’ behavior.
(2) The traditional K-Means clustering algorithm can select the number of clusters according to the SSE value, but the number of possible classes of the data needs to be estimated in advance. This is unrealistic for unfamiliar large data sets because it is impossible to determine the number of student behavior categories in advance. Additionally, the K-Means clustering algorithm may not find the cluster centers. As for CFSFDP, its computing time is relatively long, so it cannot be applied to large-scale campus data. The method proposed in this paper combines the advantages of the two algorithms: it can accurately determine the number of student behavior categories and the cluster centers, and it can also process large-scale university student behavior data at a reasonable speed.

5. Conclusions

In this paper, the K-CFSFDP algorithm based on K-Means and CFSFDP was proposed to analyze different university students’ behaviors. We first introduced the relevant research on the behavior analysis of university students and clarified that educational data mining is the current development trend. We noticed that clustering is an effective tool for the behavior analysis of university students. We found that the K-Means clustering algorithm has the disadvantage that the k value must be specified in advance and that its clustering effect depends on the initial cluster centers. Although the CFSFDP clustering algorithm has good clustering results, its operating efficiency is low; under the background of big data, the CFSFDP algorithm will be greatly restricted. Considering both the clustering effect and the running time, we constructed the K-CFSFDP algorithm as an effective tool to analyze the behavior of university students. In order to verify the effectiveness of the model, we collected data on the learning performance and living habits of 8000 students from four universities and used K-Means, CFSFDP, and K-CFSFDP to cluster these data. We judged the clustering effect and operating efficiency through the values of SC, CHI, DBI, and SSE and the running time. Comparing and analyzing the experimental results, we could draw the following conclusions:
  • University students with similar learning performance and living habits in each university gathered into a certain number of sets.
  • Clustering centers could reflect the behavioral characteristics of a certain category of students in the areas of learning performance and living habits.
  • The distribution of behavior categories of university students in different schools was not the same.
  • The K-CFSFDP algorithm could directly specify the appropriate k value and the optimal clustering center. That is, the algorithm could determine the number of student behavior types and behavior scores of each university.
  • K-CFSFDP had a better clustering effect than K-Means algorithm, and had a shorter running time than CFSFDP algorithm, so it could be applied to the analysis of university students’ behavior.
This study had certain practical significance in education and the pedagogical aspect. Teachers or school administrators could better obtain the categories and characteristics of student behavior. The practical value and significance of this study are as follows:
First, this study could achieve a scientific and reasonable classification of university students’ behavior and was simple to operate. For university decision makers and student administrators, they do not need to judge in advance which category a student belongs to. They only need to collect students’ behavioral data, and then input them into the model for clustering and analysis. This facilitates the management process of decision makers and avoids preconceived judgments about student behavior types. Since it is based on the data clustering, the result is more objective and accurate.
Second, this study can help school administrators provide targeted assistance to different student groups. The clustering results can show the differences between the types of students, which can help schools better understand student behavior. Schools can analyze and summarize the behavior characteristics of different types of students and take targeted measures to help each type develop good living habits and perform well in learning.
Third, the results of this research can provide feedback on the management effects of school administrators. For example, if a university has a small number of student groups, this means there is high similarity and a concentrated distribution of students’ behavior, indicating that student life on campus may be monotonous; as a result, school administrators should take measures to enrich students’ campus life. If a university’s student clusters are distributed densely near certain values and sparsely near others, education administrators can consider what caused the imbalance in the student distribution and take targeted improvement measures. For another example, if one group contains a large number of students and another only a few, education administrators can consider what caused this difference, so as to provide corresponding assistance to the minority student groups. This avoids ignoring the needs of minority student groups.
Fourth, the results of this research can help educators to further analyze what factors can affect and determine the behavior characteristics of students, such as using correlation analysis methods to study the relationship between students’ personal characteristics (such as gender, height, weight, etc.) and student behavior.
Fifth, this study provides a benchmark for the behavior classification of university students. Since K-CFSFDP determines the student behavior categories, it effectively provides a classification baseline, i.e., sample labels. Therefore, subsequent work can use supervised machine learning methods to analyze university student behavior.
This research was an application innovation of university student behavior analysis. There were application scope and applicable conditions. The application circumstances and data requirements of the algorithm proposed in this study were as follows:
1. University student behavior data was structured data.
Structured data refers to data that can be represented and stored in a relational database, and is represented in a two-dimensional form. The general characteristics are: the unit of data is “row”, a row of data represents the information of an entity, and the attributes of each row of data are the same. In this study, each row of the data represented each student, and each column was the student’s behavioral attributes (such as learning habits and living habits).
2. The behavior classification of college students was an unsupervised learning problem.
Machine learning can be divided into supervised learning and unsupervised learning. There are data and labels in supervised learning, and machine learning can learn a function that maps data to labels. There are many forms of label definition, such as classifying students as “good students” or “bad students” based on the threshold of test scores. The data for unsupervised learning has no labels. In this study, we obtained student behavior data through statistics without any labels. That is, we did not define in advance which categories the students belong to, nor did we define the characteristics of each category, so it was an unsupervised learning problem. This means that from the distribution map of student behavior data (the abscissa is the learning score, the ordinate is the living score), we could not intuitively judge how many categories these data could be divided into, nor could we judge the typical representative data points of each category. We used K-CFSFDP to automatically classify students based on the similarity of data.
3. The scale of college students’ behavior data was relatively large.
First, the volumes involved in collecting, computing, and storing big data are huge. Second, the dimensionality of the data is high. Third, the data grow very fast, so data acquisition and processing must also be fast. Fourth, the data value density is relatively low. In this study, the number of students in each university was often very large, reaching tens of thousands. This study counted data from a total of 8000 students, which is of a certain scale. In addition, we counted data on each student’s study and living habits; there were eight evaluation indicators (as shown in Table 1 and Table 2), so the data had a certain dimensionality.
The study in this paper suggests the following directions for further research. First, this paper analyzed student behavior only from the two dimensions of living habits and learning performance. The behavior of university students has multiple dimensions, such as social behavior, network behavior, etc. Future research will expand the dimensions of student behavior and test the clustering effect of K-CFSFDP on high-dimensional student behavior data. Second, this study used data from only four universities, so the number of data sets was small. In future research, we will investigate more universities to expand the number of data sets, so that a statistical test can be used to analyze and compare the clustering results of different methods. Third, each clustering algorithm has its own distance metric function, and no single distance metric suits all data. The K-CFSFDP algorithm in this paper still uses the Euclidean distance; different distance metrics should be adopted for different data characteristics. Fourth, the cutoff distance $d_c$ in CFSFDP has a significant impact on the clustering results, and K-CFSFDP did not further optimize $d_c$. Future work will explore how to choose the best $d_c$ in K-CFSFDP.

Author Contributions

Conceptualization, X.J.; methodology, X.J.; software, X.J.; validation, Y.L., Y.X., and W.C.; formal analysis, X.J., H.L., and B.C.; investigation, W.C.; resources, W.C.; data curation, Y.L. and B.C.; writing—original draft preparation, X.J.; writing—review and editing, S.Z.; visualization, X.J.; supervision, W.C.; project administration, S.Z.; funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant Nos. 71971013, 71871003, and 71501007). The study is also sponsored by the Fundamental Research Funds for the Central Universities (YWF-20-BJ-J-943) and the Graduate Student Education and Development Foundation of Beihang University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Keating, X.D.; Guan, J.; Castro-Piñero, J.; Bridges, D.M. A Meta-Analysis of College Students’ Physical Activity Behaviors. J. Am. Coll. Health 2005, 54, 116–125.
  2. Belingheri, M.; Facchetti, R.; Scordo, F.; Butturini, F.; Turato, M.; De Vito, G.; Cesana, G.; Riva, M.A. Risk behaviors among Italian healthcare students: A cross-sectional study for health promotion of future healthcare workers. La Med. del Lav. 2019, 110, 155–162.
  3. Park, S.Y.; Nam, M.-W.; Cha, S.-B. University students’ behavioral intention to use mobile learning: Evaluating the technology acceptance model. Br. J. Educ. Technol. 2012, 43, 592–605.
  4. Kormos, J.; Csizér, K. The Interaction of Motivation, Self-Regulatory Strategies, and Autonomous Learning Behavior in Different Learner Groups. TESOL Q. 2014, 48, 275–299.
  5. Lee, M.W.; Chen, S.Y.; Chrysostomou, K.; Liu, X. Mining students’ behavior in web-based learning programs. Expert Syst. Appl. 2009, 36, 3459–3464.
  6. Luo, J.; Sorour, S.E.; Goda, K.; Mine, T. Predicting Student Grade Based on Free-Style Comments Using Word2Vec and ANN by Considering Prediction Results Obtained in Consecutive Lessons. Int. Educ. Data Min. Soc. 2015, 396–399.
  7. Arat, G.; Wong, P.W. Examining the Association Between Resilience and Risk Behaviors Among South Asian Minority Students in Hong Kong: A Quantitative Study. J. Soc. Serv. Res. 2019, 45, 360–372.
  8. Zullig, K.J.; Divin, A.L. The association between non-medical prescription drug use, depressive symptoms, and suicidality among college students. Addict. Behav. 2012, 37, 890–899.
  9. Natek, S.; Zwilling, M. Student data mining solution–knowledge management system related to higher education institutions. Expert Syst. Appl. 2014, 41, 6400–6407.
  10. Yadav, S.K.; Bharadwaj, B.; Pal, S. Data mining applications: A comparative study for predicting student’s performance. arXiv 2012.
  11. Saenz, V.B.; Hatch, D.; Bukoski, B.E.; Kim, S.; Lee, K.-H.; Valdez, P. Community College Student Engagement Patterns. Community Coll. Rev. 2011, 39, 235–267.
  12. Rapp, K.; Büchele, G.; Jähnke, A.G.; Weiland, S.K. A cluster-randomized trial on smoking cessation in German student nurses. Prev. Med. 2006, 42, 443–448.
  13. Battaglia, O.R.; Di Paola, B.; Fazio, C. A New Approach to Investigate Students’ Behavior by Using Cluster Analysis as an Unsupervised Methodology in the Field of Education. Appl. Math. 2016, 7, 1649–1673.
  14. Quintiliani, L.M.; Allen, J.; Marino, M.; Kelly-Weeder, S.; Li, Y. Multiple health behavior clusters among female college students. Patient Educ. Couns. 2010, 79, 134–137.
  15. Head, M.; Ziolkowski, N. Understanding student attitudes of mobile phone features: Rethinking adoption through conjoint, cluster and SEM analyses. Comput. Hum. Behav. 2012, 28, 2331–2339.
  16. Patton, G.; Bond, L.; Carlin, J.B.; Thomas, L.; Butler, H.; Glover, S.; Catalano, R.; Bowes, G. Promoting Social Inclusion in Schools: A Group-Randomized Trial of Effects on Student Health Risk Behavior and Well-Being. Am. J. Public Health 2006, 96, 1582–1587.
  17. Cilibrasi, R.; Vitanyi, P.M.B. A Fast Quartet tree heuristic for hierarchical clustering. Pattern Recognit. 2011, 44, 662–677.
  18. Mirzaei, A.; Rahmati, M. A Novel Hierarchical-Clustering-Combination Scheme Based on Fuzzy-Similarity Relations. IEEE Trans. Fuzzy Syst. 2009, 18, 27–39.
  19. Xiao, J.; Xu, Q.; Wu, C.; Gao, Y.; Hua, T.; Xu, C. Performance Evaluation of Missing-Value Imputation Clustering Based on a Multivariate Gaussian Mixture Model. PLoS ONE 2016, 11, e0161112.
  20. Wang, X.; Liu, G.; Li, J.; Nees, J.P. Locating Structural Centers: A Density-Based Clustering Method for Community Detection. PLoS ONE 2017, 12, e0169355.
  21. Peng, K.; Leung, V.C.M.; Huang, Q. Clustering Approach Based on Mini Batch Kmeans for Intrusion Detection System Over Big Data. IEEE Access 2018, 6, 11897–11906.
  22. Niukkanen, A.; Arponen, O.; Nykänen, A.; Masarwah, A.; Sutela, A.; Liimatainen, T.; Vanninen, R.; Sudah, M. Quantitative Volumetric K-Means Cluster Segmentation of Fibroglandular Tissue and Skin in Breast MRI. J. Digit. Imaging 2018, 31, 425–434.
  23. Yuhui, P.; Yuan, Z.; Huibao, Y. Development of a representative driving cycle for urban buses based on the K-means cluster method. Clust. Comput. 2019, 22, 6871–6880.
  24. Slamet, C.; Rahman, A.; Ramdhani, M.A.; Darmalaksana, W. Clustering the verses of the Holy Qur’an using K-means algorithm. Asian J. Inf. Technol. 2016, 15, 5159–5162.
  25. Huang, X.; Ye, Y.; Zhang, H. Extensions of Kmeans-Type Algorithms: A New Clustering Framework by Integrating Intracluster Compactness and Intercluster Separation. IEEE Trans. Neural Netw. Learn. Syst. 2013, 25, 1433–1446.
  26. Liu, C.-L.; Chang, T.-H.; Li, H.-H. Clustering documents with labeled and unlabeled documents using fuzzy semi-Kmeans. Fuzzy Sets Syst. 2013, 221, 48–64.
  27. Antonenko, P.D.; Toy, S.; Niederhauser, D.S. Using cluster analysis for data mining in educational technology research. Educ. Technol. Res. Dev. 2012, 60, 383–398.
  28. Yang, C.Y.; Liu, J.Y.; Huang, S. Research on early warning system of college students’ behavior based on big data environment. ISPRS-Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2020, 42, 659–665.
  29. Sorour, S.E.; Mine, T.; Goda, K.; Hirokawax, S. A Predictive Model to Evaluate Student Performance. J. Inf. Process. 2015, 23, 192–201.
  30. Wang, W.; Song, W.; Liu, S.-X.; Zhang, Y.; Zheng, H.-Y.; Tian, W. A cloud detection algorithm for MODIS images combining Kmeans clustering and multi-spectral threshold method. Guang Pu Xue Yu Guang Pu Fen Xi 2011, 31, 1061–1064.
  31. Rodríguez, A.; Laio, A. Clustering by fast search and find of density peaks. Science 2014, 344, 1492–1496.
  32. Cuell, C.; Bonsal, B. An assessment of climatological synoptic typing by principal component analysis and kmeans clustering. Theor. Appl. Clim. 2009, 98, 361–373.
  33. Liu, Z.; Guo, Z.; Tan, M. Constructing Tumor Progression Pathways and Biomarker Discovery with Fuzzy Kernel Kmeans and DNA Methylation Data. Cancer Inform. 2008, 6, 1–7.
  34. Rashidi, F.; Nejatian, S.; Parvin, H.; Rezaie, V. Diversity based cluster weighting in cluster ensemble: An information theory approach. Artif. Intell. Rev. 2019, 52, 1341–1368.
  35. Deng, X.; Chen, J.; Li, H.; Han, P.; Yang, W. Log-cumulants of the finite mixture model and their application to statistical analysis of fully polarimetric UAVSAR data. Geo-Spat. Inf. Sci. 2018, 21, 45–55.
  36. Mojarad, M.; Parvin, H.; Nejatian, S.; Rezaie, V.; Rezaei, V. Consensus Function Based on Clusters Clustering and Iterative Fusion of Base Clusters. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 2019, 27, 97–120.
  37. Abbasi, S.-O.; Nejatian, S.; Parvin, H.; Rezaie, V.; Bagherifard, K. Clustering ensemble selection considering quality and diversity. Artif. Intell. Rev. 2019, 52, 1311–1340.
  38. Bidgoli, B.M.; Parvin, H.; Alinejad-Rokny, H.; Alizadeh, H.; Punch, W.F. Effects of resampling method and adaptation on clustering ensemble efficacy. Artif. Intell. Rev. 2014, 41, 27–48.
  39. Alizadeh, H.; Minaei-Bidgoli, B.; Parvin, H. Cluster ensemble selection based on a new cluster stability measure. Intell. Data Anal. 2014, 18, 389–408.
  40. Parvin, H.; Beigi, A.; Mozayani, N. A clustering ensemble learning method based on the ant colony clustering algorithm. Int. J. Appl. Comput. Math. 2012, 11, 286–302.
  41. Parvin, H.; Minaei-Bidgoli, B.; Alinejad-Rokny, H.; Punch, W.F. Data weighing mechanisms for clustering ensembles. Comput. Electr. Eng. 2013, 39, 1433–1450.
  42. Parvin, H.; Minaei-Bidgoli, B. A clustering ensemble framework based on elite selection of weighted clusters. Adv. Data Anal. Classif. 2013, 7, 181–208.
  43. Nazari, A.; Dehghan, A.; Nejatian, S.; Rezaie, V.; Parvin, H. A comprehensive study of clustering ensemble weighting based on cluster quality and diversity. Pattern Anal. Appl. 2019, 22, 133–145.
  44. Mojarad, M.; Nejatian, S.; Parvin, H.; Mohammadpoor, M. A fuzzy clustering ensemble based on cluster clustering and iterative Fusion of base clusters. Appl. Intell. 2019, 49, 2567–2581. [Google Scholar] [CrossRef]
  45. Bagherinia, A.; Bidgoli, B.M.; Hossinzadeh, M.; Parvin, H. Elite fuzzy clustering ensemble based on clustering diversity and quality measures. Appl. Intell. 2019, 49, 1724–1747. [Google Scholar] [CrossRef]
  46. Parvin, H.; Minaei-Bidgoli, B. A clustering ensemble framework based on selection of fuzzy weighted clusters in a locally adaptive clustering algorithm. Pattern Anal. Appl. 2015, 18, 87–112. [Google Scholar] [CrossRef]
  47. Zhao, W.; Yan, L.; Zhang, Y. Geometric-constrained multi-view image matching method based on semi-global optimization. Geo-Spat. Inf. Sci. 2018, 21, 115–126. [Google Scholar] [CrossRef] [Green Version]
  48. Peter, R.J.; Mathematics, A. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1999, 20, 53–65. [Google Scholar]
  49. Calinski, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Meth. 1974, 3, 1–27. [Google Scholar] [CrossRef]
  50. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 224–227. [Google Scholar] [CrossRef]
Figure 1. Violin plot of data from 4 universities.
Figure 2. Box plot of data from 4 universities.
Figure 3. Data distribution scatter plot of 4 universities.
Figure 4. The scatter plot matrix of S1.
Figure 5. The scatter plot matrix of S2.
Figure 6. The scatter plot matrix of S3.
Figure 7. The scatter plot matrix of S4.
Figure 8. Example and schematic: (A) distribution of random points and (B) the ρ and δ values of each point.
Figure 9. γ values of the schematic example, arranged in descending order.
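For readers who want to reproduce the quantities behind Figures 8–11, the following is a minimal sketch of the decision-graph computation used by CFSFDP-style methods: ρ is a local density, δ is the distance to the nearest denser point, and γ = ρ·δ, so that sorting γ in descending order yields plots like Figures 9 and 11. The Gaussian kernel, the cutoff distance dc, and all names here are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def cfsfdp_quantities(X, dc):
    """Sketch of the CFSFDP decision-graph quantities for points X (n x d)."""
    # Pairwise Euclidean distance matrix
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Local density rho with a Gaussian kernel and cutoff distance dc
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0  # subtract self-contribution
    # Delta: distance to the nearest point of higher density
    order = np.argsort(rho)[::-1]            # indices sorted by decreasing rho
    delta = np.empty(len(X))
    delta[order[0]] = d[order[0]].max()      # convention for the densest point
    for k in range(1, len(order)):
        i = order[k]
        delta[i] = d[i, order[:k]].min()     # only denser points are considered
    gamma = rho * delta                      # large gamma -> cluster-center candidate
    return rho, delta, gamma

# Example: rank points as center candidates (cf. Figures 9 and 11)
# rho, delta, gamma = cfsfdp_quantities(X, dc=0.5)
# candidates = np.argsort(gamma)[::-1]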
Figure 10. Density–distance decision graphs of the 4 universities: (a) S1; (b) S2; (c) S3; and (d) S4.
Figure 11. γ values.
Figure 12. The results graph of K-CFSFDP: (a) S1; (b) S2; (c) S3; and (d) S4.
Figure 13. Sum of squares due to error (SSE) curves of the K-Means algorithm for the 4 universities: (a) S1; (b) S2; (c) S3; and (d) S4.
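The SSE curves in Figure 13 correspond to what scikit-learn exposes as the inertia_ attribute (the within-cluster sum of squared errors). The snippet below is a generic elbow-method sketch, not the paper's code; k_max and seed are illustrative parameters.

```python
from sklearn.cluster import KMeans

def sse_curve(X, k_max=10, seed=0):
    """Within-cluster SSE of K-Means for k = 1..k_max (elbow-method curve)."""
    return [
        KMeans(n_clusters=k, random_state=seed, n_init=10).fit(X).inertia_
        for k in range(1, k_max + 1)
    ]
```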
Figure 14. The results graph of the K-Means cluster algorithm: (a) S1; (b) S2; (c) S3; and (d) S4.
Figure 15. The results graph of the CFSFDP algorithm: (a) S1; (b) S2; (c) S3; and (d) S4.
Figure 16. Clustering performances of the three models under different evaluation criteria on dataset S1: (a) higher values are better; (b) lower values are better.
Figure 17. Clustering performances of the three models under different evaluation criteria on dataset S2: (a) higher values are better; (b) lower values are better.
Figure 18. Clustering performances of the three models under different evaluation criteria on dataset S3: (a) higher values are better; (b) lower values are better.
Figure 19. Clustering performances of the three models under different evaluation criteria on dataset S4: (a) higher values are better; (b) lower values are better.
Table 1. Evaluation indexes of living habits.

| Index | Type | Note |
| --- | --- | --- |
| Regular diet | Numerical value | The number of days per month |
| Physical exercise | Numerical value | The number of days per month |
| Regular rest | Numerical value | The number of days per month |
| Normal consumption | Numerical value | The number of days per month |
Table 2. Evaluation indexes of learning performance.

| Index | Type | Note |
| --- | --- | --- |
| Average score | Numerical value | The average score per month |
| Attendance rate | Numerical value | The number of attendances per month |
| Study time | Numerical value | The hours of studying per month |
| Book reading | Numerical value | The number of books per month |
Table 3. Students’ data from 4 universities (partial data; each cell lists the student’s two feature values).

| Number | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| 1 | 6.07368, 5.55742 | 7.66723, 6.00785 | 7.55621, 7.50524 | 6.67336, 7.32494 |
| 2 | 6.14057, 5.66159 | 6.55764, 1.74391 | 8.17891, 7.65729 | 6.00577, 6.74123 |
| 3 | 7.22919, 5.99426 | 5.68571, 3.41523 | 8.11347, 8.21814 | 5.87413, 6.55677 |
| 4 | 6.12385, 5.81432 | 5.55817, 2.78966 | 7.60752, 8.22783 | 6.97421, 7.36017 |
| … | … | … | … | … |
| 1997 | 8.69287, 1.67829 | 6.11535, 3.34144 | 9.66132, 4.19352 | 4.55479, 5.19746 |
| 1998 | 7.48125, 2.52936 | 5.78211, 2.66147 | 7.88631, 4.75178 | 8.56978, 3.67127 |
| 1999 | 8.36386, 1.64275 | 5.66985, 2.40725 | 7.25812, 4.99851 | 3.33214, 4.87413 |
| 2000 | 8.45574, 4.23727 | 7.61459, 3.42789 | 7.33275, 4.63247 | 6.11267, 3.39741 |
Table 4. The cluster centers of K-Means and clustering by fast search and find of density peaks (K-CFSFDP); each cell gives the center’s two coordinates.

| Center | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| 1 | 6.03289, 5.74414 | 8.32161, 6.38682 | 5.16315, 6.04157 | 6.48546, 7.47637 |
| 2 | 8.01669, 3.18551 | 5.64966, 2.47307 | 5.33499, 2.75543 | 3.03581, 6.96716 |
| 3 | 4.15422, 7.85831 | 2.56964, 7.40917 | 2.28359, 3.47640 | 7.31236, 4.29811 |
| 4 | 8.21340, 7.33396 | 5.42544, 2.32415 | 4.40999, 4.06838 | 6.68811, 7.70098 |
| 5 | 8.54662, 1.66175 | 8.11425, 2.41816 | 7.64964, 7.92441 | 5.33691, 5.16572 |
| 6 | 3.37340, 5.61049 | 4.44462, 6.11261 | 3.62245, 5.93139 | 6.67577, 2.67338 |
| 7 | 1.66517, 3.49263 | 7.41792, 4.91789 | 7.59799, 4.57699 | 3.72421, 3.66076 |
Table 5. Cluster centers of the K-Means cluster algorithm; each cell gives the center’s two coordinates.

| Center | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| 1 | 8.27400, 2.39930 | 2.59326, 7.40172 | 4.46283, 4.08395 | 7.54576, 4.89052 |
| 2 | 4.13315, 7.68235 | 5.44932, 4.33228 | 5.46431, 2.86967 | 6.65832, 2.67032 |
| 3 | 7.77590, 7.66245 | 8.36084, 6.38125 | 7.68175, 4.50673 | 5.54568, 5.14083 |
| 4 | 1.71074, 3.49110 | 5.67678, 2.42658 | 7.68820, 7.87882 | 4.08892, 5.44782 |
| 5 | 3.37812, 5.62245 | 4.44752, 6.11293 | 3.49562, 5.95816 | 3.76442, 3.64857 |
| 6 | 6.06610, 5.73412 | 7.41603, 4.88391 | 2.30457, 3.41682 | 2.82903, 7.02875 |
| 7 | 8.03260, 3.21768 | 8.08883, 2.36514 | 5.15896, 6.10679 | 6.45891, 7.35572 |
Table 6. Silhouette coefficient (SC) and Calinski–Harabasz index (CHI) values of the three algorithms.

| University | SC, K-Means | SC, CFSFDP | SC, K-CFSFDP | CHI, K-Means | CHI, CFSFDP | CHI, K-CFSFDP |
| --- | --- | --- | --- | --- | --- | --- |
| S1 | 0.720163 | 0.773615 | 0.78262114 | 18654.66 | 17171.17 | 20351.42166 |
| S2 | 0.629851 | 0.653892 | 0.67294505 | 8548.02 | 77939.65 | 49330.545438 |
| S3 | 0.511018 | 0.551514 | 0.56356586 | 4042.81 | 94053.91 | 14316.641802 |
| S4 | 0.516445 | 0.526225 | 0.56055753 | 3389.29 | 33115.90 | 93572.915619 |
Table 7. Davies–Bouldin index (DBI) and sum of the squared errors (SSE) values of the three algorithms.

| University | DBI, K-Means | DBI, CFSFDP | DBI, K-CFSFDP | SSE, K-Means | SSE, CFSFDP | SSE, K-CFSFDP |
| --- | --- | --- | --- | --- | --- | --- |
| S1 | 0.47214 | 0.283085 | 0.27428561 | 168.6821 | 169.2103 | 168.5100431 |
| S2 | 0.573574 | 0.427371 | 0.41893048 | 159.0415 | 157.6216 | 158.5206654 |
| S3 | 0.723378 | 0.581062 | 0.56537985 | 149.0415 | 148.9562 | 142.2402269 |
| S4 | 0.751687 | 0.611766 | 0.56924453 | 165.5415 | 156.224 | 163.5586647 |
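To recompute SC, CHI, and DBI for a given clustering, all three internal indices are available in scikit-learn. A minimal sketch, assuming X is the feature matrix and labels holds the cluster assignments (SSE corresponds to the KMeans inertia_ shown earlier):

```python
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

def internal_indices(X, labels):
    """SC and CHI: higher is better; DBI: lower is better."""
    return {
        "SC": silhouette_score(X, labels),
        "CHI": calinski_harabasz_score(X, labels),
        "DBI": davies_bouldin_score(X, labels),
    }
```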
Table 8. Running time (s) of the three algorithms.

| Algorithm | S1 | S2 | S3 | S4 |
| --- | --- | --- | --- | --- |
| K-CFSFDP | 8.667326 | 7.011259 | 9.043949 | 8.7854173 |
| K-Means | 0.435833 | 1.091194 | 0.831946 | 0.823445 |
| CFSFDP | 12.740592 | 10.269540 | 13.293498 | 12.880990 |
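Running times such as those in Table 8 can be collected with a simple wall-clock wrapper; cluster_fn below is a hypothetical callable standing in for any of the three algorithms:

```python
import time

def timed_fit(cluster_fn, X):
    """Return the cluster labels and the wall-clock fitting time in seconds."""
    start = time.perf_counter()
    labels = cluster_fn(X)  # e.g. lambda X: KMeans(n_clusters=7).fit_predict(X)
    return labels, time.perf_counter() - start
```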
