1. Introduction
Academic performance is crucial for evaluating the level of universities. In the mainstream university rankings, the academic performance of a university is usually quantified by various statistical indicators, e.g., the number of published papers, the amount of research funding, and so on. Our previous work [1] studied the effects of different academic indicators and proposed a new method for evaluating the academic level of universities based on statistical manifolds. In addition, we have conducted studies on the academic growth potential of individuals [2]. During our research, we noticed that although there has been quite a lot of work on the design of evaluation criteria for rating the academic level within a given period [3,4,5,6], much less attention has been paid to the analysis of academic growth potential. In other words, previous work focused on comparing the academic level of different universities but did not examine how the academic level of a single university develops over time. In fact, academic growth potential can serve as a basis for policy making as well as an additional reference for university evaluation, just as trend analysis does in finance, energy, and other industries. Academic development can be represented by the variation trend of specified statistical indicators, which is the main research object of this article.
The study of variation patterns of university academic indicators is a typical problem of short time series data analysis. Time series data is a group of sequential data points sampled from a continuous process over time. The analysis of time series data, especially short series, has been considered one of the most challenging problems in the area of data mining [7]. The first challenge is that one cannot be certain that a piece of time series data contains enough information to fully describe the real-world process. That is why it is thought that financial markets cannot be predicted [8]. Second, time series data is often nonstationary, which means that its statistics, such as the mean and variance, change over time. This requires extra techniques or input data to solve the problem correctly. Moreover, as a sampling of real-world processes, time series data inevitably contains much noise and often has high dimensionality. All of these add to the difficulty of time series analysis. University academic indicators are usually recorded every year, but such records do not have a long history, and hence the available data is still limited. This may explain why there is hardly any related research.
Being challenging yet promising, research on approaches for time series data analysis has been active for decades [9]. Traditional approaches mainly focus on fitting the time series data to known models, such as the linear dynamical model [10], regressive model [11], hidden Markov model [12], and ARIMA model [13]. With the development of computing power and neural network theory, methods based on deep learning are now popular and obtain state-of-the-art results in various tasks [14,15]. Our previous work has also obtained satisfactory prediction models based on deep learning [16]. Unfortunately, neither traditional nor modern methods achieve satisfactory results on short time series data. Traditional methods cannot give correct results when the data contains much noise, which is common for time series data. Furthermore, deep learning methods require data of sufficient length to extract features; otherwise, they perform even worse than purely statistical methods [17].
As an emerging area for complicated data processing, topological data analysis (TDA) lies at the intersection of mathematics and computer science and has been used in biology [18,19], robotics [20,21], finance [22,23], etc. In recent years, TDA for time series data analysis has been growing quickly, and one of the promising methods is persistent homology. By applying persistent homology to point clouds, persistence diagrams can be produced, from which a rich set of features can be extracted. Previous work has demonstrated the potential of persistent homology in extracting features for time series [24], yet, to the best of our knowledge, no research on short time series has been published.
To address the problem of university academic indicator prediction, this paper proposes to use TDA, more precisely persistent homology, as the feature extractor to reveal the variation patterns of the time series. A support vector machine (SVM) is then used as a classifier to judge the variation trend of the indicators. The simulation results show an advantage over the classic Markov chain method, which indicates the effectiveness of persistent homology in processing short time series data and capturing variation features. Moreover, by applying the model, we predict the academic indicators of the top universities in mainland China, which could serve as a reference for other research on academic evaluation.
The paper is organized as follows. In Section 2, we introduce the mathematical basis of TDA, including simplexes and the idea of persistent homology. We also describe our data processing strategies and perform the necessary validation from the statistical perspective. In Section 3, we first give an overview of the Markov chain, and then perform simulations and give results as the baseline of prediction. In Section 4, we briefly review previous work applying TDA and then describe the simulation and results of using persistent homology.
2. Preliminary
Topological data analysis (TDA) is an emerging and rapidly developing field that provides a set of new topological and geometric tools to infer relevant features of potentially complex data. In this section, we briefly introduce some mathematical foundations of TDA and data preprocessing.
2.1. Simplicial Homology
We first introduce the concept of simplicial homology, which is the basis of persistent homology.
The natural domain of definition for simplicial homology is a class of spaces we call $\Delta$-complexes, which are a mild generalization of the more classical notion of a simplicial complex [25].
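Definition 1. A $\Delta$-complex structure on a space $X$ is a collection of maps $\sigma_\alpha : \Delta^n \to X$, with $n$ depending on the index $\alpha$, where $\Delta^n$ is the standard $n$-simplex, such that
- (i) the restriction $\sigma_\alpha|_{\mathring{\Delta}^n}$ is injective, and each point of $X$ is in the image of exactly one such restriction $\sigma_\alpha|_{\mathring{\Delta}^n}$, where the open simplex $\mathring{\Delta}^n$ is $\Delta^n - \partial\Delta^n$, the interior of $\Delta^n$;
- (ii) each restriction of $\sigma_\alpha$ to a face of $\Delta^n$ is one of the maps $\sigma_\beta : \Delta^{n-1} \to X$. Here, we are identifying the face of $\Delta^n$ with $\Delta^{n-1}$ by the canonical linear homeomorphism between them that preserves the ordering of the vertices; and
- (iii) a set $A \subset X$ is open iff $\sigma_\alpha^{-1}(A)$ is open in $\Delta^n$ for each $\sigma_\alpha$.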
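Definition 2. The simplicial chain group of $X$ is defined as
$$\Delta_n(X) = \Big\{ \sum_\alpha n_\alpha \sigma_\alpha \;\Big|\; n_\alpha \in \mathbb{Z},\ \sigma_\alpha : \Delta^n \to X \Big\},$$
where the coefficients $n_\alpha$ are almost all zero.
Definition 3. Define the chain map (boundary homomorphism)
$$\partial_n : \Delta_n(X) \to \Delta_{n-1}(X)$$
on each generator $\sigma_\alpha$ such that
$$\partial_n(\sigma_\alpha) = \sum_{i=0}^{n} (-1)^i \, \sigma_\alpha|_{[v_0, \ldots, \hat{v}_i, \ldots, v_n]} \quad \text{and} \quad \partial_n\Big(\sum_\alpha n_\alpha \sigma_\alpha\Big) = \sum_\alpha n_\alpha \, \partial_n(\sigma_\alpha),$$
where the hat symbol over $v_i$ indicates that this vertex is deleted from the sequence $v_0, \ldots, v_n$.
Remark 1. By direct calculation, we can see that $\partial_{n-1} \circ \partial_n = 0$.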
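As a quick check of the identity in Remark 1 in the lowest nontrivial case, consider a single 2-simplex with ordered vertices $v_0, v_1, v_2$. The notation here is an assumption made for illustration: $[v_0, \ldots, v_k]$ denotes the simplex spanned by the listed vertices, and $\partial_n$ denotes the boundary homomorphism given by the alternating sum of faces. Applying the boundary twice gives zero because every vertex appears once with each sign:

```latex
\begin{aligned}
\partial_2 [v_0, v_1, v_2] &= [v_1, v_2] - [v_0, v_2] + [v_0, v_1], \\
\partial_1 \partial_2 [v_0, v_1, v_2] &= (v_2 - v_1) - (v_2 - v_0) + (v_1 - v_0) = 0.
\end{aligned}
```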
With the above preparations, we can give the definition of the simplicial homology group of X.
Definition 4. The $n$-th simplicial homology group of $X$ is defined as
$$H_n^\Delta(X) = \operatorname{Ker} \partial_n / \operatorname{Im} \partial_{n+1}.$$
The dimension of $H_n^\Delta(X)$ is called the $n$-th Betti number. Simplicial homology groups and Betti numbers are topological invariants. A Betti number represents a topological property of a topological space. For instance, the zeroth Betti number counts the connected components, the first Betti number counts the holes (loops), and the second Betti number counts the voids.
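To make the meaning of the Betti numbers concrete, the following is a minimal sketch that computes them for a small simplicial complex from the ranks of its boundary matrices. It is an illustration only, not the implementation used in this paper: the function names are hypothetical, and coefficients are taken in the field $\mathbb{Z}/2$ (an assumption that avoids signs and gives the usual Betti numbers for the torsion-free examples shown).

```python
from itertools import combinations


def boundary_rows(k_simplices, faces):
    """One bitmask per k-simplex, marking which (k-1)-faces bound it (mod 2)."""
    index = {face: i for i, face in enumerate(faces)}
    rows = []
    for simplex in k_simplices:
        mask = 0
        for face in combinations(simplex, len(simplex) - 1):
            mask |= 1 << index[face]
        rows.append(mask)
    return rows


def rank_gf2(rows):
    """Rank over Z/2 of a 0/1 matrix whose rows are integer bitmasks."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit of the pivot row
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank


def betti_numbers(simplices_by_dim):
    """simplices_by_dim[k] lists the k-simplices as sorted vertex tuples."""
    top = len(simplices_by_dim) - 1
    ranks = [0] * (top + 2)  # ranks[k] = rank of the k-th boundary map
    for k in range(1, top + 1):
        rows = boundary_rows(simplices_by_dim[k], simplices_by_dim[k - 1])
        ranks[k] = rank_gf2(rows)
    # b_k = dim C_k - rank(boundary_k) - rank(boundary_{k+1})
    return [len(simplices_by_dim[k]) - ranks[k] - ranks[k + 1]
            for k in range(top + 1)]


vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow_triangle = [vertices, edges]
filled_triangle = [vertices, edges, [(0, 1, 2)]]

print(betti_numbers(hollow_triangle))  # one component, one loop
print(betti_numbers(filled_triangle))  # one component, loop filled in
```

The hollow triangle has one connected component and one loop; adding the 2-simplex fills the loop, so the first Betti number drops to zero.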
2.2. Persistent Homology
Persistent homology is a method in TDA that can efficiently study the topological features of simplicial complexes and topological spaces. It lets us leave our data in the original high-dimensional space and tells us how many clusters and how many looplike structures there are in the data, all without our having to visualize it. The idea of persistent homology is to observe how the simplicial homology changes along a given filtration [26,27].
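Definition 5. Given dimension $n$, if there is an inclusion map $i$ of one topological space $X$ to another $Y$, then it induces an inclusion map on the $n$-dimensional simplicial chain groups
$$i_\# : \Delta_n(X) \to \Delta_n(Y).$$
Furthermore, this extends to a homomorphism on simplicial homology groups
$$i_* : H_n(X) \to H_n(Y),$$
where $i_*$ sends $[c] \in H_n(X)$ to the class $[i_\#(c)]$ in $H_n(Y)$.
Definition 6. A filtration of a simplicial complex $K$ is a nested family of subcomplexes $(K_r)_{r \in T}$, where $T \subseteq \mathbb{R}$, such that for any $r, r' \in T$, if $r \le r'$ then $K_r \subseteq K_{r'}$, and $K = \bigcup_{r \in T} K_r$. The subset $T$ may be either finite or infinite. More generally, a filtration of a topological space $M$ is a nested family of subspaces $(M_r)_{r \in T}$, where $T \subseteq \mathbb{R}$, such that for any $r, r' \in T$, if $r \le r'$, then $M_r \subseteq M_{r'}$ and $M = \bigcup_{r \in T} M_r$.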
Applying persistent homology to a point cloud P involves the following steps.
Step 1: Convert point cloud P to a topological space.
Here, we use the Vietoris–Rips (VR) complex. For a given scale $r \ge 0$ and metric $d$ on $P$, the VR complex $\mathrm{VR}_r(P)$ is the topological space consisting of the simplexes of all dimensions, with vertices in $P$, whose maximum distance among vertices is less than or equal to $r$.
Step 2: Construct a filtration of topological spaces.
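A filtration $K_1 \subseteq K_2 \subseteq \cdots \subseteq K_m$ induces a sequence of homomorphisms on the simplicial homology groups
$$H_n(K_1) \to H_n(K_2) \to \cdots \to H_n(K_m).$$
A class $[c] \in H_n(K_i)$ is said to be born at $i$ if it is not in the image of $H_n(K_{i-1}) \to H_n(K_i)$. The same class dies at $j$ if $[c] \neq 0$ in $H_n(K_{j-1})$, but $[c] = 0$ in $H_n(K_j)$.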
Step 3: Obtain the resulting information.
Given a filtration $\mathrm{Filt} = (F_r)_{r \in T}$ of a topological space, the homology of $F_r$ changes as $r$ increases. New connected components can appear, existing components can merge, loops and cavities can appear or be filled, etc. Persistent homology tracks these changes, identifies the appearing features, and associates a lifetime to them. We mark a point in $\mathbb{R}^2$ at $(i, j)$ if one class is born at $i$ and dies at $j$. Hence, we can obtain a persistence diagram $D$ as the collection of these off-diagonal points. Figure 1 is an example of a persistence diagram.
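As a concrete illustration of Step 1, the sketch below enumerates the simplexes of a VR complex at a single scale, using the Euclidean metric. The function name `vietoris_rips`, the scale argument `r`, and the cap `max_dim` are assumptions made for this example; the brute-force enumeration is only practical for small point clouds and is not the implementation used in this paper.

```python
from itertools import combinations
from math import dist


def vietoris_rips(points, r, max_dim=2):
    """Return all simplexes (as vertex-index tuples) up to dimension max_dim
    whose vertices are pairwise within Euclidean distance r of each other."""
    simplexes = []
    for k in range(max_dim + 1):
        for verts in combinations(range(len(points)), k + 1):
            # A vertex set spans a simplex iff its diameter is at most r.
            if all(dist(points[i], points[j]) <= r
                   for i, j in combinations(verts, 2)):
                simplexes.append(verts)
    return simplexes


# Corners of a unit square: sides have length 1, diagonals about 1.414.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

small = vietoris_rips(square, r=1.0)   # 4 vertices + 4 sides, no diagonals
large = vietoris_rips(square, r=1.5)   # all 6 edges and all 4 triangles
print(len(small), len(large))
```

Increasing `r` only ever adds simplexes, so evaluating the function on an increasing sequence of scales yields a nested family of complexes, which is exactly the kind of filtration required in Step 2.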
The lifetime or barcode of a point $(i, j)$ in $D$ is given by $j - i$. The collection of all barcodes is called persistence. The persistence of a dataset contains important topological information about its intrinsic space. In one persistence, long barcodes are interpreted as true topological features of the intrinsic space, whereas short barcodes are interpreted as topological noise. The quantitative discussion of length can be found in [28].
More details on persistent homology can be found in reference [29].
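The birth and death bookkeeping described above can be sketched in the simplest case, dimension 0, where classes are connected components. In a VR filtration every point is born at scale 0, and a component dies at the length of the edge that first merges it into another component. The code below is a minimal union-find sketch under these assumptions; the function name `h0_persistence` is hypothetical, and practical work would normally rely on a dedicated TDA library.

```python
from itertools import combinations
from math import dist, inf


def h0_persistence(points):
    """Return (birth, death) pairs of connected components in the
    Vietoris-Rips filtration of a Euclidean point cloud."""
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the component representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Edges enter the filtration in order of increasing length.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    pairs = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j      # two components merge:
            pairs.append((0.0, length))  # one of them dies at this scale
    pairs.append((0.0, inf))             # the last component never dies
    return pairs


cloud = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
diagram = h0_persistence(cloud)
lifetimes = [death - birth for birth, death in diagram]
print(diagram)
print(lifetimes)
```

In this example the two nearby points merge at scale 1 (a short barcode), while the distant point stays separate until scale 4 (a longer barcode), which matches the reading of long barcodes as the more significant features.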
2.3. Data Description and Preprocessing
The data used in this paper is provided by the CNKI analysis platform of Chinese university academic achievements [30]. We select the top 50 Chinese mainland universities in terms of scientific research funding in 2021. The names and abbreviations of the 50 universities are listed in Table 1. For each university, we collect six types of academic indicators from 2010 to 2019, i.e., the numbers of published SCI and SSCI papers, the number of state-level funds, the amount of National Natural Science funds, and the numbers of applied-for and authorized patents. We choose these indicators because they are strictly produced and recorded once a year, and they can comprehensively represent the academic level of universities.
An important issue for conventional time series data analysis is the validation of stationarity. A stationary time series is one whose unconditional joint probability distribution does not change over time. Stationarity validation is necessary because many statistical models assume that the time series data is stationary, and analysis of nonstationary time series data could result in spurious regression, i.e., an apparent statistical relationship that does not reflect any real relationship between the time series and the predicted trend.
One of the popular approaches for stationarity validation is the unit root test (URT) [31]. The null hypothesis of the URT is that a unit root exists, i.e., the time series is nonstationary. We choose the augmented Dickey–Fuller (ADF) test, which is one of the most widely used methods for the URT, to validate the stationarity of our data, i.e., the six categories of academic indicators from 2010 to 2019 of the 50 universities. The implementation is provided by the Python API statsmodels.tsa.stattools.adfuller. The API reads the time series data and returns, among other outputs, the p-value, i.e., the probability under the null hypothesis of the URT of obtaining a test statistic at least as extreme as the one observed; a small p-value is therefore evidence against nonstationarity. The result of the ADF test on the original data is displayed in Figure 2. We can see that most of the samples have a p-value too large to reject the null hypothesis; hence, we cannot directly use the raw data for analysis.
To address the problem of nonstationarity, we propose to convert each time series into its chain indexes, which is a technique commonly used in economics [32]. The $n$-th chain index $c_n$ is defined as $c_n = x_n / x_{n-1}$, in which $x_n$ is the $n$-th raw data point. An example is given in Table 2.
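The conversion can be sketched in a few lines. This example assumes the usual economics convention that a chain index is the ratio of each value to the value of the preceding period; the function name `chain_indexes` and the sample values are hypothetical and are not taken from Table 2.

```python
def chain_indexes(values):
    """Ratio of each data point to its predecessor.

    A series of n points yields n - 1 chain indexes, because the first
    point has no predecessor to be compared with.
    """
    if any(v == 0 for v in values[:-1]):
        raise ValueError("a zero value cannot serve as the base of a ratio")
    return [current / previous
            for previous, current in zip(values, values[1:])]


# Hypothetical yearly counts of an indicator (illustration only).
raw = [100, 120, 90, 90]
indexes = chain_indexes(raw)
print(indexes)
```

A value above 1 means the indicator grew relative to the previous year, a value below 1 means it shrank, and a value of exactly 1 means no change. Working with ratios removes the overall level of the series, which is the reason such a transformation can help with nonstationarity.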
For our data, every time series sequence contains 10 points. We calculate the chain indexes for each sequence separately and then perform the ADF test on each chain index sequence. The result is shown in Figure 3. We can see that the processed data mostly meets the stationarity requirement of time series analysis; only about 30 samples have a p-value greater than 0.1, and these are excluded to ensure that the whole dataset is stationary.
5. Conclusions and Future Work
Since the prediction of university academic indicator variation trends has hardly been studied, this paper proposes to obtain time series patterns by using persistent homology. We use classic TDA pipeline methods to extract features from raw data and an SVM to make predictions. The results show that TDA methods have a clear advantage over the conventional statistical Markov chain method in terms of accuracy and F1-score, which indicates that TDA methods can capture the variation patterns well. Our work demonstrates the potential of persistent homology in the field of short time series data analysis. The prediction results also provide a new perspective for evaluating the development of the academic performance of universities. Compared to previous work based on conventional statistical and bibliometric methods [47], our work rests on a solid mathematical foundation, and thus can avoid the subjective influence introduced by researchers and can be applied to a wider range of related indicator evaluations.
In the future, we would like to conduct further research on the combination of TDA methods and deep learning. It is also important to address the problem of fitting sequences of unequal length to persistent homology methods, as in practice the time series data at a specific point can be missing, and the existing TDA methods require sequences of equal length on which to perform transitions. Such work would play a significant role in the practical application of TDA methods. In addition, more studies can be carried out to reveal the relationships between the development of a university and its subject background as well as many other factors. Designing evaluation methods that combine existing rating systems with the growth potential of universities is also a major challenge. In brief, research on quantitative university evaluation still has a long way to go.