#### *2.4. Principal Component Analysis*

Principal component analysis (PCA) is a linear dimensionality reduction technique that extracts information from a high-dimensional space by projecting the data onto a lower-dimensional subspace. It preserves the directions along which the data vary the most and discards those with little variation. It does this through a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on several numerical values) into a set of values of linearly uncorrelated variables called principal components. In short, what the algorithm does is [5]:


With PCA, the following can be obtained:


In summary, PCA dimensionality reduction removes the least important attribute information and keeps only the components with the highest variance, so that the resulting data retain as much of the original variance as possible. For this reason, although PCA is mainly used to reduce the dimensionality of data, it may also be useful as a visualization tool, for noise filtering, and for feature extraction.
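The procedure described above can be illustrated with a minimal sketch. It assumes the usual covariance-based formulation of PCA (centering, eigendecomposition of the covariance matrix, projection onto the leading eigenvectors); the function name `pca` and the toy data are hypothetical and are not taken from the study.

```python
import numpy as np

def pca(X, n_components):
    """Project X (samples x variables) onto its leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder by decreasing variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs[:, :n_components]  # coordinates in the reduced space
    explained = eigvals / eigvals.sum()      # fraction of variance per component
    return scores, explained

# Hypothetical toy data: two strongly correlated variables plus a small noisy one.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t,
                     2 * t + 0.1 * rng.normal(size=200),
                     0.1 * rng.normal(size=200)])

scores, explained = pca(X, n_components=2)
print(explained.round(3))  # variance share of each principal component
```

In this toy case almost all the variance falls on the first component, and the resulting component scores are uncorrelated with each other, which is the property the text refers to.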

#### *2.5. Determine the Number of Clusters and Evaluate Clustering Performance: Silhouette Coefficient*

The silhouette coefficient, or silhouette score, is a cluster validity measure for evaluating clustering performance. To calculate the silhouette score for each observation (data point), the following distances need to be computed for each observation with respect to all the clusters.


The silhouette score, *s*(*i*), for each sample is calculated using the following formula:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{9}$$

Silhouette coefficient values range from −1 to 1. Values near +1 indicate that the clusters are well separated and clearly distinguished. A value of 0 indicates that the clusters are not clearly distinguishable, that is, the distance between clusters is not significant, and negative values indicate that the clustering configuration may have too many or too few clusters. Since the silhouette coefficient measures the separation between the resulting clusters, it can be used to select the number of clusters for a clustering technique.
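As an illustration of Equation (9), the following sketch computes *s*(*i*) for a handful of points. It assumes the standard definitions, namely that *a*(*i*) is the mean distance from sample *i* to the other members of its own cluster and *b*(*i*) is the smallest mean distance from sample *i* to the members of any other cluster, and it uses Euclidean distance. The function name and the four points are hypothetical.

```python
from math import dist

def silhouette_samples(points, labels):
    """Return s(i) for every point; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-member clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum)
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical example: two compact groups that are far apart
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
labels = [0, 0, 1, 1]
print([round(s, 2) for s in silhouette_samples(points, labels)])
# prints [0.9, 0.9, 0.9, 0.9]
```

If the same points are labeled so that each cluster mixes both groups (for instance `[0, 1, 0, 1]`), *a*(*i*) exceeds *b*(*i*) and every score becomes negative.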

## **3. Materials & Methods**

#### *3.1. Comparative Study of Mexican Universities*

The Comparative Study of Mexican Universities (CSMU) [17] is a research project developed by the General Directorate of Institutional Evaluation of the National Autonomous University of Mexico (UNAM). It systematizes, analyzes, and disseminates statistical series, compiled from official sources and recognized databases, which make it possible to compare the development of Mexican universities in their substantive functions: higher learning, research, and dissemination of culture.

The CSMU is not a hierarchical classification (ranking) of Mexican higher education institutions; rather, it is presented as an alternative to the existing rankings. Its objective is not to rate the universities or to build regulations under certain assumptions about the quality or prestige of the institutions and their programs. Instead, it seeks to provide items of information from publicly accessible sources: objective data that cover both the characteristics of the institutions and the substantive functions of university activities.

In this sense, the CSMU favors the presentation of raw data without the use of groupings or weightings, because such practices tend to lead to the results being questioned. These characteristics of the CSMU leave users responsible for establishing the comparisons and relationships that may exist among the different information items, or for building indicators based on their own needs and analysis perspectives. Likewise, users are responsible for adapting their interpretations to the differences that exist among Mexican universities [18].

The CSMU data in this study include 60 Mexican universities (45 public and 15 private) but the UNAM from 2009 to 2017. These universities concentrate more than 50 percent of Mexico's higher education enrollment. The database provides information on the following items:


The results of the study for each of these nine items are published on a dynamic web page with systematized information, which can be consulted through the Data Explorer of the Comparative Study of Mexican Universities.

#### *3.2. Application Instance: 60 ExECUM Universities*

The ExECUM database was split into two independent databases, one for higher learning and one for research. The information contained in the higher learning database is as follows:

Teachers instructing


Number of graduated students.

• Level: Bachelor's degree, specialty, master's degree, doctorate.

Academic programs offered.

• Level: Bachelor's degree, specialty, master's degree, doctorate.

On the other hand, the information contained in the research database is described below:

SNI researchers

• Researchers: Candidate, level I, level II, level III

PROMEP academic bodies


PNPC postgraduates


• Specialty: International competence, consolidated, developing, newly created

#### *3.3. Proposed Matrix Model*

Among the many activities that take place within a university (management, dissemination of culture, sports activities, among others), the most important are the training of undergraduate and graduate students and research. Only these two areas were taken into consideration for the proposed model. For higher learning, the available data were used, such as the number of full-time or part-time teachers, maximum degree of studies, number of enrolled students, number of graduated students, and academic programs offered. For research, the number of research articles in different international indexes (JCR, ISI, Scopus, Latindex, Zentralblatt MATH, among others) can be considered, as well as the number of patents generated or citations in international journals.

Universities can be classified using the clustering strategies previously described and historical data. From the available data it is possible to order the resulting clusters from highest to lowest; for example, by the distance of each centroid from the origin, where the centroids closest to the origin imply lower performance (fewer graduate students or fewer research articles generated).
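The ordering by centroid distance can be sketched as follows. The example assumes that all variables are non-negative and on comparable scales, so that the Euclidean norm of a centroid is a meaningful measure of performance; the function name and the centroid values are hypothetical and are not taken from the CSMU data.

```python
from math import sqrt

def rank_clusters(centroids):
    """Map each cluster id to a rank, where 0 is the centroid closest to the origin."""
    norms = {cid: sqrt(sum(x * x for x in c)) for cid, c in centroids.items()}
    ordered = sorted(norms, key=norms.get)  # increasing distance from the origin
    return {cid: rank for rank, cid in enumerate(ordered)}

# Hypothetical centroids, e.g. (graduated students, academic programs)
centroids = {0: (120.0, 35.0), 1: (15.0, 4.0), 2: (60.0, 12.0)}
ranks = rank_clusters(centroids)
print(ranks)  # prints {1: 0, 2: 1, 0: 2}
```

Here cluster 1 has the centroid nearest the origin and is therefore read as the lowest-performing group, while cluster 0 is the highest.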

Considering the dimensions already described, a matrix can be structured in which the classification according to higher learning is shown on the vertical axis and the classification according to research on the horizontal axis (see Figure 1).

**Figure 1.** Graphic illustration of the proposed matrix model.

This model is divided into four classification quadrants. The first quadrant contains static institutions, that is, those with minor higher learning and minor research. The second quadrant contains institutions consolidated in higher learning, that is, those with major higher learning and minor research. The third quadrant contains institutions consolidated in research, that is, those with major research and minor higher learning. Finally, the fourth quadrant contains the institutions of excellence, that is, the universities with the best results in both higher learning and research.
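The four quadrants amount to a lookup on two binary outcomes. The sketch below assumes that each dimension has already been reduced to a minor/major label (for example, by clustering each database into two groups); the names `QUADRANTS` and `locate` are hypothetical.

```python
# Keys are (major higher learning?, major research?)
QUADRANTS = {
    (False, False): (1, "static"),
    (True, False): (2, "consolidated in higher learning"),
    (False, True): (3, "consolidated in research"),
    (True, True): (4, "excellence"),
}

def locate(higher_learning_major, research_major):
    """Return (quadrant number, description) for one institution."""
    return QUADRANTS[(bool(higher_learning_major), bool(research_major))]

print(locate(True, False))  # prints (2, 'consolidated in higher learning')
```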

As mentioned above, in order to locate the institutions, the original database was divided into two parts: part 1 corresponding to higher learning and part 2 corresponding to research. Each part was processed separately using the aforementioned clustering algorithms. In this way, the cases in Table 1 arise for each clustering technique:


**Table 1.** Decision table used to locate an institution.

Using the matrix in Figure 1, arrows are used to show the transitions of institutions among the quadrants, indicating above each arrow the year in which the transition occurred; the highlighted institutions, on the other hand, are those that remained in the same group throughout the study.
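The transitions drawn as arrows can be derived from the yearly quadrant assignments. The following sketch assumes that each institution has one quadrant number per year; the institution names and histories are hypothetical and only illustrate the bookkeeping.

```python
def quadrant_transitions(history):
    """history maps year -> quadrant; return (year, from, to) for each change."""
    years = sorted(history)
    return [(year, history[prev], history[year])
            for prev, year in zip(years, years[1:])
            if history[prev] != history[year]]

# Hypothetical yearly quadrant assignments
histories = {
    "University A": {2009: 1, 2010: 1, 2011: 2, 2012: 2},
    "University B": {2009: 4, 2010: 4, 2011: 4, 2012: 4},
}

for name, history in histories.items():
    moves = quadrant_transitions(history)
    print(name, moves if moves else "remained in the same quadrant")
```

An institution with an empty transition list corresponds to a highlighted institution in the matrix, and each tuple corresponds to one arrow labeled with its year.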

Regarding evaluation, two different types of results were obtained: those that include the UNAM and those that do not. Its presence unbalances the instances, since this institution is far from the others in terms of size and, hence, in higher learning and research capacity, so that the distances among the remaining institutions appear shortened.

To demonstrate the above, PCA was first applied to the higher learning and research databases, and it was then analyzed how many dimensions are necessary to retain the largest possible variance of both databases. The results of PCA on the higher learning database for the years 2009 to 2017 are shown in Figure 2:

**Figure 2.** PCA analysis per component from 2009 to 2017 applied to higher learning database.

As can be seen in Figure 2, a single component represents 85% of the variance of the higher learning database. A similar PCA of the research database is shown in Figure 3:

**Figure 3.** PCA analysis per component from 2009 to 2017 applied to research database.

Figure 3 shows that a single component represents around 81% of the variance of the research database. Based on the results in Figures 2 and 3, the graph in Figure 4 was created. It might seem remarkable that the sum of the variance in this graph exceeds 100%. This is because, as noted in previous paragraphs, the CSMU database was separated into two databases, corresponding to higher learning and research. All the dimensions of the higher learning database were reduced to a single principal component, which is projected on the ordinate axis, and all the dimensions of the research database were reduced to a single principal component, which is projected on the abscissa axis. The graph therefore retains 85% and 81% of the variance of the respective databases; since these percentages refer to different databases, their sum can exceed 100%.

**Figure 4.** Average PCA from 2009 to 2017 applied to higher learning and research with UNAM.
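The construction of such a two-axis graph can be sketched as follows. The example assumes that PCA is applied independently to each database and that only the first component of each is kept; the data are synthetic, driven by a hypothetical latent "size" variable, and the function name `first_component` is illustrative.

```python
import numpy as np

def first_component(X):
    """Return the scores on the first principal component and its variance share."""
    Xc = np.asarray(X, dtype=float)
    Xc = Xc - Xc.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    share = S[0] ** 2 / np.sum(S ** 2)
    return Xc @ Vt[0], share  # the sign of the component is arbitrary

# Synthetic stand-ins for the two databases (30 institutions)
rng = np.random.default_rng(1)
size = rng.uniform(1, 10, size=30)
higher_learning = np.column_stack(
    [size * k + rng.normal(scale=0.5, size=30) for k in (3, 5, 2)])
research = np.column_stack(
    [size * k + rng.normal(scale=0.5, size=30) for k in (1, 4)])

y, share_hl = first_component(higher_learning)  # ordinate axis
x, share_r = first_component(research)          # abscissa axis
coords = np.column_stack([x, y])                # one point per institution

print(f"higher learning: {share_hl:.0%}, research: {share_r:.0%}")
```

Each share is a fraction of the variance of its own database, so the two values are not parts of a common total and nothing constrains their sum to 100%.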

As can be seen in Figure 4, UNAM is far away from the other institutions, which causes all of them to appear as a single group; however, once UNAM is removed, as can be seen in Figure 5, a separation among the institutions becomes clear. At first glance, IPN, UAM, and CINVESTAV appear to be the best institutions in research, whereas IPN and UdeG are the best universities in higher learning. It should be mentioned that these graphs represent only a certain percentage of the total available information: the research component retains 81% of the variance of the research database, while the higher learning component retains 85% of the variance of the higher learning database.

**Figure 5.** Average PCA from 2009 to 2017 applied to higher learning and research without UNAM.

After setting UNAM aside from the instance, the number of clusters was determined using the silhouette coefficient method, applied to both the research and higher learning instances. The comparisons for the three clustering techniques are shown in Figures 6 and 7.

**Figure 6.** Silhouette coefficient comparison for the three clustering techniques applied to the research database.

**Figure 7.** Silhouette coefficient comparison for the three clustering techniques applied to the higher learning database.

The silhouette coefficient results in Figures 6 and 7 show that the best number of clusters for classifying the instances is 2. Moreover, this analysis is helpful for evaluating clustering performance, where higher values mean better performance.
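Selecting the number of clusters in this way can be sketched as a loop over candidate values of *k*. The example below uses a plain k-means (Lloyd's algorithm) only as a stand-in for the three clustering techniques compared in Figures 6 and 7, which are not reproduced here; the synthetic two-group data, the function names, and the candidate range 2 to 5 are assumptions made for illustration.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # keep the old center if a cluster happens to lose all its points
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def mean_silhouette(X, labels):
    """Average of s(i) over all samples (Euclidean distance)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    ids = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # single-member cluster: s(i) = 0 by convention
        a = D[i, same].sum() / (n_same - 1)
        b = min(D[i, labels == c].mean() for c in ids if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()

# Synthetic data with two well-separated groups
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(40, 2)),
               rng.normal(loc=10.0, scale=0.5, size=(40, 2))])

scores = {k: mean_silhouette(X, kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, {k: round(float(v), 3) for k, v in scores.items()})
```

For data with two clear groups, larger values of *k* split a compact group into neighboring pieces, which lowers *b*(*i*) for the affected samples and hence the mean silhouette, so the highest score is reached at *k* = 2.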

## **4. Results and Analysis**
