*Article* **Data Mining of Students' Consumption Behaviour Pattern Based on Self-Attention Graph Neural Network**

**Fangyao Xu † and Shaojie Qu \***

Beijing Institute of Technology, Beijing 100081, China; 1120180084@bit.edu.cn

**\*** Correspondence: qushaojie@bit.edu.cn

† Current address: No. 8, Liangxiang East Road, Fangshan District, Beijing 102488, China.

**Abstract:** Performance prediction is of significant importance. Previous mining of behavioural data has been limited to machine learning models, and has made poor use of the information on spatial location changes over time, as well as of discriminative behavioural patterns and behavioural tendencies. We therefore establish students' behaviour networks, combining temporal and spatial information to mine behavioural patterns that discriminate academic performance and to predict students' performance. Firstly, we put forward principles for building graphs with a topological structure from consumption data; secondly, we propose an improved self-attention mechanism model; thirdly, we perform classification tasks on academic performance and identify discriminative learning and life behaviour sequence patterns. The accuracy of the two-category classification reached 84.86% and that of the three-category classification 79.43%. In addition, students with good academic performance were observed to study in the classroom or library after lunch and dinner. Apart from returning to the dormitory in the evening, they tended to stay focused in the library and other learning venues during the day. Lastly, different nodes contribute differently to the prediction, which provides an approach for feature selection. Our findings provide a method for tracing students' campus activity.

**Keywords:** self-attention mechanism; graph neural network; data mining; behaviour sequence pattern; behaviour network

#### **1. Introduction**

Methods to improve education quality by mining off-line education and on-line learning platform data [1] have led to the development of educational data mining (EDM) [2]. Among the several problems in the field of EDM, predicting students' scholastic performance is a key issue [3,4], and various statistical methods [5,6] and tools [7] have been developed for this task. However, these methods cannot reflect the learning conditions of students over a specific period, and do not facilitate the discovery of new knowledge patterns from the data set for the development of new and accurate models. Advancements in machine learning have led to powerful data visualization methods and a variety of models for clustering, classification, and prediction, including algorithms that can dynamically process data streams [8]; other algorithms, such as the support vector machine, have been used to detect students who may fail courses as an early warning [9]. Research on machine learning using behavioural data for performance prediction has attracted considerable attention, such as grade prediction using online behaviour [10–12], gaming behaviour [13], consumption behaviour [14] and travel behaviour [15]. To mine on-line behaviour more deeply, artificial neural networks have been used for log data mining with desirable results [16]. However, for knowledge tracing, such as predicting students' answers to test questions, artificial neural networks have not yielded satisfactory results, and recurrent neural networks (RNNs) had to be introduced [17]. Further, when dealing with long sequences, RNNs suffered from gradient explosion and gradient vanishing, which

**Citation:** Xu, F.; Qu, S. Data Mining of Students' Consumption Behaviour Pattern Based on Self-Attention Graph Neural Network. *Appl. Sci.* **2021**, *11*, 10784. https://doi.org/ 10.3390/app112210784

Academic Editor: Federico Divina

Received: 14 October 2021 Accepted: 12 November 2021 Published: 15 November 2021


**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

were prone to occur [18]. To overcome these shortcomings, long short-term memory (LSTM) networks were introduced [19]. LSTM networks handle prediction problems with temporal data well, but they have defects in processing spatial data. Moreover, the same behaviour becomes ambiguous in purpose if differences in spatial location are ignored. For example, the behavioural meanings of fetching boiled water in the teaching building, in the dormitory, and in the bathroom are obviously different: we can speculate that the purposes are, respectively, to study, to play games in the dormitory, and to take a bath. The lack of spatial location information therefore affects both the prediction accuracy of models and the analysis of discriminative behavioural patterns. Even when spatial position information is considered to mine the behavioural characteristics contained in consumption data, the topological graph structure formed by behaviour cannot be mined well by the above methods, which lose their good performance on data with such special structures. Mining data with a graph structure for performance prediction naturally requires the introduction of new tools. At present, graph neural networks (GNNs) have undergone rapid development, have been extensively used [20–22] and are suitable for dealing with graph-structured data. We observed that the behaviour of students can be represented in the form of a graph in which particular behavioural patterns and tendencies can be observed, which inspired us to use graph-related tools to mine the information hidden in behavioural patterns using students' spatial location changes over time.

In this study, we aimed to utilize the information on spatial position changes over time reflected in consumption data collected from a school to extract behavioural pattern features and construct graph structures for mining discriminative behavioural characteristics and behavioural trends related to academic performance, as well as for performance prediction. Firstly, we proposed two guidelines for constructing the graph structures by extracting features from the consumption behaviour data. Secondly, we proposed an improved self-attention mechanism model based on previous graph self-attention mechanisms. Thirdly, we used graphs composed of behaviour characteristics to perform classification and determined discriminative learning and life behaviour sequence patterns.

The remainder of this article is organized as follows: In Section 2, we present further current research on consumption behaviour and discuss the possibility of using a GNN for graph mining. In Section 3, we explain the data used in our study and the processing methods; we also describe the method of graph construction and the improved model. In Section 4, we report and analyse the experimental results. In Section 5, we propose some open issues, summarize the paper and present some shortcomings of our research.

#### **2. Related Work**

The need to continuously adapt to rapid development and educational innovation has led to the emergence and application of EDM in various fields. Scholars investigated the three aspects of student performance, teaching quality, and policy making, and found that student performance had the greatest significance [23,24]. Related research has also focused on performance prediction and the discussion of methods [25], models [26] and data [27], and models and methods have been considerably improved for different purposes.

From the perspective of students, performance prediction [28,29], early warning of failure in subjects [30], sentiment analysis [31] and course recommendation [32,33] are key issues. Moreover, the trajectory of school behaviour is rich in information about students' learning habits, and analyses of behavioural patterns based on spatial location change have been widely carried out for performance prediction. The trajectory of students' behaviour at school helps us understand various characteristics of their learning status and lifestyle [34]. Dalvi [35] focused on students' green and low-carbon behaviour, and Islam [36] studied electronic product consumption; both studies investigated only students' lifestyles, and the impact of such consumption behaviour on

academic performance was not studied. Mei [37] used campus behaviour data to predict academic performance; however, the discriminative learning and life patterns that can help distinguish students with different learning levels remained unclear. Further, time and location information has not yet been considered in most studies. Li [14] focused on behaviours that could reflect the regularity of students' lives; however, such behavioural patterns are not discriminative enough by nature. In study [38], based on the behaviour records of undergraduates' smart cards, the authors studied the impact of students' diligence and the regularity of their daily life on grades. Owing to a lack of spatial information, the authors had to treat different behaviours as the same behaviour, which affected the prediction results. Thus, consideration of behavioural patterns with spatial location information is necessary.

Advanced statistical methods were extensively applied in EDM during its early stage, such as the *t*-test [39], for the prediction of academic performance [40,41]. Statistical methods are suitable for small samples; for larger data, it was necessary to put forward intelligent algorithms. Machine learning algorithms have been utilized to predict academic performance [42], to warn students who might fail certain courses [43] and to predict graduation rates [44]. For machine learning algorithms that did not perform well on time series data, the recurrent neural network (RNN), an improvement on artificial neural networks, showed a good mining effect [45], yet it appeared ineffective for non-Euclidean data, such as those with a graph structure. There was also a lack of in-depth mining of the student behaviour tendencies reflected by spatial location information. Owing to the particularity of the graph structure, it was also difficult to process and analyse with general models. We therefore studied behavioural characteristics from the perspectives of spatial topological structure and the time dimension, and we introduced a new and powerful tool for dealing with non-Euclidean structured data.

As a powerful tool for processing non-Euclidean structured data, graph neural networks have undergone rapid development and been widely applied, for example in knowledge graphs [46], natural language processing [47], graph-based text representation [48] and graph embedding techniques [49]. In particular, some scholars have proposed the graph attention mechanism to improve the performance of node classification [50]. Because of graphs' powerful information representation abilities and wide applicability, some scholars have applied them to recommendation systems [51], such as music recommendation in mobile networks [52,53]. Specifically, Zhang [54] used a bipartite graph to perform context-sensitive web service discovery. Notably, community detection is also a key task [55,56]. Consumption behaviour at school has a structure and characteristics similar to those of social networks and graphs. A topological structure must therefore be introduced to distinguish behavioural patterns and find students' behavioural tendencies. Inspired by the node classification method, we improved the existing self-attention GNN to mine consumption behaviour data from both the temporal and spatial aspects.

#### **3. Methods**

#### *3.1. Data Description*

As the behaviour of students is often much the same regardless of the semester, we collected behaviour data over only a single month. The activities of first-year students are relatively messy owing to their curiosity and are difficult to analyse, while third-year and fourth-year students need to engage in social work outside the school, which leads to very short time on campus; their consumption behaviours are therefore sparse and inconvenient to analyse. We thus considered the consumption data of all second-year students at Beijing Institute of Technology, which is characterized by science and engineering majors and has a male to female ratio of 2.2 to 1. Students use campus cards for on-campus consumption, and the corresponding data are transmitted to and stored in the school's campus card consumption system, which is relatively convenient to access. From December 2020 to January 2021, we collected 752,725 pieces of consumption data generated during May 2020, together with 3640 final exam scores for the course of

data structure, from two data systems: the campus card consumption system and the educational administration system. The collected consumption data are detailed in Table 1, and a more detailed explanation of the "action" field is presented in Table 2.


**Table 1.** Collected raw data.

**Table 2.** Specific explanation of different consumption behaviours.


#### *3.2. Data Preprocessing*

Based on the above data, we performed several preprocessing steps, as described by the pseudo-code of Algorithm 1.


In the collected data, "management office" refers to students' short-term campus card recharging behaviour; "school bus" refers to commuting between campuses, where the uncertainty of the destination campus makes it impossible to analyse the behavioural purpose; as for "gym", the overall time spent in the gym and its purpose cannot be inferred from this record. Hence, we deleted the related behaviours in the preprocessing algorithm. After preprocessing, we obtained data for 3616 students and then considered extracting features from the following three aspects:


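The deletion step described above can be sketched in a few lines of Python; the record layout and field names ("place", "action") are assumptions made for illustration, since Algorithm 1's pseudo-code is not reproduced here.

```python
# Hypothetical sketch of the record-filtering step of the preprocessing:
# records from venues whose behavioural purpose cannot be inferred are dropped.
EXCLUDED_PLACES = {"management office", "school bus", "gym"}

def preprocess(records):
    """Keep only records whose behavioural purpose can be inferred."""
    return [r for r in records if r["place"] not in EXCLUDED_PLACES]

records = [
    {"student_id": 17347, "place": "canteen", "action": "meal"},
    {"student_id": 17347, "place": "school bus", "action": "fare"},
    {"student_id": 61465, "place": "library", "action": "water"},
]
kept = preprocess(records)
# Only the canteen and library records survive.
```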
After obtaining the features, we applied the chi-square test, the F-test and other feature selection methods, and finally determined thirty features relevant to the grades. Table 3 presents some of the information related to the extracted features.

We noted that the features related to consumption amounts were intermediate indicators, meaning that both excessive and very low consumption were abnormal. Thus, this type of feature was transformed into a maximum-type feature using Formula (1):

$$x\_i' = 1 - \frac{|x\_i - x\_{\text{best}}|}{\max\_i \{|x\_i - x\_{\text{best}}|\}} \tag{1}$$

The same method was applied to the other similar intermediate indicators, as presented in Table 3.
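Formula (1) can be sketched numerically as follows; the optimum value `x_best = 300` and the sample values are purely illustrative, not taken from the paper's data.

```python
# Minimal sketch of Formula (1): an intermediate indicator (best at x_best)
# is converted into a maximum-type one in [0, 1], where the value closest to
# x_best maps to 1 and the farthest value maps to 0.
def intermediate_to_max(values, x_best):
    deviations = [abs(v - x_best) for v in values]
    worst = max(deviations)
    if worst == 0:                       # all values already optimal
        return [1.0 for _ in values]
    return [1 - d / worst for d in deviations]

scores = intermediate_to_max([100, 300, 500, 700], x_best=300)
# -> [0.5, 1.0, 0.5, 0.0]
```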



**Table 3.** Extracted features based on students' data and reality.

#### *3.3. Graph Construction*

After the extraction of the behavioural sequence features, the data lost their natural graph structure. Constructing suitable graphs from these features thus became a challenge. To reflect the continuity and trends of students' behaviours, we constructed graphs based on the characteristics of students' behaviours at school and on daily experience, regarding the features as nodes.

#### 3.3.1. p-Clique

For a graph (*V*, *E*), *V* is a set of elements called vertices and *E* is a multiset of unordered pairs (*u*, *v*) of vertices called edges. Two vertices are adjacent if there exists an edge between them. A clique in a graph is a set of pairwise adjacent vertices; if the number of vertices involved is *p*, it is called a *p*-clique. The behavioural patterns of students in daily school life are often consistent: while some patterns may not be obviously related to one another, the extracted behaviour sequence features are interrelated. This interrelationship between the features is an important indicator for distinguishing students with different learning levels. Based on this, we arranged part of the extracted behaviour sequence features into a p-clique in the constructed graph.
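Arranging a group of feature nodes into a p-clique amounts to joining every pair of the p nodes by an edge, which can be sketched as follows; the node identifiers are illustrative.

```python
# Sketch of building a p-clique: the edge set of a complete graph on p nodes.
from itertools import combinations

def make_clique(nodes):
    """Return the edge set of a complete graph over the given nodes."""
    return set(combinations(sorted(nodes), 2))

edges = make_clique([2, 5, 7])          # a 3-clique over feature nodes 2, 5, 7
# A p-clique has p*(p-1)/2 edges: here 3.
```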

#### 3.3.2. Other Criteria

Two nodes are said to be connected and interrelated if there exists an edge between them. Hence, for two nodes to have a relationship, we considered that the following two criteria must be fulfilled:


Based on the above-mentioned criteria, we constructed a graph for each student, creating a total of 3616 graphs for graph-level classification and using 700 graphs as the test set. Figure 1 presents two graphs constructed according to the above method.

**Figure 1.** Different graph structures built for different students. The differences resulted from the students having different lifestyles. (**a**) Graph constructed for student with exam score 84.88 whose ID number equals 17347; (**b**) graph constructed for student with exam score 91.28 whose ID number equals 61465.

#### *3.4. Model Description*

We proposed an improved self-attention GNN based on previous research. The corresponding pseudo-code of the improved self-attention model is given in Algorithm 2.

#### **Algorithm 2** Graph classification based on improved self-attention GNN

**Input:** Adjacency matrix of the constructed graph *Wadj*, node feature vector *V*, graph indicators *P* and degree matrix *D*

**Output:** The prediction labels on the test set


In the standard self-attention mechanism, given a group of nodes (*x*1, *x*2, ··· , *xk*) and weight matrices *WQ*, *WK*, *WV* that represent different linear transformations of features, the attention coefficients are computed to reflect the pair-wise importance of the nodes, as shown in Equation (2):

$$e\_{ij} = \left(\mathbf{W}\_Q^T \mathbf{x}\_i\right)^T \left(\mathbf{W}\_K^T \mathbf{x}\_j\right), \quad \forall\, 1 \le i, j \le k \tag{2}$$

Then *eij* is normalized over all possible values of *j* using the Softmax function, as shown in Equation (3):

$$\alpha\_{ij} = \frac{\exp\left(e\_{ij}\right)}{\sum\_{1 \le l \le k} \exp\left(e\_{il}\right)} \tag{3}$$

Finally, a weighted sum of transformed features is calculated as shown in Equation (4):

$$d\_i = \tanh\left(\sum\_{1 \le j \le k} \alpha\_{ij} \mathbf{W}\_V^T \mathbf{x}\_j\right) \tag{4}$$
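The standard attention computation of Equations (2)–(4) can be sketched in a few lines of NumPy; the number of nodes, feature dimension and random weights below are purely illustrative.

```python
# Numerical sketch of Eqs. (2)-(4): pair-wise attention scores from query/key
# projections, row-wise softmax over j, then tanh of the weighted sum of
# value-projected features.
import numpy as np

rng = np.random.default_rng(0)
k, d = 4, 3                                   # 4 nodes, 3-dim features
X = rng.normal(size=(k, d))                   # node features x_1 .. x_k (rows)
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

E = (X @ W_Q) @ (X @ W_K).T                   # Eq. (2): E[i, j] = e_ij
A = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)   # Eq. (3): softmax rows
H = np.tanh(A @ (X @ W_V))                    # Eq. (4): one embedding per node
# Each row of A sums to 1, and H holds the new node embeddings d_i.
```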

Additionally, a new set of node embedding vectors can be obtained by using a multi-head graph attention layer. In graph attention neural networks, the attention mechanism generally aggregates only the information of a node's first-order neighbours to update its representation. In the improved model, owing to the characteristics of the extracted features and to better utilize the neighbourhood information of node *vi*, we applied two graph convolution layers. In the convolution layers, we calculated the node embeddings using a linear transformation *W*, and used the ReLU activation function to calculate the raw attention score between pair-wise nodes, as shown in Equations (5) and (6):

$$Z = \text{GCN}(L, V, W, \varepsilon) \tag{5}$$

$$e\_{ij} = \text{ReLU}\left(\vec{a}^T \left(z\_i^{(l)} \,\|\, z\_j^{(l)}\right)\right) \tag{6}$$

where *a* is a learnable weight vector and ‖ denotes concatenation. Next, we applied the SoftMax function to normalize *eij* and computed an attention-weighted sum over the features of all the neighbouring nodes, as shown in Equation (7):

$$h\_i^{(l+1)} = \text{ReLU}\left(\sum\_{j \in \mathcal{N}(i)} \alpha\_{ij}^{(l)} z\_j^{(l)}\right) \tag{7}$$

*N*(*i*) refers to all the neighbours of node *i*. We also used a pooling method to handle redundant information and reduce the amount of computation. The extracted features denote the occurrence frequencies of consumption behaviour sequence patterns, and the ReLU function showed the best performance on our dataset.
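A minimal sketch of the neighbour-restricted attention update of Equations (6) and (7) is given below; the toy triangle graph, the weight vector *a* and the embeddings are all illustrative assumptions.

```python
# Hedged sketch of Eqs. (6)-(7): ReLU-scored attention over first-order
# neighbours, softmax normalisation, then a ReLU-activated weighted sum.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def attention_update(Z, neighbours, a):
    """One layer of Eqs. (6)-(7) applied to every node i."""
    H = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        nbrs = neighbours[i]
        # Eq. (6): raw scores from the concatenated pair embeddings
        e = np.array([relu(a @ np.concatenate([Z[i], Z[j]])) for j in nbrs])
        alpha = np.exp(e) / np.exp(e).sum()        # softmax over neighbours
        # Eq. (7): attention-weighted sum of neighbour embeddings
        H[i] = relu(sum(w * Z[j] for w, j in zip(alpha, nbrs)))
    return H

Z = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy embeddings z_i
neighbours = {0: [1, 2], 1: [0, 2], 2: [0, 1]}       # a small triangle graph
a = np.ones(4)                                       # weight vector (2 * d)
H = attention_update(Z, neighbours, a)
```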

#### **4. Results and Analysis**

We compared our improved self-attention GNN with other graph neural network variants and with some machine learning models on two classification tasks. We mainly discuss the difference in loss reduction between the GCN variants and the improved self-attention GNN. Owing to the uneven distribution of the labels, the results reported below are weighted.

#### *4.1. Experiment Result*

Since the proportion of students with a score below 60 was very small, we set a higher passing line and used a score of 70 to divide the students into two categories, defined as {1, 2}, for the training task. Students whose scores were greater than or equal to 70 were labelled 1, and the rest, with scores below 70, were labelled 2. To construct a large graph and speed up the calculations, we first batched all the training graphs and then trained the self-attention GNN for 300 epochs, as shown in Figure 2. Compared with the other GNN variants trained with the same number of epochs, the loss of our improved model varied sharply during training. Table 4 lists the performance of the different models.

**Figure 2.** Training processes for the two-category classification, in which "self-attention GNN" refers to our improved model.


**Table 4.** The two-category classification results on the test set.

As can be seen from Figure 2, the cross-entropy loss of the improved self-attention GNN decreased very quickly and reached stability rapidly. Although the descent of the SAGEConv GNN also exhibited fluctuations, its loss reduction was much smoother than that of our improved self-attention model. The remaining two graph neural network variants were both highly stable. This reveals the high sensitivity of our improved GNN model to variations in the data.

It was difficult to analyse the behavioural patterns of outstanding students in the two-category classification. Therefore, we further divided the students whose scores were greater than 70 into two categories and conducted a three-category classification task, using the scores of 70 and 85 to divide the students into three categories defined as {1, 2, 3}. Specifically, students with a score below 70 were considered failing, students with a score of at least 70 but below 85 were considered good, and students with a score of 85 or above were considered excellent, corresponding to labels 1, 2 and 3, respectively. We again trained for 300 epochs and observed that the loss reduction gradually stabilized, as shown in Figure 3. The performance of the different models is listed in Table 5.
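The labelling scheme described above can be sketched as follows. Note that the original description is ambiguous about whether a score of exactly 85 counts as good or excellent; the sketch assumes it counts as excellent.

```python
# Sketch of the two- and three-category labelling of exam scores.
def three_category(score):
    """Label 1 = failing (<70), 2 = good (70-84), 3 = excellent (>=85)."""
    if score < 70:
        return 1
    return 2 if score < 85 else 3

def two_category(score):
    """Label 1 = score >= 70, label 2 = score < 70."""
    return 1 if score >= 70 else 2

labels = [three_category(s) for s in (64, 70, 84.88, 91.28)]
# -> [1, 2, 2, 3]
```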

**Figure 3.** Training processes for the three-category classification, in which "self-attention GNN" again refers to our improved model.


**Table 5.** The three-category classification results on the test set.

Figure 3 again shows the high data sensitivity of the improved self-attention GNN; after reaching stability, the cross-entropy loss of our model still showed some fluctuations. After training the SAGEConv GNN for 300 epochs, its cross-entropy loss kept declining and fluctuating. In this regard, our improved self-attention GNN was better than the SAGEConv GNN, which implies that both models may have high data requirements and data sensitivity.

Based on the idea of hypothesis testing, we constructed an indicator to reflect the differences in the discriminative behavioural patterns, using Formula (8):

$$\text{ratio} = \frac{\sum\_{i=1}^{n} x\_i / n}{\sum\_{j=1}^{m} y\_j / m} \tag{8}$$

In the two-category classification, *x* represents the number of occurrences of a behavioural pattern among outstanding students, while *y* represents the number of occurrences of the corresponding pattern among lagging students; in the three-category classification, three ratio indicators were used, one for each score category. The numerator and denominator are divided by the corresponding numbers of students so that group size has no impact. Taking the two-category classification as an example, the related results are presented in Table 6.
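A toy computation of Formula (8) is shown below; the per-student occurrence counts are invented for illustration.

```python
# Sketch of Formula (8): the ratio of per-capita occurrence counts of one
# behavioural pattern for outstanding (x) versus lagging (y) students.
def ratio(x_counts, y_counts):
    return (sum(x_counts) / len(x_counts)) / (sum(y_counts) / len(y_counts))

outstanding = [6, 8, 10]     # occurrences per outstanding student
lagging = [2, 3, 4]          # occurrences per lagging student
r = ratio(outstanding, lagging)
# r = 8 / 3 > 2, so this pattern would count as discriminative.
```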

#### *4.2. Result Analysis*

In terms of the cross-entropy loss curves in Figures 2 and 3, the improved self-attention model had a higher cross-entropy loss than the other models in the initial stage of training but, despite certain fluctuations, decreased and converged rapidly in both classification tasks. The initial cross-entropy losses and training processes of the other three graph neural network variants were similar in the two-category classification, which may be related to the fact that these models treat the nodes indiscriminately during each iteration. However, the loss of the SAGEConv GNN maintained its downward trend in the three-category classification. Judging from the model's prediction accuracy and other performance indicators, we speculate that although its loss declined, the model may have been over-fitting, leading to poor performance on the test set. In addition, the training processes of both our improved model and the SAGEConv GNN showed larger fluctuations, which suggests that they may be more sensitive to the label distribution.



In terms of prediction accuracy (Tables 4 and 5), the improved self-attention GNN performed better than every other model in both classification tasks, achieving accuracies of 84.86% in the two-category classification and 79.43% in the three-category classification. Moreover, precision reached 94.84% and 97.20%, and the F1-score reached 91.81% and 87.28%, respectively. Indicators such as precision, recall and F1-score revealed that the improved self-attention model both improved prediction performance and converged to stability rapidly. Judging from several indicators, the three variants, which focus on improving graph convolution methods, performed similarly to one another but worse than our improved model, with accuracies of 68.57%, 65.76% and 70.43% in the first task and 64.71%, 66.06% and 67% in the second. Compared with the machine learning algorithms, the overall performance of the graph neural network variants was superior, demonstrating the graph's strong ability to mine the mutual influences between local nodes in depth. By comparison, the machine learning models treat the input features equally, ignoring their potential mutual influence, which to some extent explains their relatively unsatisfactory performance. The KNN classifier also performed relatively well, with identical accuracy and recall of 84% and 79.14% in the two tasks, respectively. However, the KNN cannot predict behavioural trends or identify discriminative behavioural patterns, and its performance depends on the training set and on how the distance between data points is defined, which limits its applicability.
The decision tree can give judgments through internal nodes and determine classification results at leaf nodes; however, the depth and number of leaf nodes required for good predictive performance carry a certain uncertainty, and like the KNN it cannot predict students' behavioural tendencies. It is better suited to ensemble learning to improve accuracy; in this research its accuracy reached only 76% and 68.05% in the two tasks, with similar values for the remaining three indicators. As for logistic regression, this method can be regarded as generalized linear regression and cannot express more complex non-linear relationships, let alone mine local mutual influences. These shortcomings led to the following results: accuracy equalled recall in both tasks at 61.14% and 48.29%, precision was 81.39% and 72.8%, and the F1-scores were 66.64% and 53.49%. By paying more attention to local structural information, the GNN performs better local mutual-information mining in addition to improving prediction accuracy. The improved self-attention GNN can further identify discriminative behavioural patterns and predict students' behavioural trends from the node scores and the existence of edges, providing insight into the behavioural patterns of outstanding students.

Next, we discuss the prediction of students' behavioural trends. Because of the self-attention mechanism, different feature nodes contribute differently to the performance prediction. We use different colour shades to represent the scores of the different nodes, as shown in Figure 4: the darker the colour, the higher the score, demonstrating the relationship between the different behavioural patterns and their contribution to predicting the student's academic performance. Figure 4a shows one of the trained graphs from the two-category classification task, and Figure 4b the corresponding graph from the three-category classification. In the two tasks, different nodes made different contributions to grade prediction.

**Figure 4.** The score of the graph node after improved self-attention GNN training is indicated in each node by colour shade. (**a**) Two-category classification for students whose ID number is 61465 and exam score is 91.28; (**b**) Three-category classification for student whose ID number is 61465 and exam score is 91.28.

As can be seen from Figure 4, the number of paid days (node 2) had the highest score, which showed that the student's daily life habits were relatively consistent and regular. In addition, the behavioural pattern of studying after dinner and then returning to the dormitory (node 25), and the number of times the student studied after lunch (node 15), greatly influenced the prediction. We also noticed that whether the student ate at least lunch and dinner during the day impacted their academic performance; this behavioural pattern is likewise a reflection of the regularity of their daily life. Additionally, taking the graph in Figure 4a as an example and considering behavioural tendency, single node 21 suggests that the student was most likely to study and then return to the dormitory after finishing dinner. Among the neighbours of node 4 (the number of times eating both breakfast and lunch or both lunch and dinner), the influence of node 6 (the number of dinners) was relatively large, which shows that if the student tended to have dinner, the student was most likely to eat all three meals. A similar analysis can be produced for other students.

Considering the discriminative behavioural patterns in Table 6, apart from patterns such as the number of days with consumption records, returning to the dormitory after dinner, and the total number of meals over a month, the ratios of the remaining indicators exceeded 2; on average, outstanding students exhibited these behavioural patterns more than twice as often as lagging students. Further, we speculate that some of the corresponding behavioural patterns of top students occurred even more frequently. The visualization results were consistent with these findings, supporting the correctness and rationality of our conclusions. We therefore believe that these characteristics can be used as a reference to distinguish students' learning levels.

#### **5. Open Issues and Conclusions**

#### *5.1. Open Issues*

In fact, judging from the results, GNN methods still need improvement for EDM. Below, we list some open issues for further research:


#### *5.2. Conclusions*

In this study, we focused on mining on-campus consumption data to identify discriminative behavioural patterns based on spatial location changes, analyse behavioural tendencies, and predict students' academic performance by constructing behaviour networks. Firstly, we preprocessed the collected consumption data to extract features reflecting students' living habits and learning status. Secondly, we constructed campus behaviour networks both from observed behaviour and artificially, including the introduction of the p-clique. Thirdly, combined with a pooling method, an improved self-attention GNN was trained and achieved good prediction performance on the test set. Regarding discriminative behavioural patterns, the habit of getting up early, continuous study in the classroom until meal time (lunch or dinner), and studying after each of the three meals were discriminative for distinguishing students at different learning levels, since the defined ratios of these features exceeded 2 and we can reasonably believe that they are discriminative on average. These newly discovered behavioural patterns were consistent with the visualization results, conformed to the actual situation, and provide a useful reference. For the behavioural tendency analysis, judging from the pattern represented by node 21 alone, the student was more likely to continue studying after dinner than to return to the dormitory; from the perspective of node interaction and edge existence (e.g., node 4 and its neighbours), students who often ate dinner also tended to eat three regular meals.

However, some limitations of the present study should be noted. One arguable weakness is that the graphs we consider still do not constitute a large sample; for multi-category classification, the current method may suffer from having few training samples per category, a problem that few-shot learning may help solve. Another weakness is that we discard some information, such as behaviour related to the gym and school buses. In addition, the model has limited applicability and may be difficult to transfer to other universities.

As for future research directions, we are considering building a larger behavioural network involving all students at the school, based on their consumption records and, where possible, useful video material, to carry out analyses for different purposes, such as detecting absenteeism.

**Author Contributions:** Conceptualization, S.Q. and F.X.; methodology, F.X.; software, F.X.; validation, S.Q.; formal analysis, F.X. and S.Q.; investigation, S.Q.; resources, S.Q.; data curation, F.X.; writing original draft preparation, F.X.; writing—review and editing, S.Q.; visualization, F.X.; supervision, S.Q.; project administration, S.Q.; funding acquisition, S.Q. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy concerns; this also applies to the code.

**Conflicts of Interest:** The authors declare that there is no conflict of interest regarding the publication of this paper.

#### **References**


## *Article* **Deep Learning-Based Water Crystal Classification**

**Hien Doan Thi 1,†, Frederic Andres 2,\*,†, Long Tran Quoc 1,†, Hiro Emoto 3,†, Michiko Hayashi 3,†, Ken Katsumata 3,† and Takayuki Oshide 3,†**


**Abstract:** Much of the earth's surface is covered by water. As was pointed out in the 2020 edition of the World Water Development Report, climate change challenges the sustainability of global water resources, so it is important to monitor the quality of water to preserve sustainable water resources. The quality of water can be related to the structure of the water crystal, the solid state of water, so methods to understand water crystals can help to improve water quality. As a first step, a water crystal exploratory analysis has been initiated in cooperation with the Emoto Peace Project (EPP). The 5K EPP dataset has been created as the first world-wide small dataset of water crystals. Our research focused on reducing the inherent limitations when fitting machine learning models to the 5K EPP dataset. One major result is the classification of water crystals and how to split our small dataset into several related groups. Using the 5K EPP dataset of human observations and past research on snow crystal classification, we created a simple set of visual labels to identify water crystal shapes, in 13 categories. A deep learning-based method has been used to perform the classification automatically on a subset of the labeled dataset. The classification achieved high accuracy when using a fine-tuning technique.

**Keywords:** water crystal; deep learning; fine-tuning; supervised; classification

**Citation:** Thi, H.D.; Andres, F.; Quoc, L.T.; Emoto, H.; Hayashi, M.; Katsumata, K.; Oshide, T. Deep Learning-Based Water Crystal Classification. *Appl. Sci.* **2022**, *12*, 825. https://doi.org/10.3390/app12020825

Academic Editor: Chuan-Ming Liu

Received: 31 October 2021 Accepted: 10 December 2021 Published: 14 January 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

### **1. Introduction**

Along with the development of society, research on the human impact on nature is of growing concern. Water quality [1] has become one of the main challenges that societies will face during the 21st century, as the United Nations brought water quality issues to the forefront of international action under Sustainable Development Goal 6. It is important to monitor how human actions affect water quality and pollution. Water has always played an important role in the climatic ecosystem: most of our planet is covered by water, and 70 to 90% of the human body (depending on age) is water. Water can exist in different states or phases (liquid, solid, and gas), and advanced research [2] has been completed to understand these phases, resulting in the discovery of a new phase of liquid water. Water quality can be evaluated in each of the four phases. The focus of our research is the solid water phase known as the frozen water crystal. We define a "frozen water crystal" as a microscopic crystal observed at the tip of a protrusion formed when liquid water is dropped onto a petri dish and frozen. The structure of the crystal is three-dimensional, and the crystal structure differs depending on the information carried by the water. Crystals are formed when water changes to a solid state, such as when it is frozen at −25 to −30 °C. Depending on the origin of the water and the formation process, crystals are divided into three main types: snow crystals, ice crystals, and water crystals. The shape of the crystal clearly reflects the purity level and texture, which enables us to assess the quality of the water. Depending on environmental conditions and the impact of surrounding elements, the same water can give many different shapes, and each crystal shape can be considered unique, without repetition. Many studies have classified snow crystals and ice crystals [3–6]. In recent years, the application of deep learning to crystals became popular with the publication of 3D crystal classification [7]. However, no real classification and understanding of water crystals based on deep learning has been completed until now.

The innovation of this article is the application, for the first time, of artificial intelligence to a dataset of photos of water crystals known as "the 5K EPP Dataset", collected in collaboration with the I.H.M General Research Institute. This dataset is composed of high-resolution photos captured by a microscope camera under laboratory conditions. All the photos have been stored and managed by the I.H.M General Research Institute. A simple definition of water crystal structure has been proposed in this research, which references the snow crystal classification from [5] and the EPP project. This definition provides an easy understanding of the water structure from a 2D image.

For non-prohibited purposes, the water crystal dataset can support other researchers interested in water alongside their own datasets [8,9]. These results help assess the effect of human beings on water crystal formation and on the quality of water.

Deep learning has been widely applied in many research fields and has achieved surprising results, especially in image processing. Convolutional neural networks (CNNs) are a special kind of neural network for analysing high-dimensional data such as images and videos. CNNs were developed with image processing in mind, which makes them computationally more efficient than other multi-layer back-propagation neural networks. CNNs can automatically extract features from the dataset, which simplifies subsequent processing. These features are useful not only for a specific task but also for other related tasks, opening a new era of research in which good results are achieved with reduced effort. With their well-understood architecture, CNNs are now used widely in many areas, including image and structure classification. However, deep neural networks (DNNs) trained by conventional methods on small datasets commonly perform worse than traditional machine learning methods [10]: the deeper the network, the more data it requires to train. This limitation hinders the wide application of deep learning in fields in which collecting and assembling big datasets is a challenge, and with the 5K EPP dataset we face the same problem. We use data transformation techniques to enhance our dataset, and the fine-tuning method is used to help the model learn better from a small dataset. Class weighting is also used in this research to address the imbalanced dataset problem. To build a classifier with deep learning, we split our work into two main steps: feature extraction and classification. We built a deep learning model to extract meaningful features from the EPP dataset, then used those features to classify water crystal structures. We used two different techniques to extract features from the EPP dataset: a convolutional auto-encoder and fine-tuning. The extracted features are then fed into further layers to form a classifier.
The convolutional auto-encoder (CAE) [11] has been widely applied to dimension reduction and image noise reduction. Because the CAE model preserves the spatial information of the original image and extracts information gradually through convolution layers, it is considered one of the state-of-the-art techniques in deep learning today. Furthermore, it is an unsupervised method and can be used with less effort than a supervised one. Fine-tuning is a useful method for improving the performance of a neural network, helping researchers achieve higher performance with less effort.

In fine-tuning, a model trained on one task is reused for another, similar task. This method reduces the training time and effort needed to extract meaningful features from the original input. ImageNet pre-trained models have been used for the fine-tuning method. This paper is organized as follows. Related works are discussed in Section 2, the 5K EPP dataset is described in Section 3, our dataset study approach and methods are in Section 4, Section 5 describes the experimental results, and the conclusions follow in Section 6.

#### **2. Related Works**

With a research focus on improving precipitation measurement and forecasting for over 50 years, scientific studies of meteorology and weather include the study of snowflakes, ice crystals, and water crystals. Snowflake studies provide some of the most detailed evidence of climate change and have an impact on atmospheric science. One of the first attempts to catalog snowflakes was made in the 1930s by Wilson Bentley, who created a method of photographing snowflakes in 1931 using a microscope attached to a camera. The Bentley Snow Crystal Collection (https://snowflakebentley.com/ accessed on 15 October 2020) includes about 6125 items. A general classification of snow crystals, the *Ta* − *s* diagram, was proposed by Nakaya [3]; it provides the most complete classification from a physical point of view, with 7 categories. These categories include needles, columns, fern-like crystals developed in one plane, combinations of column and plane crystals, rimed crystals, and irregular crystals. The crystal images were collected from a slope of Mount Takachi, near the center of Hokkaido Island. Magono [4] published an improved version of Nakaya's classification, with modifications of and supplements to Nakaya's classification of snow crystals. The results were obtained from laboratory experiments and meteorological observation. The new classification provides temperature and humidity conditions, which can describe the meteorological differences among groups of asymmetric or modified types of snow crystals. It provides 80 categories, modifying Nakaya's categories and adding some new ones. Thirty thousand microscopic photographs of snow crystals taken by the Cloud Physics Group were used in their research.

Kikuchi and his team [5] proposed a new classification with 121 categories covering snow crystals, ice crystals, and solid precipitation particles. They qualified their classification as "global scale" or "global" because their observations were performed from the middle latitudes (Japan) to the polar regions. This classification consists of three levels: general, intermediate, and elementary, composed of 8, 39, and 121 categories, respectively. Interestingly, it can be used not only for snow crystals but also for ice crystals. The deep learning method has been widely applied in many research fields, especially to image datasets, but it faces the problem of working from limited data. Fortunately, a new image collection method was proposed: the Multi-Angle Snowflake Camera (MASC) [12]. It was developed to address the need for high-resolution multi-angle imaging of hydrometeors in freefall and has produced datasets comprising millions of images of falling snowflakes, and several studies have resulted from this development. A new method to automatically classify solid hydrometeors based on MASC images was presented by Praz et al. [13], who proposed a regularized multinomial logistic regression (MLR) model that outputs probabilistic information for MASC images. The probability is then weighted over the three stereoscopic views of the MASC to assign a unique label to each hydrometeor. The MLR model was trained using more than 3000 MASC images labeled by visual inspection and achieved very high performance, with 95% accuracy. Hicks et al. [6] published an automatic method to classify snowflakes collected via the MASC; the training dataset contains 1400 MASC images. They used a convolutional neural network and a residual network pre-trained on ImageNet as the backbone of their model. Snowflakes were sorted by geometry and divided into 6 distinct classes, and the degree of riming was then decided by another training process with three distinct classes. Although the accuracy of this research is only 93.4%, it provides a new way to classify snowflakes and other natural structures automatically.

Another study with the MASC dataset was proposed by Leinonen et al. [14]. They aimed to classify a large-scale MASC dataset with an unsupervised learning method, using a generative adversarial network (GAN) [15] and K-medoids [16]. With the features extracted from the discriminator part of the GAN model, they used the K-medoids algorithm to cluster all the images (data points) into 16 classes/categories. This method not only reveals hierarchical clustering groups but also requires no human intervention for such a large dataset. However, MASC images mainly show the crystal's degree of riming rather than its structure, because the images were taken while the snowflakes were falling.

In this research, we focus on building a new definition for water crystal classification based on previous studies and using deep learning to automatically classify them.

#### **3. The 5K EPP Dataset**

The water crystals have been provided by the Emoto Peace Project (EPP) at the I.H.M General Research Institute (Tokyo, Japan). Crystals were produced from water samples collected from many countries and sources, with the help of scientists all around the world. Water samples from each bottle are produced by the same procedure in [9]:


Known as the **5K EPP dataset** [17], this dataset contains 5007 crystal photos in total. Because the 5K EPP dataset contains very high-resolution images (5472 × 3648 pixels) in which the water crystal occupies only a small part, we needed to preprocess each image to remove the background. We used Otsu's method [18] to automatically define the border around each crystal, and the minimal rectangular box covering the crystal was used to crop away the background. This reduces the image size while retaining the details of the object. Because the size of the water crystal differs between images, we resized the cropped images to a common size to fit the input of our model.
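The preprocessing step can be illustrated with a minimal, dependency-free sketch of Otsu's method and the bounding-box crop, assuming a single-channel grayscale image (the actual pipeline works on high-resolution colour photos):

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold that maximises the between-class
    variance of the grayscale histogram (pixel values 0..255)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()  # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0        # class means
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var = w0 * w1 * (mu0 - mu1) ** 2                  # between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def crop_to_foreground(gray):
    """Crop to the minimal box covering pixels above the Otsu threshold."""
    mask = gray >= otsu_threshold(gray)
    rows, cols = np.any(mask, axis=1), np.any(mask, axis=0)
    r0, r1 = np.where(rows)[0][[0, -1]]
    c0, c1 = np.where(cols)[0][[0, -1]]
    return gray[r0:r1 + 1, c0:c1 + 1]

# Synthetic example: a bright "crystal" on a dark background.
img = np.zeros((100, 100), dtype=np.uint8)
img[20:40, 30:60] = 200
print(crop_to_foreground(img).shape)  # (20, 30)
```

In practice one would use a library implementation (e.g. OpenCV or scikit-image), but the logic of thresholding followed by a minimal bounding-box crop is the same.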

The preprocessed dataset was then sorted into 13 categories. Based on the knowledge of the EPP Laboratory experts, we chose the categories that appeared most frequently in the 5K EPP dataset as our labels. We built a tree-like diagram in Figure 1 to demonstrate how we split the 5K EPP dataset into smaller categories. The branches of the tree correspond to the categories in the definition; finally, we obtained 13 branches corresponding to 13 categories. The details are given in Table 1. We chose the most typical images for each category and labeled them with the predefined definition. We split the 5K EPP dataset into a training set and a test set with ratios of 80% and 20%, respectively. The scikit-learn (https://scikit-learn.org/ accessed on 15 October 2020) library was used to split the dataset randomly while maintaining class balance.
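A balanced split of this kind can be sketched with scikit-learn's `train_test_split`; the file names and labels below are placeholders, not the actual dataset:

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the labelled samples: `paths` would be image
# file paths, `labels` the 13 category indices.
paths = [f"crystal_{i}.jpg" for i in range(100)]
labels = [i % 13 for i in range(100)]  # placeholder labels

# stratify=labels keeps each category's proportion the same in both splits.
train_x, test_x, train_y, test_y = train_test_split(
    paths, labels, test_size=0.2, random_state=42, stratify=labels
)
print(len(train_x), len(test_x))  # 80 20
```

Stratification is what "guarantees the balance" mentioned above: each of the 13 categories appears in the test set in roughly the same proportion as in the full dataset.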

**Figure 1.** A tree-like diagram to demonstrate the water crystal categories with 5K EPP dataset.


**Table 1.** The definition for water crystal classes based on the knowledge from [5] classification.

#### **4. Proposed Method**

*4.1. Feature Extractor*

4.1.1. Residual Auto-Encoder

A convolutional auto-encoder (CAE) is an efficient technique used to reduce dimensionality and generate high-level representation from raw data. It is an unsupervised learning algorithm using a back propagation algorithm to update parameters. In this model, the targets are equal to the inputs. A convolutional auto-encoder is composed of two models: an encoder and a decoder. The encoder aims to find the latent representation for input data, while the decoder is tuned to reconstruct the original input from the encoder's output.

Considering a dataset *X* with *n* samples and *m* features, the encoder learns a latent representation *H* and the decoder tries to reconstruct the original input *X* from *H* by minimizing the difference between *X* and its reconstruction *X*′ over all samples:

$$\min\_{W,W'} \frac{1}{n} \sum\_{i=1}^n \|D\_{W'}(E\_W(X\_i)) - X\_i\|\_2^2.$$

For a convolutional auto-encoder,

$$E\_W(X) = \sigma(X\*W) = H$$

$$D\_{W'}(H) = \sigma(H\*W') = X'$$

where *W* and *W*′ are learnable parameters and *σ* is an activation function such as ReLU or sigmoid. At the end of the training process, the embedded code *H* is used as a new representation of the input *X*; *H* can then be fed into a fully connected layer for classification or clustering tasks. We propose a new CAE model to extract latent representations from high-resolution water crystal images. First, in the encoder, three convolution layers are stacked on the input images to extract the latent representation; the encoder's output is then flattened into a vector, which is the extracted feature. The decoder transforms the embedded features back into the original image. Convolution (transpose) layers with stride allow the network to learn spatial subsampling (upsampling) from the data, leading to a higher transformation capability. Therefore, instead of using a convolution layer followed by a pooling layer, we used strided convolution layers in the encoder and strided transposed convolution layers in the decoder. Obtaining a low-dimensional, highly representative encoding from very high-resolution images is challenging: downsampling images to a low-dimensional representation can lead to vanishing gradients when training a very deep neural network, and with a traditional CAE, the greater the number of hidden layers, the harder it is to reconstruct the original image. To solve this problem, we borrowed the skip connection idea from ResNet [19]. Skip connections address the vanishing gradient problem and prevent information loss: instead of letting the model learn the underlying mapping directly, we let it learn the residual mapping.

With a skip connection, the residual and the identity are added to produce the output, defined as follows: *y* = *R*(*x*) + *x*.

Because the identity connection carries *x* directly, the model actually learns the residual *R*(*x*). We used two kinds of residual block to build an encoder block: a regular block and a downsample block. The regular block has three convolution layers with the same number of output channels. The downsample block reduces the spatial resolution of the input: its first strided convolution resamples the feature map at a lower rate, discarding intermediate positions. Each convolution layer is followed by a batch normalization layer and a ReLU activation function. We then skip these three convolution operations and add the input directly before the final ReLU activation. This design requires the output of the three convolution layers to have the same shape as the input so that they can be added together. The downsample block has the same design as the regular one, except that its first convolution layer reduces the image size and changes the number of channels; to add the input before the last ReLU activation, we used a 1 × 1 convolution layer followed by a batch normalization layer to transform the input into the shape required for the addition. Through experimentation, we found that using two convolution layers after the first convolution layer in each block helps the model reconstruct the output better.

The first convolution layer downsizes the image by a factor of two, and the following two learn useful information. Each convolution layer is followed by a batch normalization layer and an activation layer (except the last one); we chose ReLU as the activation. The skip connection uses convolution and batch normalization to reduce the input to the same size as the output. The final architecture is shown in Figure 2: we used one convolution layer and three residual blocks to build the encoder, while the decoder mirrors this structure. The final model is called a residual auto-encoder (RAE). The reconstruction loss was used to evaluate the performance of the RAE model, and the parameters of the encoder and decoder were updated by minimizing the reconstruction error:

$$L\_r = \frac{1}{n} \sum\_{i=1}^n Distance(D\_{W'}(E\_W(\mathbf{x}\_i)), \mathbf{x}\_i).$$

**Figure 2.** A residual auto-encoder model to extract features from the original images. Each residual block combines a downsample block and a regular block.
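As an illustration, a downsample residual block of the kind described above might be sketched in PyTorch as follows; the layer sizes and channel counts are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual encoder block: the first convolution halves the spatial
    size (stride 2) and changes the channel count; a 1x1 strided
    convolution on the skip path matches the shapes so the input can be
    added back before the final ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.skip = nn.Sequential(  # shape-matching shortcut
            nn.Conv2d(in_ch, out_ch, 1, stride=2),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))  # y = R(x) + shortcut(x)

x = torch.randn(1, 3, 64, 64)
y = DownsampleBlock(3, 16)(x)
print(y.shape)  # torch.Size([1, 16, 32, 32])
```

A regular block would look the same but with stride 1 and an identity skip path, so that the input can be added without any projection.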

Instead of using the Euclidean distance to compute the reconstruction error, we used the spherical distance from [20]. The latent representations extracted from the RAE model are projected onto the surface of the unit hyper-sphere, and the distance between data points on that surface is measured by the *d*spherical function, defined as follows:

$$d\_{\text{spherical}}(z\_i, z\_j) = \frac{1}{\pi} \arccos\left\langle \frac{z\_i}{\|z\_i\|\_2 + \epsilon}, \frac{z\_j}{\|z\_j\|\_2 + \epsilon} \right\rangle$$

where arccos(*α*) is the inverse cosine function for *α* ∈ [−1, 1] and *ε* is a very small value added to avoid numerical problems.
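A small NumPy sketch of this distance, under the reading that the inner product of the normalised vectors is passed through arccos and scaled by 1/π:

```python
import numpy as np

def d_spherical(z_i, z_j, eps=1e-8):
    """Spherical distance between two latent vectors: normalise onto the
    unit hyper-sphere, then take the angle between them, scaled to [0, 1]."""
    zi = z_i / (np.linalg.norm(z_i) + eps)
    zj = z_j / (np.linalg.norm(z_j) + eps)
    cos_sim = np.clip(np.dot(zi, zj), -1.0, 1.0)  # guard against rounding error
    return np.arccos(cos_sim) / np.pi

a = np.array([1.0, 0.0])
print(d_spherical(a, a))                     # near 0: identical directions
print(d_spherical(a, np.array([0.0, 1.0])))  # 0.5: orthogonal vectors
```

The scaling by π maps the angular distance into [0, 1], so identical directions give a distance near 0 and opposite directions a distance near 1.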

#### 4.1.2. Fine-Tuning Model

The efficiency of the classification model rests on the power of the features extracted from the training dataset. With highly meaningful features, the classifier can achieve good results from the very first training steps. The auto-encoder is a popular strategy for extracting features from an unlabeled dataset; because it requires no labels to train the CNN model, it performs well on high-dimensional datasets, especially images and videos. The model trains on our full dataset and learns the most important information from the images. However, choosing a good architecture for the auto-encoder and training it from scratch is not easy, as it requires a great deal of knowledge about machine learning and the specific dataset. Fine-tuning was introduced to help solve this feature extraction problem: it takes a network that has already been trained for a given task and makes it perform a second, similar task. Many studies have shown that fine-tuning can achieve good results with less effort than starting from scratch.

For image-related tasks, the most common approach is to fine-tune a model trained on ImageNet [21] (with 1.2 million labeled images) by continuing to train it on one's own dataset. A competition, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (http://www.image-net.org/challenges/LSVRC/ accessed on 15 October 2020), has been organized to find state-of-the-art techniques for classification and object detection on the ImageNet dataset. AlexNet [22] was the first large-scale convolutional model to do well on the ImageNet classification task, outperforming all previous non-deep-learning models by a significant margin and winning the competition in 2012. After that, VGG [23] was proposed with the idea of deeper networks and much smaller filters, a significant step forward in deep learning. ResNet [19] introduced residual blocks with skip connections, which allow gradients to backpropagate to the initial layers without vanishing; that model won first prize in the ILSVRC 2015 competition with an error rate of just 3.6%. In 2016, SqueezeNet [24] was proposed, achieving approximately the same accuracy as AlexNet with far fewer parameters, making it suitable for mobile applications. In the same year, DenseNet [25], a densely connected network, was proposed to improve on previously developed architectures by densely connecting all the layers, so that each layer receives the outputs of the preceding layers as input. We used all of these models as backbones to build a classifier for our water crystals. All the experimental results are shown in Table 3.

#### *4.2. Classification Model*

The features extracted in each previous step are then fed into the classifier layers to build the classification model.

The classifier has two main parts: a feature extractor and a classifier head. To build the extractor, RAE pre-trained and ImageNet pre-trained models were used. With the RAE model, we keep only the encoder, which had been pre-trained on the EPP dataset, as the feature extractor; with the ImageNet pre-trained models, the last layer is removed to obtain the final features. The classifier head consists of three fully connected layers added on top of the feature extractor and trained together with it on the labeled dataset. An overview of the final classification model is given in Figure 3.

**Figure 3.** Classification architecture overview. The feature extractor can be replaced by RAE pretrained model or ImageNet pre-trained model. The features are used as input for the next step. The classifier contains 3 fully connected layers, each of them is followed by ReLU and a dropout layer. The last FC outputs the predicted probability distribution. Softmax is added to get the final prediction.

Unlike training a model from scratch, we unfreeze the early layers and train the whole network with a small learning rate, letting the classifier learn patterns from the convolution layers already learned in the pre-trained network. For further evaluation and improvement, we chose different metrics to compare the performance of the different feature extractors; the comparison results are shown in Section 5.
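The fine-tuning recipe above can be sketched as follows; the tiny backbone here is a stand-in for a real ImageNet pre-trained network, and the head sizes are illustrative:

```python
import torch
import torch.nn as nn

# Placeholder backbone standing in for a pre-trained feature extractor
# (e.g. a ResNet with its original output layer removed).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

# Three fully connected layers on top, as in Figure 3: each followed by
# ReLU and dropout, with the last layer producing 13 class logits.
classifier = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(64, 32), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(32, 13),  # softmax is applied at inference time
)
model = nn.Sequential(backbone, classifier)

# All layers stay unfrozen; a small learning rate keeps the pre-trained
# weights from drifting too far from their initial values.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

logits = model(torch.randn(2, 3, 32, 32))
print(logits.shape)  # torch.Size([2, 13])
```

In a real run the backbone would be loaded with pre-trained weights before the head is attached; everything else in the training loop stays the same.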

#### *4.3. Imbalanced Data*

Due to the crystal formation process in nature, the amount of data in each class is imbalanced. When labeling the 5K EPP dataset, we observed an imbalance in the number of images among the categories: some categories contain approximately 20% of the dataset while others contain just 2%. The details are provided in Table 2.


**Table 2.** The 5K EPP dataset summary.

To guarantee balance and accuracy when training the deep learning model, we used the class weight method: each class is assigned a weight that places more emphasis on the minority classes, so the model can learn from all classes equally. The weight depends on the number of images in the class and can be calculated as follows:

$$w\_i = \frac{N}{C \* n\_i}$$

where *wi*, *ni*, *C*, and *N* are the weight assigned to class *i*, the number of images in class *i*, the number of classes, and the total number of images in the dataset, respectively. We also use the *F*1-score metric, in addition to the standard evaluation metric, to evaluate model performance on the imbalanced dataset; both are described in Section 5.1.
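The weighting formula can be computed directly; the class counts below are made up for illustration:

```python
def class_weights(counts):
    """w_i = N / (C * n_i): inverse-frequency weights so minority classes
    contribute as much to the loss as majority ones."""
    n_total = sum(counts)
    n_classes = len(counts)
    return [n_total / (n_classes * n_i) for n_i in counts]

# e.g. 1000 images over three classes, one badly under-represented:
# the rarest class receives the largest weight.
print(class_weights([700, 250, 50]))
```

With these weights, a balanced class (n_i = N/C) gets weight 1, over-represented classes get weights below 1, and under-represented classes get weights above 1.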

#### **5. Experiments and Results**

#### *5.1. Evaluation Metric*

#### 5.1.1. Classification Accuracy

We used standard evaluation metrics to evaluate the classification results. For all implementations, we set the number of classes equal to the number of ground-truth categories that were used to label the dataset in Section 3. The performance is evaluated by the accuracy metric:

$$\text{ACC} = \frac{1}{n} \sum\_{i=1}^{n} \mathbb{1}\left(y\_i^{true} = y\_i^{pred}\right) \tag{1}$$

where $y_i^{true}$ is the ground-truth label, $y_i^{pred}$ is the predicted label, and *n* is the number of images in the test set. The test set is not used when training the model. The best model should have high accuracy on both the training and the test set.
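Equation (1) amounts to counting matching labels, as in this minimal sketch:

```python
def accuracy(y_true, y_pred):
    """ACC = (1/n) * sum of indicator(y_i^true == y_i^pred)."""
    assert len(y_true) == len(y_pred)
    # The comparison t == p plays the role of the indicator function.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 3 of 4 correct -> 0.75
```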

#### 5.1.2. *F*1-Score

With an imbalanced dataset, an efficient way to evaluate model performance is the *F*1-score [26]. Instead of computing the ratio of correct predictions over all images, the *F*1-score measures accuracy through precision *p* and recall *r*. It is defined as follows:

$$F\_1 = 2 \cdot \frac{p \cdot r}{p + r} \tag{2}$$

where *p* is the number of correct positive results divided by the number of all positive results returned by the classifier, and *r* is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive).
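A minimal binary-case sketch of Equation (2), computing *p* and *r* from true/false positive and negative counts:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 = 2*p*r/(p+r), with p = TP/(TP+FP) and r = TP/(TP+FN)."""
    tp = sum(t == positive and q == positive for t, q in zip(y_true, y_pred))
    fp = sum(t != positive and q == positive for t, q in zip(y_true, y_pred))
    fn = sum(t == positive and q != positive for t, q in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For the multi-class setting used in the paper, the per-class scores would typically be averaged (macro-averaging), so each class contributes equally regardless of its size; in practice `sklearn.metrics.f1_score(..., average="macro")` implements this directly.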

#### *5.2. Experiments Environment and Setup*

In this section, we discuss applying different pre-trained models to fine-tuning on our 5K EPP dataset. For our experiments, we used an NVIDIA Tesla V100 SXM2 GPU with 32 GB of memory. The server used for running the experiments was Grid5000 [27] (https://www.grid5000.fr accessed on 15 October 2020), the French large-scale and flexible experimental grid platform consisting of 8 sites geographically distributed over France and Luxembourg. Each site comprises one or several clusters, for a total of 35 clusters inside Grid5000. This platform is dedicated to experiment-driven research in all areas of computer science, with a focus on parallel and distributed computing, including Cloud, HPC, Big Data, and AI. Our implementation is based on Python and PyTorch (https://pytorch.org accessed on 15 October 2020).

#### *5.3. Experiment Results*

#### 5.3.1. Residual Auto-Encoder Model (RAE)

We first trained the RAE model on the unlabelled dataset. The Adam optimizer [28] was used to update the model parameters, with learning rate *α* = 10^−4. Regularization was used to reduce overfitting, with *γ* = 10^−5. We set the batch size to 32 images and trained the model for 100 epochs. We used two different loss functions to train the RAE model: a Spherical metric [tran2019deep] and the Binary Cross-Entropy (BCE). The reconstruction results obtained with both metrics are shown in Figure 4. Although the BCE loss function can reconstruct an image quite similar to the original one, zooming in reveals that some parts of the image are blurred and fine detail is lost. With the Spherical metric, the reconstructed image is virtually identical to the original. We also used the Structural Similarity Index (SSIM) [29] to assess the similarity between the reconstructed images and the input, computing the average SSIM for both metrics. Overall, Spherical's SSIM is 0.96, while BCE's is 0.89. We therefore concluded that the Spherical metric outperformed BCE.

**Figure 4.** Reconstructed images generated by the RAE model trained with the BCE and Spherical metrics separately. The SSIM index is calculated for each reconstructed image; the Spherical one outperforms the BCE one.
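The RAE optimizer configuration above can be sketched in PyTorch as follows. The tiny convolutional auto-encoder here is a stand-in for the actual RAE architecture (illustrative only); the learning rate and weight decay match the values stated in the text:

```python
import torch
from torch import nn, optim

# Stand-in for the RAE encoder/decoder; the real architecture is deeper.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 3, kernel_size=3, padding=1), nn.Sigmoid(),
)

# Adam with alpha = 1e-4; weight_decay = 1e-5 plays the role of gamma.
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.BCELoss()  # the BCE reconstruction loss compared in the text

x = torch.rand(32, 3, 64, 64)   # one batch of 32 images
loss = criterion(model(x), x)   # auto-encoder target is the input itself
loss.backward()
optimizer.step()
```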

#### 5.3.2. Classification Model

We trained the classification models proposed in Section 4. Stochastic Gradient Descent (SGD) with Nesterov momentum was used to update the parameters, with learning rate *α* = 10^−3, momentum 0.9, and regularization *γ* = 10^−4. To enrich the dataset, we used transformation techniques such as flipping the image (vertically and horizontally) with probability *p* = 0.5, rotating the image by a random angle between −90 and 90 degrees, and random cropping. The classification model was first trained for 100 epochs.

As in Section 4, we used two kinds of pre-trained models to build classifiers: the RAE pre-trained model and ImageNet pre-trained models. The RAE model was trained on the unlabeled dataset, as described above. For the ImageNet pre-trained models, we chose some of the most popular deep learning models from the ILSVRC: AlexNet, VGG, SqueezeNet, DenseNet, and ResNet. All parameters were adjusted during training. The 5K EPP dataset was divided into a training set and a test set with a ratio of 80:20.

After visualizing and computing statistics on the predictions, we realized that the original definition of 13 water-crystal categories was imprecise and gave unclear labeling instructions. The classification model sometimes confused a water crystal with its spatial form: because we classify the plates from 2D images, it is hard to see the difference between a plate with and without space elements. As noted in [30], machine learning should be applied to problems that humans can do well. We therefore modified the definition to eliminate ambiguity in the labeling process. The major change was merging the categories that were confused because of their spatial form. We also found a new category, named double plates. Finally, we arrived at the 12 categories defined in Section 3.

Two fully-connected layers are added on top of the modified models, together with a ReLU activation layer. In addition to regularization, we also used the traditional dropout method to prevent overfitting [31]: a dropout layer is placed after each fully-connected layer except the last one. We fine-tuned the parameters and applied the data transformation techniques mentioned in Section 3 to enrich the dataset.

With the new dataset, the model was trained again with the same configuration and parameters. The new definition yielded a significant performance improvement: the model overcame the overfitting problem and obtained high accuracy in both training and testing.

We used the different pre-trained models as backbones and trained the model with the same parameters and setup conditions. The standard accuracy and *F*1-score were calculated and compared among the models; the results are shown in Table 3. The model trained with the RAE backbone outperforms the models that used AlexNet and SqueezeNet, with an *F*1-score 4% higher than AlexNet and 8% higher than SqueezeNet. Although other pre-trained models (such as VGG and DenseNet) also reach high accuracy, the *F*1-scores of AlexNet and SqueezeNet are much lower, and the loss values of VGG and DenseNet are twice as large as the lowest one (ResNet). ResNet outperforms the other models in both loss value and accuracy, with 98.50% Top-1 accuracy and a 97.25% *F*1-score. We concluded that the ResNet backbone is the best solution for our problem. When evaluating the model on the test set, the ResNet accuracy is approximately 93%.


**Table 3.** Top-1 Accuracy and *F*1-score on 5K EPP training set.

#### 5.3.3. Comparative Model

To demonstrate the effectiveness of our model, we compared it with Hicks' model [6]. Both methods use a ResNet pre-trained model as the backbone and fine-tune all the parameters.

In their study, Hicks et al. implemented classifiers to automatically determine the geometric class and riming degree of the MASC dataset. They used a ResNet pre-trained model to initialize the model parameters and added a new FC layer as the classifier layer. The model outputs probabilities over the 6 distinct snowflake categories defined by Hicks et al. They used two CNN models for distinct tasks: (1) classifying crystal geometry and (2) classifying riming degree. Since our purpose is crystal structure classification, we compared our model with Hicks' first model. We trained Hicks' model on the 5K EPP dataset and used classification accuracy to compare its performance with ours. The results are shown in Figure 5. Even though our accuracy is only slightly higher than Hicks', the training curves show that our model is more stable and converges better than Hicks' model.

**Figure 5.** Our proposed model compared with Hicks' model. Both implementations are trained on the 5K EPP dataset.

#### **6. Conclusions**

Based on the EPP water crystal dataset and prior knowledge about snowflake classification, we proposed a simple water crystal definition that can be used to classify the EPP dataset. We contributed a new data science dataset, called the 5K EPP dataset, with 5007 images split into 13 classes (12 categories + undefined). We proposed a deep learning-based method to automatically classify this dataset. We compared fine-tuning results between the residual auto-encoder model, trained on the unlabelled EPP dataset, and ImageNet pre-trained models, and then selected the best one. With a fine-tuning technique and a ResNet pre-trained model, we obtained a classifier with 93% accuracy. Building on this result, we plan to extend the 5K EPP dataset by applying the water crystal definition to label the 20K EPP water crystal dataset. A new approach that uses unsupervised methods to deal with the unlabeled dataset and discover new groups of water crystal structures will be targeted in further studies.

**Author Contributions:** Conceptualization, H.D.T., F.A. and L.T.Q.; methodology, H.D.T. and L.T.Q.; software, H.D.T.; validation, H.D.T., F.A., L.T.Q., H.E., M.H., K.K. and T.O.; formal analysis, H.D.T. and L.T.Q.; investigation, H.D.T., F.A., K.K., and T.O.; resources, K.K. and T.O.; data curation, K.K., T.O., M.H., F.A. and H.D.T.; writing—original draft preparation, H.D.T., F.A. and L.T.Q.; writing—review and editing, H.D.T., F.A., L.T.Q., H.E., M.H. and K.K.; visualization, H.D.T., F.A., L.T.Q., H.E., M.H. and K.K.; supervision, F.A., L.T.Q., H.E., M.H., K.K. and T.O.; funding acquisition, F.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Institute of Informatics (NII) under the GLO Internship program and Emoto Peace Project, Non-profit Organization.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data will be made available to the data science community.

**Acknowledgments:** We would like to thank the National Institute of Informatics (Tokyo, Japan) for supporting this research and the I.H.M General Research Institute (Tokyo, Japan) for their help with the water crystal classification. We are grateful to D'Orazio and the French Grid5000 program (https://www.grid5000.fr (accessed on 15 October 2020)) for providing the grid infrastructure, advice, and user assistance.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

