Article

Recognition of Geothermal Surface Manifestations: A Comparison of Machine Learning and Deep Learning

1 School of Geography and Tourism, Jiaying University, Meizhou 514015, China
2 Guangdong Provincial Key Laboratory of Conservation and Precision Utilization of Characteristic Agricultural Resources in Mountainous Areas, Jiaying University, Meizhou 514015, China
3 Institute of Deep Earth Sciences and Green Energy, College of Civil and Transportation Engineering, Shenzhen University, Shenzhen 518060, China
4 School of Mathematics, Jiaying University, Meizhou 514015, China
5 The Eighth Geologic Survey, Guangdong Geological Bureau, Meizhou 514089, China
* Author to whom correspondence should be addressed.
Energies 2022, 15(8), 2913; https://doi.org/10.3390/en15082913
Submission received: 16 February 2022 / Revised: 22 March 2022 / Accepted: 13 April 2022 / Published: 15 April 2022

Abstract: Geothermal surface manifestations (GSMs) are direct clues to the hydrothermal activities of a geothermal system in the subsurface and significant indications for geothermal resource exploration. Recognizing the various GSMs is therefore essential for potential geothermal energy exploration. However, little work has been done to fulfill this task using deep learning (DL), which has achieved unprecedented successes in computer vision and image interpretation. This study explores the feasibility of using a DL model to recognize GSMs from photographs. A new image dataset was created for GSM recognition by downloading images from the Internet, preprocessing them, interpreting them visually with expert knowledge, and applying a high-quality check. The dataset consists of seven GSM types, i.e., warm spring, hot spring, geyser, fumarole, mud pot, hydrothermal alteration, and crater lake, plus one type of none GSM, with 500 photographs for each type. The recognition results of the GoogLeNet model were compared with those of three machine learning (ML) algorithms, i.e., Support Vector Machine (SVM), Decision Tree (DT), and K-Nearest Neighbor (KNN), using the assessment metrics of overall accuracy (OA), overall F1 score (OF), and computational time (CT) for training and testing the models via cross-validation. The results show that the GoogLeNet model retrained via transfer learning has significant advantages in accuracy and performance over the three ML classifiers, with the highest OA, the highest OF, and the shortest CT for both validation and test. Correspondingly, the three selected ML classifiers perform poorly on this task owing to their low OA, small OF, and long CT. This suggests that transfer learning with a pretrained network is a feasible method for recognizing GSMs. Hopefully, this study provides a reference paradigm to help promote further research on the application of state-of-the-art DL in the geothermics domain.

1. Introduction

The world is confronting three great challenges, namely overpopulation, resource depletion, and environmental deterioration, which intertwine with each other and significantly influence the world's sustainable development. Geothermal energy, as an alternative resource for the 21st century [1], is a green, clean, efficient, renewable, and non-carbon-based new energy. Together with other new energies, such as solar, wind, and biomass, it can play an essential role in saving energy and reducing emissions, coping with energy shortages and environmental and climate change, and realizing green and sustainable development [2,3,4,5]. However, the development of geothermal resources has progressed slowly, although the proven long-term sustainability of geothermal energy has remained one of the attractions for further exploration and exploitation [6]. Geothermal resource exploration is the basis for development and utilization, and it first requires geothermal potential mapping and evaluation. The surface manifestations of a geothermal system in a volcanic-geothermal area are generally the features that first stimulate mapping and exploration [1]. These manifestations, such as fumaroles, warm and hot springs, and mud pots, can give important hints to the availability and abundance of geothermal resources [7] and reveal their exploitation potential [8]. Geothermal surface manifestations (GSMs) are thus crucial for preliminary surveys of geothermal potential, and recognizing these features and manifestations is essential for geothermal potential mapping and evaluation.
Many GSMs, especially warm and hot springs, geysers, and fumaroles, have drawn considerable attention from the geothermal community in recent decades. Many researchers have investigated geothermal anomalies related to GSMs such as hot springs using geophysics [9,10,11], geochemistry [12,13,14], remote sensing [15,16,17,18,19], geographic information systems (GIS) [20], statistical modeling [21], and conventional machine learning (ML) [6,22,23]. For example, Gentana et al. (2019) demonstrated that the fault system is correlated with the appearances of the GSMs in the Indonesian volcanic zone [24]; Freski et al. (2021) tested the effects of alteration degree, moisture, and temperature on laser return intensity for the GSMs [25]. Most of these works revealed the distribution and formation of geothermal resources by integrating multi-source data with traditional approaches. Compared with earlier ML models, the artificial neural network (ANN) is attracting more and more attention nowadays [6]. Dramsch (2020) gave an overview of the development of ML in geoscience over the past 70 years, with an emphasis on technical explanations of some popularly used ML models [26]. Muther et al. (2022) outlined the integration of artificial intelligence (AI) technology with geothermal reservoir characterization and management; discussed its potentials, limitations, and ways forward; compared different statistical, numerical, and AI/ML methods; and put forward the concept of Geothermal 4.0, which stemmed from the concept of Industry 4.0 [6].
More recently, deep learning (DL), which has achieved unprecedented successes in computer vision and image interpretation in the last decade, has seen emerging use in the geothermics domain. Gangwani et al. (2021) provided an approach for predicting geothermal energy production using a long short-term memory sequence-to-sequence encoder-decoder neural network architecture [27]. Shahdi et al. (2021) explored the applicability of four ML models (i.e., deep neural network, ridge regression, extreme gradient boosting, and random forest) for predicting subsurface temperatures in the northeastern United States using bottom-hole temperature data and geological information from 20,750 wells [28]. Yang et al. (2022) proposed an innovative method for identifying the formation temperature field based on a deep belief network and successfully applied the technique to the southern Songliao Basin, northeast China [29]. Beyond geothermics, DL has also found its way into the geophysical domain and more. For example, Petrov et al. (2022) investigated shape carving methods of geologic body interpretation from seismic data based on DL and found that the dilated fully convolutional network was suitable for the task of seismic data interpretation. However, it remains unclear whether DL is feasible for the recognition of GSMs. Inspired by this latest progress in geothermics and geophysics, we hypothesized that DL technology could realize the recognition of GSMs.
To fulfill the recognition task of GSMs, there are two key challenges: (i) the lack of a suitable GSM dataset for the task and (ii) how to select an optimal DL model from a great number of deep neural network architectures, such as convolutional neural networks (CNNs), deep belief networks [29], recurrent neural networks, and generative adversarial networks, to train and test on this dataset. Therefore, it is necessary to create a GSM dataset first and then investigate the application of a selected DL model to the recognition of GSMs to verify our hypothesis. In this study, we compared the accuracy and performance metrics of one DL model, GoogLeNet, with those of three traditional ML algorithms, i.e., Support Vector Machine (SVM), Decision Tree (DT), and K-Nearest Neighbor (KNN). The aim of this comparison is to explore the feasibility of these four models and find the best one. More specifically, we further compared different training strategies for GoogLeNet to obtain an optimal one that fulfills the GSM recognition task with better performance. This study is intended to provide reference information to help promote further research on the application of state-of-the-art DL in the geothermics domain.
The main contributions of the present study consist of the following four aspects.
  • A novel dataset for recognizing the GSMs, called the JiaYing University Geothermal Surface Manifestation (JYU-GSM) dataset, was manually created by visual interpretation with expert knowledge and a high-quality check.
  • This is the first attempt in the geothermics and AI domains to compare the application of DL and ML models to the recognition of GSMs.
  • A retrained DL model for the recognition of the GSMs, namely the GoogLeNet deep neural network model (GSM-Net), was obtained through transfer learning and fine-tuning of GoogLeNet.
  • A pretrained GoogLeNet model is found to be highly feasible for fulfilling the task of recognizing GSMs.
The rest of this paper is organized as follows. The materials and methods used are introduced in Section 2. In Section 3, the results are presented and analyzed. In Section 4, several influencing factors of DL accuracy are discussed, followed by the analyses of limitations and future work. The paper concludes with a summary in Section 5.
The overall workflow chart for the present study is shown in Figure 1, indicating our main research ideas and framework.

2. Materials and Methods

This section first briefly introduces the concept and classification of GSMs. Then the workflow of data preparation, preprocessing, and dataset creation; the three ML models (SVM, DT, and KNN) and one DL model (GoogLeNet) used; the assessment metrics applied; and the implementation details of the present experiments are presented in turn.

2.1. Geothermal Surface Manifestation

GSMs, also known as geothermal leakage indications, are the thermal activities influenced by abnormal subsurface temperatures and exposed at the Earth's surface. Depending on the reservoir temperatures and discharge rates, these surface manifestations take the forms of warm springs, hot springs, hot-water rivers/lakes/ponds, boiling springs, seeps, fumaroles, geysers, warm/steaming grounds, mud pots/volcanoes, hydrothermal explosions, phreatic explosion craters, zones of acid alteration, volcanic lakes, and so on [24,30,31]. In addition, there are some deposits of silica sinter, travertine, and/or the bedded breccias that surround phreatic craters [31]. The GSMs are crucial for the preliminary survey to determine the geothermal potential of a geothermal field because they provide important information about thermal propagation from the subsurface of the Earth.
According to the concept stated above, seven GSM types, namely warm spring, hot spring, geyser, fumarole, mud pot, hydrothermal alteration, and crater lake, as well as one type of none GSM, were determined as the task's classification categories on the basis of expert knowledge and the visible form and identifiability of the features in GSM photographs. Many photographs were taken in Yellowstone National Park in the USA; Fuji Mountain in Japan; the Tengchong, Changbaishan, and Tibet volcanic-geothermal areas in China; and other famous geothermal areas. Tens of thousands of photographs were downloaded and processed manually as samples for the recognition of GSMs. Figure 2 shows example images of each GSM type with different sizes. The main features of the seven types of GSMs and one type of none GSM used in the present study are characterized as follows.
  • Warm Spring (WS): This refers to a geothermal water outcrop whose temperature at the spring mouth is significantly higher than the local annual average temperature but no higher than 45 °C. The temperature of a spring cannot be directly observed from a photograph. Consequently, the warm spring type is treated as seeps or warm-water pools/ponds/lakes/rivers, usually without steam emitting from the water surface when viewed in a photograph.
  • Hot Spring (HS): Theoretically, a hot spring refers to a geothermal water outcrop whose temperature at the spring mouth is higher than 45 °C and lower than the boiling point of the local surface water. Hot springs are the most visible manifestation of hot-water geothermal systems that transfer heat to the ground surface, from which the reservoir type can be predicted hypothetically [32]. In the present study, hot springs also include boiling springs and hydrothermal explosions. Whether the water surface emits steam or not is regarded as the sign distinguishing hot springs from warm springs: a hot spring usually emits steam from the water surface, while a warm spring often does not. In fact, a warm spring will also emit steam in the cold season, which could confuse the ML models.
  • Geyser (GE): A geyser is generally a hole within a cone on the Earth's surface from which hot water and steam are forced out, usually at irregular intervals. A geyser is an obvious indicator of a water-dominated reservoir. In the present study, this type usually refers to a geyser spraying water and steam like a fountain; otherwise, it is regarded as the mud pot type or the hydrothermal alteration type, depending on whether there is mud in the pot.
  • Fumarole (FU): A fumarole refers to an opening in or near a volcano or the ground surface through which hot gases escape. It is an evident sign of a high-temperature geothermal field. This type includes holes, craters, or grounds spraying gases or smoke but no steam or water. Sometimes it is difficult to distinguish between gas and steam visually in a photograph, which makes fumaroles and geysers easy to confuse and difficult to classify manually.
  • Mud Pot (MP): A mud pot depicts mud with popping bubbles caused by trapped gases such as carbon dioxide (CO2). This type of GSM includes mud pots, mud pools, and mud volcanoes.
  • Hydrothermal Alteration (HA): An alteration rock, zone, or deposit is the surface manifestation of contact between rocks and geothermal fluid. Hydrothermally altered rocks, phreatic explosion craters, and zones of acid alteration, as well as deposits of silica sinter, travertine, and bedded breccias, are included in this type.
  • Crater Lake (CL): A crater lake refers to a lake formed in the crater of a volcano long after its eruption in a volcanic area, which differs from other lakes. Its temperature is below the local annual average, unlike that of warm lakes/ponds. Photographs of many famous crater lakes were downloaded and processed as samples of this type.
  • None GSM (NG): This type comprises negative samples for the recognition of the GSMs, which can improve the robustness of a classifier model. It contains many kinds of photographs other than GSMs, such as ordinary lakes, mountains, rivers, animals, plants, fountains, clouds, sky, and smoke from thermal power plants. Some of them may resemble a certain GSM, which increases the uncertainty of recognizing the GSMs but improves the robustness of a model.

2.2. Data Preparation, Preprocessing, and Dataset Creation

It is well known that a high-quality dataset is the prerequisite for achieving good performance in the DL domain. Hence, we elaborately created a novel high-quality image dataset for this GSM recognition task by hand, with expert knowledge and visual interpretation. Figure 3 shows the workflow chart of data acquisition and preprocessing. GSM-related keywords (i.e., warm spring, hot spring, geyser, fumarole, mud pot, hydrothermal alteration, crater lake, geothermal, and geothermal surface manifestation) were used to search for the relevant types of photographs on the Internet. More than 10,000 photographs were manually or automatically downloaded from the Baidu Image Search Engine (https://image.baidu.com, accessed from 1 January to 2 February 2022), the Microsoft Bing Search Engine (https://en.bing.com, accessed from 15 January to 8 February 2022), and a great number of tourism service websites and tourists' blogs with free access.
The duplicated photographs and the photographs that do not belong to any type of GSM (false images) were then manually removed, one by one, over many rounds. The persons in some photographs (mostly of the warm spring type) were manually masked as much as possible. To avoid an uneven distribution of samples across types, 500 images of each type were retained, giving 4,000 photographs in the dataset in total. Afterward, all 4,000 images were converted from their various formats (e.g., PNG, TIFF, JFIF, GIF, WEBP) to JPEG, the most widely used image file format. They were then resized so that the longer side equals 448 pixels while keeping the aspect ratio unchanged; that is, if the longer side of a photograph exceeds 448 pixels, it is reduced to 448 pixels, and otherwise it is enlarged to 448 pixels. These operations help fit the 224 × 224 × 3 input size of the GoogLeNet DL model. As a result, the image sizes of the eight types were still distributed unevenly, spanning a wide range of widths and heights, as shown in Table 1.
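As an illustration, the resizing step described above might be reproduced in MATLAB as follows; this is our reconstruction, not the authors' published script, and the folder names are hypothetical:

```matlab
% Minimal sketch of the resizing step (our reconstruction, not the authors'
% script): scale the longer image side to 448 pixels while keeping the
% aspect ratio, then save as JPEG. Folder names are hypothetical.
files = dir(fullfile('JYU-GSM_raw', '**', '*.jpg'));
for k = 1:numel(files)
    img = imread(fullfile(files(k).folder, files(k).name));
    [h, w, ~] = size(img);
    img = imresize(img, 448 / max(h, w));   % longer side -> 448 px, ratio kept
    imwrite(img, fullfile('JYU-GSM', files(k).name), 'Quality', 95);
end
```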
Lastly, the preprocessed images were manually labeled into the eight types based on expert knowledge, with a high-quality check. A novel GSM image dataset, called the JiaYing University Geothermal Surface Manifestation photograph dataset (the JYU-GSM dataset for short), was thus established. In order to evaluate the accuracy and performance of the DL model obtained by retraining GoogLeNet, the JYU-GSM dataset was divided into three subsets, i.e., training, testing, and validation subsets, according to a ratio of 0.8:0.1:0.1, 0.6:0.2:0.2, or 0.4:0.3:0.3, ratios commonly used in DL data preprocessing. The three ratios were applied to analyze the effect of data division on the performance of the GoogLeNet model. The ratio of 0.8:0.1:0.1 was mainly applied to split the dataset for the accuracy and performance comparison of the DL and ML models used.
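In MATLAB, a label-preserving split of this kind can be sketched with an imageDatastore whose labels come from folder names; the 0.8:0.1:0.1 case is shown below under the assumption of one subfolder per class:

```matlab
% Sketch of the label-preserving 0.8:0.1:0.1 split (assumed folder layout:
% one subfolder per class, e.g., JYU-GSM/WS, JYU-GSM/HS, ...).
imds = imageDatastore('JYU-GSM', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsTest, imdsVal] = ...
    splitEachLabel(imds, 0.8, 0.1, 0.1, 'randomized');
countEachLabel(imdsTrain)   % expect 400 images per 500-image class
```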

2.3. Machine Learning Models

ML is the science of getting computers to act without being explicitly programmed. In recent decades, ML has proven to be a powerful tool for deriving insights from data [33,34]. Thanks to the many practical algorithms developed, it has been applied successfully in self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Traditional ML models mainly include KNN, DT, SVM, ANN, random forest, extreme gradient boosting, and the naïve Bayesian network. ML is so pervasive that it is believed to be the best way to make progress towards human-level AI.
More recently, the development of DL has yielded further performance improvements thanks to its capacity to extract a variety of features from large datasets. DL, as a new component of ML, has become the most promising direction of ML because of its excellent performance in many challenging tasks such as image recognition and detection, text classification, and natural language processing. It should be noted that detailed explanations of the theories and algorithms of ML and DL, which have been expounded in the literature on computer vision and pattern recognition, are beyond the purpose and scope of this paper. We empirically selected three ML models, i.e., KNN, DT, and SVM, briefly introduced below, as a comparative study against one DL model, GoogLeNet.
Feature extraction, a critical feature engineering step, is crucial for conventional ML algorithms, unlike DL models. The Histogram of Oriented Gradients (HOG) is a commonly used feature descriptor for object recognition and detection in computer vision and image processing. It forms the feature by calculating and counting the gradient direction histogram of a local area of the image. In the present study, the HOG feature combined with the SVM, DT, and KNN classifiers was used to perform the recognition of the GSMs, as shown in Figure 1.
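A minimal sketch of this HOG stage in MATLAB is given below; the 224 × 224 input size and the [16 16] cell size are our assumptions, since the paper does not report these values:

```matlab
% Sketch of the HOG feature stage for the ML classifiers (the 224 x 224
% input size and [16 16] cell size are assumptions, not reported values).
numImages = numel(imdsTrain.Files);
for k = 1:numImages
    img = imresize(readimage(imdsTrain, k), [224 224]);  % fixed-length feature
    if size(img, 3) == 3
        img = rgb2gray(img);                             % HOG uses intensity
    end
    feat = extractHOGFeatures(img, 'CellSize', [16 16]);
    if k == 1
        X = zeros(numImages, numel(feat));               % preallocate
    end
    X(k, :) = feat;
end
Y = imdsTrain.Labels;                                    % labels from folders
```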

2.3.1. Support Vector Machine

The SVM is a generalized linear classifier that classifies data via supervised learning. The SVM uses the hinge loss function to calculate the empirical risk and adds a regularization term to the solution system to optimize the structural risk. It is a sparse and robust classifier. An SVM can perform nonlinear classification through the kernel method, one of the common kernel learning methods. The SVM has been applied to pattern recognition problems in various fields, including image classification [35,36,37], building detection [38], surface-wave separation [39], face recognition [40], cancer recognition [41], etc.
The standard SVM is an algorithm for the binary classification problem and cannot directly deal with multiple-class problems. Based on the calculation process of the standard SVM, multiple decision boundaries are constructed in turn to realize the multiple classification of samples. The usual implementations are one-versus-rest and one-versus-one. The one-versus-rest SVM establishes m decision boundaries for m classes, and each decision boundary determines the attribution of one class against all the others; all decision boundaries can be calculated in one iteration by modifying the optimization problem of the standard SVM. The one-versus-one SVM is a voting method that establishes a decision boundary for every pair of the m classes; that is, there are m(m − 1)/2 decision boundaries in total, so for the eight classes of the present task this means 8 × 7/2 = 28 binary classifiers. The sample category is selected as the class with the highest score across the discrimination results of all decision boundaries. Since a detailed analysis of the theory of SVM is beyond the scope of this paper, we refer the reader to [40,42] for more detail. In the present study, the one-versus-one SVM was used as an ML model for the recognition of the eight types of GSMs.

2.3.2. Decision Tree

The DT is a decision analysis method based on the known probabilities of occurrence of various situations. It is a graphical method that intuitively uses probability analysis to calculate the probability that the expected value of the net present value is no less than zero, evaluate project risk, and judge feasibility. Because the decision branches are drawn as a graph resembling the branches of a tree, the method is called a decision tree. The DT is a prediction model representing a mapping relationship between object attributes and values. Entropy, a measure of the disorder of a system, is used in the ID3, C4.5, and C5.0 algorithms; this measure is based on the concept of entropy in information theory.
The DT is a tree structure in which each internal node represents a test on an attribute, each branch represents a test output, and each leaf node represents a category. It is a supervised learning method used widely in remote sensing classification [43,44,45] and beyond. In supervised learning, given a set of samples, each sample has a set of attributes and a category; the categories are determined in advance, a classifier is then obtained through learning, and this classifier can predict the correct classification of new objects.

2.3.3. K-Nearest Neighbor

The KNN is a theoretically mature statistics-based method and one of the simplest supervised learning algorithms. Its idea can be expressed as follows: in the feature space, if most of the k nearest samples to a sample belong to a certain category, the sample also belongs to that category. As the most basic classifier in ML, the KNN can be used for both binary and multiple classification. It can be used not only for classification but also for regression: by finding the k nearest neighbors of a sample, the average value of the attributes of these neighbors is assigned to the sample as the predicted value. The algorithm involves three main factors: the training set, the measurement of distance and similarity, and the size of k. The second factor, i.e., distance and similarity, is the primary consideration when using KNN.
Although the KNN has been popularly used in remote sensing classification [35] and beyond, it has two disadvantages: (i) it requires too much computation and is time- and memory-consuming; (ii) if the samples are unbalanced, for example, when some labels are far more numerous than others, the prediction of the labels to be tested will be greatly affected during voting, and the error rate will increase.

2.4. Deep Learning Model: GoogLeNet

DL allows computational models composed of multiple processing layers to learn data representations with multiple levels of abstraction [46]. As a subset of ML and the core of AI, DL methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection, and many other domains such as drug discovery and genomics [46]. For this reason, a DL model, GoogLeNet, was empirically selected for the present study. We evaluated the performance of the model for the GSM recognition task in comparison with the three ML algorithms, SVM, DT, and KNN, as shown in Figure 1. Rather than elaborating the principles of CNNs, only GoogLeNet is described below for brevity.
GoogLeNet, the winner of ILSVRC 2014, is a CNN that is 22 layers deep [47]. Table 2 shows the architecture of GoogLeNet [47]. More specifically, GoogLeNet possesses roughly 6.8 million parameters, with nine inception modules, two convolutional layers, one convolutional layer for dimension reduction, two normalization layers, four max-pooling layers, one average-pooling layer, one fully connected layer, and a linear layer with Softmax activation at the output. Each inception module, in turn, contains two convolutional layers, four convolutional layers for dimension reduction, and one max-pooling layer (Table 2). GoogLeNet also uses dropout regularization in the fully connected layer and applies the ReLU activation function in all the convolutional layers. To avoid vanishing gradients, two Softmax losses connected to two auxiliary classifiers are added in the intermediate layers of the GoogLeNet architecture, so there are three losses in total. During training, the loss of the whole network is obtained by a weighted addition of the three losses, while at inference time the two intermediate losses are discarded.
A GoogLeNet network pretrained on ImageNet was retrained on the JYU-GSM dataset. The network trained on ImageNet1000 classifies images into 1,000 object categories, such as keyboard, mouse, pencil, animal, geyser, fountain, lakeside, and lakeshore, some of which are semantically similar to the GSM categories but strictly not the same. The ImageNet1000 dataset spans 1,000 object classes and contains 1,281,167 training images, 50,000 validation images, and 100,000 test images [48]. The network has learned different feature representations for a wide range of images but not for the specific GSMs, so it cannot recognize most of the GSMs correctly. The network has an image input size of 224-by-224 in RGB format; the input images were automatically resized to this size during training, and the testing and validation images were resized likewise. We retrained the pretrained GoogLeNet on the JYU-GSM dataset to perform the recognition task using transfer learning. The weights of the first ten layers of the pretrained network were frozen for faster training, and the output class number of the fully connected layer was changed from one thousand to eight before training. In the end, a new network fitting the JYU-GSM dataset, based on GoogLeNet, was obtained. We name it GSM-Net for short, which could be one of the contributions of the present study to the interdisciplinary fields of geothermics and DL.
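The head replacement and layer freezing described above can be sketched in MATLAB as follows; 'loss3-classifier' and 'output' are the head-layer names of MATLAB's pretrained GoogLeNet, and the boosted learning-rate factors of the new layer are a common transfer-learning heuristic assumed here, not a setting reported by the paper:

```matlab
% Sketch of the transfer-learning surgery: swap the 1000-class head of the
% pretrained GoogLeNet for an 8-class head and freeze the first 10 layers.
net = googlenet;                          % pretrained on ImageNet1000
lgraph = layerGraph(net);
newFC = fullyConnectedLayer(8, 'Name', 'fc_gsm', ...
    'WeightLearnRateFactor', 10, 'BiasLearnRateFactor', 10);  % heuristic
lgraph = replaceLayer(lgraph, 'loss3-classifier', newFC);
lgraph = replaceLayer(lgraph, 'output', classificationLayer('Name', 'out_gsm'));

% Freeze the initial 10 layers by zeroing their learning-rate factors.
layers = lgraph.Layers;
for k = 1:10
    if isprop(layers(k), 'WeightLearnRateFactor')
        frozen = layers(k);
        frozen.WeightLearnRateFactor = 0;
        frozen.BiasLearnRateFactor = 0;
        lgraph = replaceLayer(lgraph, frozen.Name, frozen);
    end
end
```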

2.5. Assessment Metrics

In the AI domain, many assessment metrics are used to evaluate the accuracy and performance of a classifier model. Among these, the confusion matrix, precision, recall, F1 score, overall accuracy, and computational time are popularly adopted in scientific research and industrial applications because they greatly help in understanding the performance of an AI model. These metrics are also commonly used for the classification assessment of remote sensing imagery.

2.5.1. Confusion Matrix

The confusion matrix, also known as a possibility table or error matrix, is a standard format for accuracy evaluation, expressed as a matrix with n rows and n columns. Each column represents the predicted category, and each row represents the actual category. The name comes from the fact that it easily indicates whether multiple classes are confused (that is, one class is predicted to be another class). It is a specific matrix used to visualize algorithm performance, usually in supervised learning.
Table 3 shows the confusion matrix for a classic binary-classification example. The confusion matrix for multiple classifications is similar to Table 3; the present study is a multiple-classification example.
According to Table 3, the precision (P) and recall (R) metrics are defined as Equations (1) and (2), respectively. As shown in Equation (1), the precision denotes the proportion of actual positive samples among all the results predicted as positive. As shown in Equation (2), the recall denotes the proportion of the samples predicted as positive by the classifier to the actual number of positive samples; it is also known as sensitivity, describing the sensitivity of the classifier to the positive category.
P = TP/(TP + FP), (1)
R = TP/(TP + FN), (2)
where P and R denote the precision and recall, respectively; TP, FP, and FN indicate the same meanings as in Table 3.
As for the present multiple-classification task, the overall precision (OP) and overall recall (OR) metrics were used to compare the accuracy of the four classifiers; their computation is shown in Equations (3) and (4), respectively.
OP = (Σ_{i=1}^{n} P_i)/n, (3)
OR = (Σ_{i=1}^{n} R_i)/n, (4)
where OP means the overall precision and OR means the overall recall; Pi and Ri denote the precision and recall of the ith type of the GSMs, respectively; and n is the total number of classes, which equals eight in the present study.

2.5.2. Accuracy and Overall Accuracy

As shown in Equation (5), the accuracy metric denotes the proportion of all correctly predicted samples of a classifier model among the total samples. For the multiple-classification task, the overall accuracy (OA) metric was used to compare the four classifiers' performance; it mathematically equals the average accuracy expressed as a percentage, as shown in Equation (6).
A = (TP + TN)/(TP + TN + FP + FN), (5)
OA = ((Σ_{i=1}^{n} A_i)/n) × 100%, (6)
where A denotes the accuracy; TP, TN, FP, and FN have the same meanings as in Table 3; OA means the overall accuracy; Ai denotes the accuracy of the ith type of the GSMs; and n is the total number of classes, which equals eight in the present study.

2.5.3. F1 Score and Overall F1 Score

The F1 score is the harmonic mean of precision and recall, taking both metrics into account, as shown in Equation (7). The overall F1 score (OF) is used in the GSM recognition task to evaluate the overall accuracy performance of the different classifier models; it mathematically equals the average of the F1 scores, as shown in Equation (8).
F1 = 2 × P × R/(P + R), (7)
OF = (Σ_{i=1}^{n} F1_i)/n, (8)
where F1 denotes the F1 score; P and R denote the precision and recall, respectively; OF means the overall F1 score; F1i denotes the F1 score of the ith type of the GSMs; and n is the total number of classes, which equals eight in the present study.
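Given vectors of true and predicted labels, Equations (1)–(8) can all be derived from the confusion matrix; a minimal MATLAB sketch (variable names are ours) is:

```matlab
% Sketch of Equations (1)-(8) computed from a confusion matrix
% (YTrue/YPred are categorical label vectors; names are ours).
C = confusionmat(YTrue, YPred);        % rows: actual, columns: predicted
TPi = diag(C);                         % true positives per class
P = TPi ./ sum(C, 1)';                 % Equation (1), per class
R = TPi ./ sum(C, 2);                  % Equation (2), per class
F1 = 2 .* P .* R ./ (P + R);           % Equation (7), per class
OP = mean(P);                          % Equation (3)
OR = mean(R);                          % Equation (4)
OF = mean(F1);                         % Equation (8)
OA = sum(TPi) / sum(C(:)) * 100;       % proportion correct, in percent
```

For the balanced subsets used here (equal numbers of samples per class), the proportion of correctly predicted samples coincides with the mean per-class recall.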

2.5.4. ROC and AUC

ROC is short for receiver operating characteristic. The ROC curve, also called the sensitivity curve, is the main analysis tool and is drawn on a two-dimensional plane: the x-coordinate is the false positive rate, and the y-coordinate is the true positive rate. It visually represents the performance of a classifier algorithm; the steeper the curve, the better the performance of the algorithm.
AUC is short for the area under the curve, a single number summarizing the performance of a classifier algorithm. The closer the AUC is to 1, the better the algorithm's performance, and vice versa. An AUC value of 0.5 indicates that the model predicts at random, and an AUC value of 1 indicates that the model is entirely consistent with the actual situation. In general, AUC values between 0.7 and 0.9 indicate good accuracy and authenticity, whereas AUC values greater than 0.9 indicate high accuracy. The ROC, together with the AUC, OA, and OF, was adopted to discriminate and compare the performance of the retrained network models from different aspects.
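For any one class treated as the positive class, the ROC curve and AUC can be computed in MATLAB with perfcurve from the classifier's per-class scores; the sketch below uses the geyser class as an illustrative positive class:

```matlab
% Sketch of a one-vs-rest ROC/AUC for one class ('GE' is illustrative).
% `scores` is the N x 8 score matrix returned by classify/predict, with
% columns ordered as in categories(YTrue).
posClass = 'GE';
classNames = categories(YTrue);
col = find(strcmp(classNames, posClass));
[fpr, tpr, ~, auc] = perfcurve(YTrue, scores(:, col), posClass);
plot(fpr, tpr); grid on;
xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC for %s (AUC = %.4f)', posClass, auc));
```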

2.5.5. Computational Time

The computational time (CT) is a key performance indicator for DL and ML. It may denote the training, testing, or validation time, or even the sum of two or all three processes. The CT is mainly influenced by the hardware; the architecture and size of a model; the input size of an image for the model; the total number of images (size of dataset) for training, testing, and validation; and the training strategy and parameter options, etc. When the same number of images is used to train, test, and validate in the same environment, the shorter the computational time, the better the classifier model. The CT was automatically recorded when carrying out model training, testing, and validation in MATLAB. The CT can be converted to frames per second (the number of images predicted divided by the CT), an indicator of the speed of image recognition.
The confusion matrix was calculated and plotted using MATLAB, and the precision, recall, accuracy, and F1 score were computed simultaneously. Based on the confusion matrix, the OA, OP, OR, and OF metrics were calculated as in Equations (3), (4), (6), and (8), respectively; the ROC curve was plotted; and the AUC was then calculated. The OA was generally converted to a percentage, as computed in Equation (6). These accuracy metrics reflect the accuracy of image classification from different aspects. It is worth noting that all the assessment metrics vary from 0 to 1 except the OA metric, which varies from 0% to 100%. The closer a metric is to 1, the better the model, and vice versa.
The precision, recall, accuracy, and F1 score metrics were used to evaluate a classifier model's performance in predicting each single type of the GSMs. Meanwhile, the overall accuracy, overall F1 score, and computational time were used to compare the performance of the different classifier models in general. In addition, the ROC curve and AUC were also used to visually discriminate the performance of the models used.

2.6. Implementation Details

2.6.1. Computational Environment

The hardware and software environment for the computation is shown below; it is equipped with a good GPU to accelerate the computation. The same computational environment was used for training the GoogLeNet, SVM, DT, and KNN models.
  • Operating system: Microsoft Windows 10 education version;
  • CPU: Intel(R) Core (TM) i7-7700K, four cores;
  • RAM: Kingston 16 GB × 3;
  • GPU: NVIDIA GeForce RTX 2080 Ti, 11 GB GDDR6;
  • Software: MathWorks MATLAB® 2021a (9.10).

2.6.2. Setup of GoogLeNet

The main experiment parameters for training the DL model (GoogLeNet) are listed below; they are the results of fine-tuning. A MATLAB sketch of this configuration is given after the list and its explanation.
  • Optimizer: SGDM;
  • MaxEpochs: 6;
  • Shuffle: every-epoch;
  • MiniBatchSize: 16;
  • InitialLearnRate: 5 × 10⁻⁴;
  • ValidationFrequency: 100;
  • LearnRateSchedule: piecewise;
  • LearnRateDropFactor: 0.2;
  • LearnRateDropPeriod: 2;
  • ValidationPatience: 3;
  • L2Regularization: 0.0005;
  • Momentum: 0.95;
  • ExecutionEnvironment: GPU.
The training dataset for the accuracy and strategy comparisons consists of 3200 images in total, with 400 images for each type of GSM. The maximum number of epochs was set to six, with shuffling every epoch; in a trial with 20 epochs, training stopped early (see Section 4.3). The number of input images per mini-batch was set to 16, that is, 200 iterations per epoch. The initial learning rate was set to 0.0005, and a dynamic learning-rate mechanism was adopted to update the learnable parameters (i.e., weights and biases) using stochastic gradient descent with momentum (SGDM), with L2 regularization and a dropout policy to avoid overfitting. This training policy is commonly used to obtain better model performance in the AI domain. A GPU was used to train the networks faster. An example of the training progress of the GoogLeNet model is shown in Figure 4, indicating a little overfitting during training because the curve of validation accuracy lies almost entirely under the training curve.
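These hyperparameters map directly onto MATLAB's trainingOptions; the configuration can be sketched as below, where the datastore variables follow the earlier splitting sketch and are assumptions:

```matlab
% Sketch of the listed training configuration (datastore variables follow
% the earlier splitting sketch and are assumptions).
augimdsTrain = augmentedImageDatastore([224 224], imdsTrain);
augimdsVal   = augmentedImageDatastore([224 224], imdsVal);
options = trainingOptions('sgdm', ...
    'MaxEpochs', 6, ...
    'Shuffle', 'every-epoch', ...
    'MiniBatchSize', 16, ...
    'InitialLearnRate', 5e-4, ...
    'ValidationData', augimdsVal, ...
    'ValidationFrequency', 100, ...
    'LearnRateSchedule', 'piecewise', ...
    'LearnRateDropFactor', 0.2, ...
    'LearnRateDropPeriod', 2, ...
    'ValidationPatience', 3, ...
    'L2Regularization', 5e-4, ...
    'Momentum', 0.95, ...
    'ExecutionEnvironment', 'gpu');
gsmNet = trainNetwork(augimdsTrain, lgraph, options);   % lgraph from Section 2.4
```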

2.6.3. Setup of SVM, DT, and KNN

We used the default setup values for training the three traditional ML models (SVM, DT, and KNN); they were not optimized in the present study (Table 4). All three models were trained with the Error-Correcting Output Codes (ECOC) method using the one-versus-one coding design. The SVM model used the hinge loss, and the other two models both used the quadratic loss to realize the recognition task of GSMs. A sketch of this setup follows.
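In MATLAB, this ECOC setup can be sketched as follows, where X and Y are the HOG features and labels from the earlier sketch and the learner templates are left at their defaults:

```matlab
% Sketch of the ECOC training setup with one-versus-one coding
% (X/Y are the HOG features and labels from the earlier sketch).
svmModel = fitcecoc(X, Y, 'Learners', templateSVM(), 'Coding', 'onevsone');
dtModel  = fitcecoc(X, Y, 'Learners', templateTree(), 'Coding', 'onevsone');
knnModel = fitcecoc(X, Y, 'Learners', templateKNN(), 'Coding', 'onevsone');

% Prediction on the HOG features of a test set (XTest assumed):
YPredSVM = predict(svmModel, XTest);
```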

3. Results

In this section, we compare the accuracy and performance metrics of the trained DL and ML models to evaluate their feasibility for the task of recognizing and classifying GSMs. First, visual analyses of the test image predictions were made to check the performance intuitively. Second, the accuracy metrics were computed and compared for the DL and ML models, as well as for the two DL strategies. Third, the computational time for training, testing, and validation was recorded and analyzed to compare the performance of the models. Last, the ROC curves were drawn to check and compare the performance comprehensively.

3.1. Visual Comparison and Analysis

A visual comparison was conducted to analyze the four classifier models' performance by presenting the testing results of two randomly selected images of each type. Sixteen images, two for each of the eight categories, were randomly selected from the test subset to conduct a visual analysis of the four classifier models. Figure 5 shows one of multiple random test runs, which all show similar patterns. It can be clearly seen that 15 of the 16 images were correctly predicted by the GoogLeNet model, all with very high probabilities (the exception being image No. 2, predicted as HS instead of WS). In contrast, only nine images were correctly predicted by the SVM model, and most predictions were mistaken with the other two models. The test OA for the four models is calculated as 93.75% (15/16), 56.25% (9/16), 25.00% (4/16), and 12.50% (2/16), respectively, indicating a possible advantage of the GoogLeNet model over the three ML models, SVM, DT, and KNN.
In these cases, image No. 2 was predicted incorrectly by all four models. It is a photograph showing a warm spring with some persons and plenty of rising steam. The steam or vapor, which mainly characterizes the hot spring type, is probably the reason for this false prediction. Moreover, it is interesting that the GoogLeNet model could correctly predict image No. 14, nearly one-third of which is masked by a stone tablet engraved with the words Tian Chi (Heaven Lake). The GoogLeNet model can also accurately recognize geysers spraying water or vapor and classify fumaroles emitting gases or fumes, which resemble vapor to some extent. These results indicate that the GoogLeNet model has probably learned some significantly different characteristics of the various types of GSMs and shows good robustness and generalization.

3.2. Comparison of DL and ML Accuracies

The assessment results for the accuracies of the trained DL and ML models are shown in Table 5. It is evident that the model retrained from GoogLeNet by transfer learning (the GoogLeNet model, for short) has a significant advantage in both overall accuracy and overall F1 score over the three ML models, SVM, DT, and KNN, for both validation and test.
The GoogLeNet model has the highest overall accuracy of 91.25% and the highest overall F1 score of 0.91 on the validation subset among the four classifiers used, followed by the SVM model with an OA of 53.50% and an OF of 0.53, also on the validation subset. In contrast, the KNN model has the smallest OA and OF, only 20.25% and 0.17, respectively, on both the validation and test subsets. For the test subset, the GoogLeNet model also occupies first place with the highest OA and OF, 88.25% and 0.88, respectively, and second is the SVM model with an OA of 49.00% and an OF of 0.49.
According to Figure 6, the GoogLeNet model completely surpasses the other three ML models, by at least 70% over the SVM model and with the maximum reaching nearly 450% over the KNN model, on the validation subset. This result indicates that the GoogLeNet model has a significant advantage over the three ML models in both overall accuracy and overall F1 score, on both the test and validation subsets, and suggests that the GoogLeNet model is competent for the recognition of the GSMs. It is worth noting that the three ML models studied could not be used to recognize the GSMs because of their low accuracy in every test and validation; although the SVM model has a significant advantage over the other two, its OA and OF are too low to meet the essential requirement for classification.

3.3. Comparison of Different DL Strategies

Generally, there are two DL strategies: learning from scratch and transfer learning with a pretrained network. Transfer learning can help train a network faster for a new classification task, especially on a small dataset, and is popularly used in many AI applications such as historical building detection [49], gross domestic product (GDP) prediction [50], and scene classification [51]. Of course, DL from scratch is also widely used in AI when there are plenty of data. Both strategies were carried out in the present study to determine which is better for the task of GSM recognition. The technique of freezing deep CNN layers was also used to explore its effect on the DL results.
It is evident from Table 6 and Figure 7 that, for both the validation and test assessments, the accuracy metrics, including OA and OF, obtained using transfer learning with the pretrained network (S3 and S4) are considerably larger than those obtained using DL from scratch (S1 and S2). The results show that transfer learning with the pretrained DL model has a significant accuracy advantage over DL from scratch, whether the initial 10 layers were frozen or not.
The highest overall accuracy when performing DL from scratch is 63.0% (validation), while the highest when performing transfer learning with the pretrained network reaches 93.5% (validation) (Table 6). Even the lowest overall accuracy achieved with transfer learning, 88.8% for the test, exceeds the highest value achieved from scratch by 25.8 percentage points, indicating that transfer learning with a pretrained network could be the preferred choice for the recognition of the GSMs. The highest overall F1 score when performing DL from scratch is 0.67 (test), while the highest when performing transfer learning with the pretrained network reaches 0.94 (validation). The lowest overall F1 score with transfer learning, 0.89 for the test, is larger than the highest one from scratch by 0.22, indicating the same advantage as seen in the overall accuracy.
The overall accuracies of the transfer-learning validation and test are all much higher than those of DL from scratch, by an increase of at least 44%, and all the overall F1 scores of the transfer-learning validation and test are likewise far larger, by an increase of at least 32% (Figure 8). For the overall accuracy, the highest increase, 70.00%, occurs for the validation between the pretrained network with no frozen layer and DL from scratch with the initial 10 layers frozen (S4 vs. S1), while the highest increase for the test, 61.87%, occurs between the pretrained network with the initial 10 layers frozen and DL from scratch with the initial 10 layers frozen (S3 vs. S1). As for the overall F1 scores, the highest increase, 74.07%, occurs for the validation between S4 and S1, while the highest for the test, 64.81%, occurs between S3 and S1. In short, transfer learning with the pretrained GoogLeNet model has a significant accuracy advantage over DL from scratch, whether the initial 10 layers were frozen or not.

3.4. ROC and AUC Comparison

It can be observed that the mean ROC curve of the GoogLeNet model for the eight types of the GSMs is the steepest among the four models, followed by that of the SVM model (Figure 9a,b), suggesting that the GoogLeNet model is the best among the models used. The mean AUC of the GoogLeNet model, 0.9954, is also the largest among the four models, again placing it first. The performance of the GoogLeNet model in classifying the eight different types of the GSMs was further assessed with the ROC curve on the test subset. The results show that it is difficult to distinguish which type of GSM has the best performance because the ROC curves of all types are too steep to separate and the average AUCs of all types approach 1 (Figure 9b). This suggests that the GoogLeNet model has high accuracy for the recognition of each type of the GSMs.

3.5. Comparison of Computational Time

3.5.1. Time of Different Models

The KNN model took the longest time (515.97 s) to train on the training images, followed by the GoogLeNet model (299.38 s), while the DT model took the least time (78.68 s). The KNN model also took the longest time to test and validate, followed by the SVM model, with the GoogLeNet model taking the least time (Table 7), suggesting that the KNN model has the worst performance. Although the GoogLeNet model took more time to train than the SVM and DT models, its times for both the test and validation were the shortest, implying that the GoogLeNet model can predict an image the fastest once it is fully trained. The frame rate of the GoogLeNet model reaches 29.24 frames per second, high enough for real-time recognition. Combined with its highest accuracy, as stated above, the GoogLeNet model is highly feasible for accomplishing the task of recognizing the GSMs.

3.5.2. Time of Different Strategies

Since the GoogLeNet model is the best of the four models used, four strategies were designed to explore the training strategy, all trained on the same data subset with the same parameters: S1 (from scratch, initial 10 layers frozen); S2 (from scratch, no frozen layer); S3 (pretrained, initial 10 layers frozen); and S4 (pretrained, no frozen layer). The results show that strategy S4 took the least time (273.20 s) to train, followed by strategy S3 (279.20 s), while strategy S2 took the longest (307.73 s) (Table 8). As expected, the training time for strategy S1 is shorter than that for strategy S2, and the training times for strategies S3 and S4 are shorter than those for strategies S1 and S2, respectively. However, it is surprising that the training time for strategy S3 is longer than that for strategy S4, given that the weights and biases of the initial 10 layers of S3 are frozen and do not need training. The computational time for the test is in line with expectations, but the time for validation is not, since the time for the strategy with frozen layers is shorter than that with no frozen layer under the same setting (from scratch, or transfer learning with the same pretrained model). This is probably because the sizes of the input images for training, testing, and validation vary greatly.

3.6. Comparison of Accuracies of Different Types of GSMs

As demonstrated above, the GoogLeNet model achieved the best accuracy and performance among the four selected models. Hence, we analyzed in detail the confusion matrix and assessment metrics of the eight types of GSMs obtained by the GoogLeNet model on the testing subset. The results are shown in Table 9 and Table 10, respectively.
It can be seen that the precision (95.92%) of the geyser type outperforms those of the other seven types, and the precision (82.67%) of the NG type is the lowest among all types (Table 9). The recall (94.00%) of the geyser type is also the best among all types, with the lowest falling to the fumarole type. The warm spring type was predicted with 88.00% for both precision and recall, resulting in an F1 score of 0.88 (Table 10).
Of the 50 WS samples, one was predicted as HS, one as MP, two as HA, and two as CL, i.e., they were confused with four types of GSMs, while four HS and two NG samples were predicted as WS. The HS samples were predicted as four WS, one MP, and two NG, i.e., confused with three types of GSMs, while samples of four types (WS, FU, MP, and HA) were predicted as HS, resulting in an F1 score of 0.84, the lowest among all types. The GE samples were confused solely with two FU samples, while the FU samples were confused with three GE samples. The GE type has the best precision, recall, and F1 score (0.95). However, the FU type has relatively high precision but the lowest recall, leading to an F1 score of 0.87. Of the MP samples, 46 were predicted correctly; two were predicted as HS, one as HA, and one as NG, giving a high recall of 92.00%, while one WS, one HS, two FU, and two NG samples were falsely predicted as the MP type, giving a precision of 88.46%. The F1 score for the MP type is thus calculated as 0.90. Accordingly, the F1 scores for the HA, CL, and NG types are calculated as 0.91, 0.92, and 0.84, respectively. The overall precision, recall, accuracy, and F1 score reach 89.14%, 89.00%, 89.00%, and 0.89, respectively, for the GoogLeNet model on the testing subset, with 50 image samples of each type.
These confusions between different types are probably caused by the coexistence of various GSMs in a single photograph (e.g., Figure 2a) and by some human mistakes resulting from the difficulty of visual interpretation and classification. The trained models were confused by some ambiguous photographs with two or more coexisting types of GSMs, so the discriminative features extracted by the model were also confused to some extent. This suggests that creating a high-quality dataset is very important for the feature extraction of an AI task.

4. Discussion

As stated above, the present study confirmed that the deep transfer learning model outperforms the three traditional ML models. Thus, the effects of different data division methods, data augmentation, and hyperparameter optimization on the DL accuracy metrics are discussed below, without further mention of the three ML models. The limitations and future work are also explained and analyzed.

4.1. Effect of Data Division

Generally, the larger the number of examples, the better the performance of the model. As shown in Table 11, scenario A (a 0.8:0.1:0.1 ratio for the training, testing, and validation divisions) clearly outperforms the other two, B (0.6:0.2:0.2) and C (0.4:0.3:0.3), in both overall accuracy and overall F1 score for both validation and test, indicating that the more training images, the better the accuracy of the model. Data augmentation and/or more high-quality images should be used to further improve the accuracy of the GSM-Net.

4.2. Effect of Data Augmentation

Data augmentation is commonly used to improve the accuracy of a model in the DL area. There are many image processing methods for data augmentation, such as flipping, rotation, and translation. We investigated the effect of data augmentation by image flipping to verify the finding stated in Section 4.1, training three times to average out randomness. The division ratio of the JYU-GSM dataset was set to 0.8:0.1:0.1 for the training (400 images per type), testing (50 images per type), and validation (50 images per type) subsets for this investigation. All models were tested and validated on the same test and validation subsets, respectively. A sketch of the offline flip augmentation follows.
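The offline flip augmentation can be sketched in MATLAB as below; we assume 'vertical flip' means an up-down mirror (flipud) and 'horizontal flip' a left-right mirror (fliplr), and the output folder name is illustrative:

```matlab
% Sketch of the offline 2x flip augmentation (assumptions: 'vertical flip'
% = up-down mirror via flipud, 'horizontal flip' = left-right mirror via
% fliplr; folder names are illustrative).
if ~exist('train_vflip', 'dir'); mkdir('train_vflip'); end
for k = 1:numel(imdsTrain.Files)
    [img, info] = readimage(imdsTrain, k);
    [~, base, ext] = fileparts(info.Filename);
    imwrite(flipud(img), fullfile('train_vflip', [base '_v' ext]));
    % For the 4x variant, also save fliplr(img) and fliplr(flipud(img)).
end
```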
As shown in Table 12 and Figure 10, the vertical flip (2x) has a positive effect, increasing the accuracy of both test and validation, while the horizontal flip (2x) and the horizontal flip after the vertical flip (4x) probably have negative effects on the cross-validation accuracies. The overall accuracy rises from (88.75 ± 0.90)% and (88.83 ± 1.66)% with no augmentation to (89.33 ± 0.63)% and (89.67 ± 0.52)% with the 2x (vertical flip) augmentation for validation and test (Table 12), increasing by 0.94% and 0.89%, respectively; a similar trend appears in the overall F1 score (Figure 10b). The standard deviations of both the overall accuracy and the overall F1 score become smaller, indicating that the vertical flip (2x) augmentation could increase the model's prediction robustness in addition to its accuracy. By contrast, a negative effect on the accuracy metrics of the test and validation results occurs when using horizontal and vertical flips together (4x), which is beyond our expectations: the overall accuracy of the 4x augmentation for validation and test goes down to (86.50 ± 1.39)% and (88.17 ± 0.88)%, decreasing by 1.49% and 1.43%, respectively. A similar trend also appears in the overall F1 score for both validation and test (Figure 10b). Thus, as the number of training images grows through data augmentation, the overall accuracy and overall F1 score do not always go up. The reason is probably that some types of images in the training dataset, such as hot springs, geysers, and fumaroles, should be input into the model in their upright orientation for training; otherwise, they influence the accuracy of the model negatively.
To verify this speculation, additional data subsets with the vertical flip (1x), the horizontal flip (1x), and the horizontal flip (2x, including the original training subset) were used to train three further DL models. These three models were validated and tested on the same validation and test subsets as the original ones. As shown in Table 12, the accuracy metrics for validation and test vary little when using the vertical flip (1x) augmentation, while those for validation drop considerably with both augmentation methods involving the horizontal flip; those for the test drop considerably with the horizontal flip (1x) and rise slightly with the horizontal flip (2x). This suggests that the horizontal flip augmentation has a negative effect on the accuracy of the DL model used.
The F1 scores of the eight types of GSMs were further analyzed to reveal the effects of the different data augmentation methods on each single GSM type. As Figure 11 shows, the vertical flip (2x) augmentation has a positive effect on the inference accuracy of the DL model for validation for almost all GSM types except mud pots, and likewise for test for all types except warm springs and geysers. However, the horizontal flip (2x) and horizontal flip after vertical flip (4x) methods decrease the inference accuracies for validation for all GSM types (Figure 11a), while their effects on the test accuracies are mixed. For the test (Figure 11b), the prediction accuracies of WS and NG descend while those of HS, GE, FU, HA, and CL rise when using the horizontal flip (2x); and those of WS, GE, FU, CL, and NG go down while those of HS, MP, and HA rise when using the horizontal flip after vertical flip (4x) method. Therefore, it is necessary to further optimize the JYU-GSM dataset to reduce the uncertainty caused by imbalances in its data distribution. A minimal sketch of the per-class metric computation is given below.
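The per-class precision, recall, and F1 scores behind Figure 11 follow directly from the confusion matrix. This is a minimal sketch, not the authors' released code, assuming trueLabels and predLabels are the categorical vectors returned by classify():

[C, order] = confusionmat(trueLabels, predLabels); % rows: actual; columns: predicted
precision  = diag(C) ./ sum(C, 1)';                % TP ./ (column sums)
recall     = diag(C) ./ sum(C, 2);                 % TP ./ (row sums)
f1         = 2 * precision .* recall ./ (precision + recall);
disp(table(order, precision, recall, f1))          % one row per GSM type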

4.3. Effect of Hyperparameter Optimization

First, we tested the effect of the number of epochs. We set 20 epochs to train the GoogLeNet model with strategy S2 while keeping the other hyperparameters the same. Surprisingly, the model stopped early at the thirteenth epoch, reaching an overall accuracy of only 60.50% for validation and 59.75% for test; the overall F1 scores were below 60% for both. This suggests that training GoogLeNet from scratch, without its pretrained weights, cannot satisfy the accuracy requirements of the GSM recognition task. Therefore, we used transfer learning to retrain the pretrained GoogLeNet model for better results. After retraining the GoogLeNet model many times with more epochs, six epochs were found to be sufficient to obtain a good model. A hedged sketch of this transfer-learning setup is given below.
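The following sketch shows the transfer-learning setup in MATLAB under our assumptions; the layer names 'loss3-classifier' and 'output' are those of MATLAB's pretrained googlenet, while 'new_fc' and 'new_out' are names we introduce here for illustration:

net    = googlenet;                      % pretrained weights (strategies S3/S4)
lgraph = layerGraph(net);
% Replace the 1000-class ImageNet head with an 8-class GSM head.
lgraph = replaceLayer(lgraph, 'loss3-classifier', ...
    fullyConnectedLayer(8, 'Name', 'new_fc', ...
        'WeightLearnRateFactor', 10, 'BiasLearnRateFactor', 10));
lgraph = replaceLayer(lgraph, 'output', classificationLayer('Name', 'new_out'));
% Optional frozen-layer strategy (S3): stop updating the first 10 layers.
for k = 1:10
    L = lgraph.Layers(k);
    if isprop(L, 'WeightLearnRateFactor')
        L.WeightLearnRateFactor = 0;
        L.BiasLearnRateFactor   = 0;
        lgraph = replaceLayer(lgraph, L.Name, L);
    end
end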
Second, we tested the effect of the initial learning rate with different values. The results show that an initial learning rate of 5 × 10−4 is the optimal value for accuracy and performance. When the initial learning rate was set larger or smaller, the accuracy metrics declined relative to those obtained with 5 × 10−4.
Third, the mini-batch size was tested with values of 8, 16, and 32. A mini-batch size of 16 proved the most suitable value for training the GoogLeNet model to better accuracy and performance.
The other hyperparameters were also fine-tuned in our study. Their effects were smaller than those of the initial learning rate and the number of epochs, so they are not discussed in detail here. A sketch of the resulting training options is given below.
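Putting the tuned values together, the training call looks roughly as follows; the solver choice ('sgdm') and validation settings are our assumptions, while the epoch count, initial learning rate, and mini-batch size follow the text (imdsTrain and imdsVal are the datastores from the split sketched in Section 4.1, and lgraph is from the sketch above):

augimdsTrain = augmentedImageDatastore([224 224], imdsTrain); % GoogLeNet input size
augimdsVal   = augmentedImageDatastore([224 224], imdsVal);
options = trainingOptions('sgdm', ...
    'MaxEpochs',        6, ...      % sufficient for the retrained model
    'InitialLearnRate', 5e-4, ...   % optimal value found above
    'MiniBatchSize',    16, ...     % best of {8, 16, 32}
    'Shuffle',          'every-epoch', ...
    'ValidationData',   augimdsVal, ...
    'Plots',            'training-progress', 'Verbose', false);
gsmNet = trainNetwork(augimdsTrain, lgraph, options);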

4.4. Limitations and Future Work

In the AI domain, big data, algorithms, and computing power are considered the three core driving forces of AI. Thanks to great advances in these three forces, substantial progress has been made, driving the resurgence of AI over the recent decade. In the present study, we first created a new image dataset, set up a suitable experimental environment, and then empirically selected four models for comparative analysis to push forward geothermal AI, or Geothermal 4.0 [6]. The present study is a real challenge for DL in the geothermics domain because there has been no suitable high-quality dataset on which to train and develop a DL model for the recognition of GSMs. Despite much hard work, there are inevitable limitations in the dataset creation, DL model selection and design, and hyperparameter optimization. These issues are eternal topics in the field of AI.
First of all, the data preparation and preprocessing method described requires a tremendous amount of manual work, with a corresponding increase in the subjectivity of the assessment and in the likelihood of inaccuracies. It took us considerable effort and time to construct the JYU-GSM, which contains 500 images in each of the eight categories. The key challenge of this dataset task is the manual downloading and labeling of sample images of the eight GSM classes for visual classification. The biggest difficulty lies in the co-occurrence of different types of GSMs, as stated in Section 3.1, which makes it very hard to discriminate the unique features of each type accurately and to classify the GSMs manually by visual interpretation. Although we double-checked and confirmed the classification of the JYU-GSM, some misclassifications may remain and could reduce accuracy. Hence, improving the JYU-GSM and designing a DL architecture specific to the GSM problem for better accuracy deserve further study.
It can be clearly seen from Table 6, Figure 8 and Figure 10 that all assessment results for the validation and test subsets have minor gaps between them, implying that the distribution characteristics of the JYU-GSM are slightly unbalanced. This phenomenon influences the overall accuracy when the input training images are shuffled in each epoch. As a matter of fact, the validation and test results fluctuate slightly around 90% for the OA and 0.9 for the OF. The JYU-GSM should therefore be updated with higher quality so that a more optimal DL model with better performance can be trained, which deserves further detailed study.
It is worth pointing out that the frozen-layer strategy has little effect on the model's accuracy, especially for transfer learning, whose accuracies vary by no more than 5% (Figure 8, S4−S3) whether the strategy is used or not. We also observed that the models retrained from scratch did not converge to a good enough result. More epochs and more images need to be adopted for training and testing to achieve better performance in our next work.
Further optimization and selection of ML algorithms could also be a way to fulfill the GSM recognition task. In addition, more pretrained DL models with different CNN architectures widely used in image recognition, such as VGG16, ResNet-50, and SqueezeNet, deserve further research to explore their feasibility for GSM recognition; swapping backbones is mostly mechanical, as sketched below. Furthermore, how to use these CNN models for other tasks in the geothermics domain, such as geothermal anomaly and target area detection, will be our next major work in the coming years.
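Swapping in such a backbone in MATLAB requires only changing the constructor and the names of the head layers to replace, provided the corresponding support package is installed; a brief sketch under these assumptions:

backbone  = resnet50;                       % or: vgg16, squeezenet, googlenet
inputSize = backbone.Layers(1).InputSize;   % e.g., [224 224 3]
% For ResNet-50 the head layers are 'fc1000' and 'ClassificationLayer_fc1000';
% analyzeNetwork(backbone) helps locate the equivalents in other architectures.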
A recent study indicates that 3He/4He analysis of thermal springs can locate the mantle suture in a continental collision [52]. State-of-the-art methods such as GIS and remote sensing, ML and DL (e.g., ANN, CNN, and transformer models), and other photographic technologies could improve geothermal data analysis and analytical model building. They could make geothermal exploration and evaluation more efficient, and their comprehensive integration deserves future investigation to find more GSMs, such as thermal springs, quickly and accurately, helping to promote broader applications in similar areas of the earth sciences.

5. Conclusions

In this study, we investigated the application of DL in the geothermics domain in comparison with traditional ML, specifically exploring the feasibility of DL for the recognition of GSMs. We created a new image dataset of JiaYing University Geothermal Surface Manifestation photographs, namely the JYU-GSM dataset, and compared the accuracy and performance of one DL model, GoogLeNet, with three traditional ML models, i.e., SVM, DT, and KNN. The results show that the GoogLeNet model significantly outperforms the SVM, DT, and KNN models. The model retrained from the pretrained GoogLeNet is suitable for the GSM recognition task because of its high accuracies, while the three traditional ML models are not suitable enough for this task because of their relatively low accuracies. In conclusion, deep transfer learning is highly feasible for the recognition of GSMs in the geothermics domain once a high-quality GSM dataset is available.

Author Contributions

Conceptualization, Y.X. and M.Z.; methodology, Y.X. and M.Z.; software, Y.X. and K.H.; validation, M.Z., Y.L. and J.L.; formal analysis, Y.X. and K.H.; investigation, Y.X., M.Z., Y.C. and J.L.; resources, Y.C. and J.L.; data curation, Y.X., Y.L., J.L. and M.Z.; writing—original draft preparation, Y.X. and M.Z.; writing—review and editing, Y.X., K.H. and M.Z.; visualization, Y.L.; supervision, Y.X.; project administration, Y.X. and M.Z.; funding acquisition, Y.X., M.Z. and K.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangdong Natural Science Foundation, grant numbers 2017A030307040 and 2020A1515010702; the Guangdong Province Special Project in Key Fields for Universities (New Generation Information Technology), grant number 2020ZDZX3044; the Ordinary University Characteristic Innovation Project of Guangdong Province, grant number 2020KTSCX140; the Research Ability Improvement Project of Key Construction Disciplines in Guangdong Province, grant number 2021ZDJS073; and the Intangible Cultural Heritage Research Base Project of Guangdong province, grant number 17KYKT13. This study was partly supported by the NSFC, grant number 61976104.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The JYU-GSM dataset supporting the reported results can be found and downloaded for free at https://doi.org/10.5281/zenodo.6220526, starting on 22 February 2022. The JYU-GSM dataset and the MATLAB codes for the four machine learning algorithms (SVM, DT, KNN, and GoogLeNet) used during the current study are also available from the corresponding author (Y.X.) upon reasonable request.

Acknowledgments

Y.X. would like to acknowledge the support from the China Scholarship Council (Grant number 201808440171). Y.X. would like to thank the University of Hawaii at Manoa for sharing the usage of MATLAB licenses. The authors would like to thank Baidu Inc. (https://image.baidu.com, accessed on 22 February 2022), Microsoft Inc. (https://en.bing.com, accessed on 22 February 2022), and a great number of tourism service websites and tourists' personal blogs for providing the photographs for free download. The authors would like to thank the editors and reviewers for their valuable comments and suggestions that greatly improved our manuscript, and Research Square for posting our manuscript preprint online at https://doi.org/10.21203/rs.3.rs-1377072/v2, accessed on 22 February 2022.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in the text of this manuscript.
AI	Artificial Intelligence
ANN	Artificial Neural Network
AUC	Area Under the Curve
CL	Crater Lake
CNN	Convolutional Neural Network
CT	Computational Time
DL	Deep Learning
DT	Decision Tree
FPS	Frames Per Second (fps)
FU	Fumarole
GE	Geyser
GIS	Geographic Information System
GPU	Graphics Processing Unit
GSM	Geothermal Surface Manifestation
GSM-Net	Geothermal Surface Manifestation Deep Neural Network Model
HA	Hydrothermal Alteration
HOG	Histogram of Oriented Gradients
HS	Hot Spring
JYU-GSM	JiaYing University Geothermal Surface Manifestation
KNN	K-Nearest Neighbor
ML	Machine Learning
MP	Mud Pot
NG	None GSM (Geothermal Surface Manifestation)
OA	Overall Accuracy
OF	Overall F1 score
ROC	Receiver Operating Characteristic
SVM	Support Vector Machine
WS	Warm Spring

References

1. Gupta, H.; Roy, S. Geothermal Energy: An Alternative Resource for the 21st Century; Elsevier Science: Amsterdam, The Netherlands, 2007; p. 279.
2. Huang, S. Geothermal energy in China. Nat. Clim. Change 2012, 2, 557–560.
3. Pang, Z.; Huang, S.; Hu, S.; Zhao, P.; He, L. Geothermal studies in China: Progress and prospects 1995–2014. Chin. J. Geol. 2014, 49, 719–727.
4. Wang, J. Geothermics and Its Applications; Science Press: Beijing, China, 2015; p. 548.
5. Duoji; Wang, G.; Zheng, K. Study on the Development and Utilization Strategy of Geothermal Resources in China; Science Press: Beijing, China, 2017; p. 148.
6. Muther, T.; Syed, F.I.; Lancaster, A.T.; Salsabila, F.D.; Dahaghi, A.K.; Negahban, S. Geothermal 4.0: AI-enabled geothermal reservoir development-current status, potentials, limitations, and ways forward. Geothermics 2022, 100, 102348.
7. Sedara, S.O.; Alabi, O.O. Heat flow estimation and quantification of geothermal reservoir of a basement terrain using geophysical and numerical techniques. Environ. Earth Sci. 2022, 81, 70.
8. Zhang, X.; Hu, Q. Development of Geothermal Resources in China: A Review. J. Earth Sci.-China 2018, 29, 452–467.
9. Zhou, W.; Hu, X.; Yan, S.; Guo, H.; Chen, W.; Liu, S.; Miao, C. Genetic Analysis of Geothermal Resources and Geothermal Geological Characteristics in Datong Basin, Northern China. Energies 2020, 13, 1792.
10. Peng, C.; Pan, B.; Xue, L.; Liu, H. Geophysical survey of geothermal energy potential in the Liaoji Belt, northeastern China. Geotherm. Energy 2019, 7, 14.
11. He, L.; Chen, L.; Dorji; Xi, X.; Zhao, X.; Chen, R.; Yao, H. Mapping the Geothermal System Using AMT and MT in the Mapamyum (QP) Field, Lake Manasarovar, Southwestern Tibet. Energies 2016, 9, 855.
12. Zhang, G.; Liu, C.; Liu, H.; Jin, Z.; Han, G.; Li, L. Geochemistry of the Rehai and Ruidian geothermal waters, Yunnan Province, China. Geothermics 2008, 37, 73–83.
13. Du, J.; Liu, C.; Fu, B.; Ninomiya, Y.; Zhang, Y.; Wang, C.; Wang, H.; Sun, Z. Variations of geothermometry and chemical-isotopic compositions of hot spring fluids in the Rehai geothermal field, southwestern China. J. Volcanol. Geoth. Res. 2005, 142, 243–261.
14. Minissale, A.A. A simple geochemical prospecting method for geothermal resources in flat areas. Geothermics 2018, 72, 258–267.
15. Chan, H.; Chang, C.; Dao, P.D. Geothermal Anomaly Mapping Using Landsat ETM+ Data in Ilan Plain, Northeastern Taiwan. Pure Appl. Geophys. 2018, 175, 303–323.
16. Calvin, W.M.; Littlefield, E.F.; Kratt, C. Remote sensing of geothermal-related minerals for resource exploration in Nevada. Geothermics 2015, 53, 517–526.
17. Coolbaugh, M.F.; Kratt, C.; Fallacaro, A.; Calvin, W.M.; Taranik, J.V. Detection of geothermal anomalies using Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) thermal infrared images at Bradys Hot Springs, Nevada, USA. Remote Sens. Environ. 2007, 106, 350–359.
18. Hellman, M.J.; Ramsey, M.S. Analysis of hot springs and associated deposits in Yellowstone National Park using ASTER and AVIRIS remote sensing. J. Volcanol. Geoth. Res. 2004, 135, 195–219.
19. Xiong, Y.; Chen, F.; Huang, S. Application of remote sensing technique to the identification of geothermal anomaly in Tengchong area, southwest China. J. Chengdu Univ. Technol. 2016, 43, 109–118.
20. Zhang, Y.; Zhang, Y.; Yu, H.; Li, J.; Xie, Y.; Lei, Z. Geothermal resource potential assessment of Fujian Province, China, based on geographic information system (GIS)-supported models. Renew. Energ. 2020, 153, 564–579.
21. Tende, A.; Aminu, M.; Gajere, J. A spatial analysis for geothermal energy exploration using bivariate predictive modelling. Sci. Rep. 2021, 11, 19755.
22. Wardoyo, G.; Pratama, H.; Sutopo, T.; Ashat, A.; Yudhistira, Y. Application of Artificial Intelligence in Forecasting Geothermal Production. IOP Conf. Ser. Earth Environ. Sci. 2021, 732, 012022.
23. Assouline, D.; Mohajeri, N.; Gudmundsson, A.; Scartezzini, J. A machine learning approach for mapping the very shallow theoretical geothermal potential. Geotherm. Energy 2019, 7, 19.
24. Gentana, D.; Sulaksana, N.; Sukiyah, E.; Yuningsih, E. Morphotectonics of Mount Rendingan Area Related to the Appearances of Geothermal Surface Manifestations. Indones. J. Geosci. 2019, 6, 291–309.
25. Freski, Y.R.; Hecker, C.; van der Meijde, M.; Setianto, A. The effects of alteration degree, moisture and temperature on laser return intensity for mapping geothermal manifestations. Geothermics 2021, 97, 102250.
26. Dramsch, J.S. 70 years of machine learning in geoscience in review. Adv. Geophys. 2020, 61, 1–55.
27. Gangwani, P.; Soni, J.; Upadhyay, H.; Joshi, S. A Deep Learning Approach for Modeling of Geothermal Energy Prediction. Int. J. Comput. Sci. Inf. Secur. 2021, 18, 62–65.
28. Shahdi, A.; Lee, S.; Karpatne, A.; Nojabaei, B. Exploratory analysis of machine learning methods in predicting subsurface temperature and geothermal gradient of Northeastern United States. Geotherm. Energy 2021, 9, 1–22.
29. Yang, W.; Xiao, C.; Zhang, Z.; Liang, X. Identification of the formation temperature field of the southern Songliao Basin, China based on a deep belief network. Renew. Energ. 2022, 182, 32–42.
30. Xu, S.; Guo, Y. The Basis of Geothermics; Science Press: Beijing, China, 2009; p. 207.
31. Wohletz, K.; Heiken, G. Volcanology and Geothermal Energy; University of California Press: Berkeley, CA, USA, 1992.
32. White, D.E. Characteristics of geothermal resources. In Proceedings of the Annual Meeting of the American Geophysical Union, Washington, DC, USA, 16–20 April 1973; Volume 54, p. 4.
33. Donti, P.L.; Kolter, J.Z. Machine Learning for Sustainable Energy Systems. Annu. Rev. Env. Resour. 2021, 46, 719–747.
34. Ribeiro, A.M.N.C.; Do Carmo, P.R.X.; Endo, P.T.; Rosati, P.; Lynn, T. Short- and Very Short-Term Firm-Level Load Forecasting for Warehouses: A Comparison of Machine Learning and Deep Learning Models. Energies 2022, 15, 750.
35. Melgani, F.; Bruzzone, L. Classification of hyperspectral remote sensing images with support vector machines. IEEE Trans. Geosci. Remote 2004, 42, 1778–1790.
36. Lizarazo, I. SVM-based segmentation and classification of remotely sensed data. Int. J. Remote Sens. 2008, 29, 7277–7283.
37. Xiong, Y.; Zhang, Z.; Chen, F. Comparison of Artificial Neural Network and Support Vector Machine Methods for Urban Land Use/Cover Classifications from Remote Sensing Images: A Case Study of Guangzhou, South China. In Proceedings of the 2010 International Conference on Computer Application and System Modeling (ICCASM 2010), IEEE Xplore, Taiyuan, China, 22–24 October 2010; Volume 13.
38. Turker, M.; Koc-San, D. Building extraction from high-resolution optical spaceborne images using the integration of support vector machine (SVM) classification, Hough transformation and perceptual grouping. Int. J. Appl. Earth Obs. 2015, 34, 58–69.
39. Li, J.; Chen, Y.; Schuster, G. Separation of Multi-mode Surface Waves by Supervised Machine Learning Methods. Geophys. Prospect. 2019, 68, 1270–1280.
40. Li, W.; Liu, L.; Gong, W. Multi-objective uniform design as a SVM model selection tool for face recognition. Expert Syst. Appl. 2011, 38, 6689–6695.
41. Garg, A.; Vijayaraghavan, V.; Mahapatra, S.S.; Tai, K.; Wong, C.H. Performance evaluation of microbial fuel cell by artificial intelligence methods. Expert Syst. Appl. 2014, 41 Pt 1, 1389–1399.
42. Chi, W.H.; Chi, J.L. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Net. 2009, 13, 415–425.
43. Xiong, Y.; Wang, R.; Li, Z. Extracting land use/cover of mountainous area from remote sensing images using artificial neural network and decision tree classifications: A case study of Meizhou, China. In Proceedings of the 2010 International Symposium on Intelligence Information Processing and Trusted Computing (IPTC 2010), IEEE Computer Society, Huanggang, China, 28–29 October 2010; pp. 133–136.
44. Otukei, J.R.; Blaschke, T. Land cover change assessment using decision trees, support vector machines and maximum likelihood classification algorithms. Int. J. Appl. Earth Obs. 2010, 12 (Suppl. S1), S27–S31.
45. Tooke, T.R.; Coops, N.C.; Goodwin, N.R.; Voogt, J.A. Extracting urban vegetation characteristics using spectral mixture analysis and decision tree classifications. Remote Sens. Environ. 2009, 113, 398–407.
46. LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444.
47. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842.
48. Stanford Vision and Learning Lab, ImageNet. Available online: https://image-net.org (accessed on 20 January 2022).
49. Xiong, Y.; Chen, Q.; Zhu, M.; Zhang, Y.; Huang, K. Accurate detection of historical buildings using aerial photographs and deep transfer learning. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2020), Waikoloa, HI, USA, 26 September–2 October 2020; pp. 1592–1595.
50. Kumar, S.; Muhuri, P.K. A novel GDP prediction technique based on transfer learning using CO2 emission dataset. Appl. Energ. 2019, 253, 113476.
51. Pires De Lima, R.; Marfurt, K. Convolutional Neural Network for Remote-Sensing Scene Classification: Transfer Learning Analysis. Remote Sens. 2019, 12, 86.
52. Klemperer, S.L.; Zhao, P.; Whyte, C.J.; Darrah, T.H.; Crossey, L.J.; Karlstrom, K.E.; Liu, T.; Winn, C.; Hilton, D.R.; Ding, L. Limited underthrusting of India below Tibet: 3He/4He analysis of thermal springs locates the mantle suture in the continental collision. Proc. Natl. Acad. Sci. USA 2022, 119, e2113877119.
Figure 1. Overall workflow chart of the present study. JYU-GSM is the abbreviation of the JiaYing University Geothermal Surface Manifestation; HOG is short for Histogram of Oriented Gradients, SVM is short for Support Vector Machine, DT is short for Decision Tree, and KNN is short for K-Nearest Neighbor; GSM-Net is the abbreviation of the geothermal surface manifestation deep neural network model retrained from GoogLeNet; GSMs denotes geothermal surface manifestations.
Figure 2. Example images of the eight types of geothermal surface manifestations (GSMs): (a) warm spring, (b) hot spring, (c) geyser, (d) fumarole, (e) mud pot, (f) hydrothermal alteration, (g) crater lake, and (h) none of GSM.
Figure 3. Workflow chart of the data preparation and preprocessing. JYU-GSM is the abbreviation for the JiaYing University Geothermal Surface Manifestation photographs dataset.
Figure 4. Training progress screenshot graph of the GoogLeNet model in MATLAB.
Figure 5. Test image presentations of different classifier models: (a) The GoogLeNet model, (b) the Support Vector Machine (SVM) model, (c) the Decision Tree (DT) model, and (d) the K-Nearest Neighbor (KNN) model. WS is short for warm spring; HS is short for hot spring; GE is short for geyser; FU is short for fumarole; MP is short for mud pot; HA is short for hydrothermal alteration; CL is short for crater lake, and NG is short for none of the geothermal surface manifestations. The maximum probability of an image predicted to be a type is labeled after the abbreviation by the GoogLeNet model. The digital number before “Label” is the numerical order of an image.
Figure 6. Assessment metric change comparison of different pairs of models. (a) Overall accuracy; (b) Overall F1 score. GLN denotes the GoogLeNet model; SVM denotes the Support Vector Machine model; DT denotes the Decision Tree model; and KNN denotes the K-Nearest Neighbor model. GLN–SVM means the difference of the accuracy metrics for GLN and SVM, and the others are the same as this meaning.
Figure 7. Assessment metric comparison of different training strategies. (a) Overall accuracy; (b) Overall F1 score. S1 denotes the training strategy from scratch and with the initial 10 layers frozen; S2 denotes the training strategy from scratch and with no frozen layer; S3 denotes the training strategy of the pretrained network and with the initial 10 layers frozen; S4 denotes the training strategy of the pretrained network and no frozen layer.
Figure 8. Assessment metric change comparison of different pairs of training strategies. (a) Overall accuracy; (b) Overall F1 score. S1 denotes the training strategy from scratch and with the initial 10 layers frozen; S2 denotes the training strategy from scratch and with no frozen layer; S3 denotes the training strategy of the pretrained network and with the initial 10 layers frozen; S4 denotes the training strategy of the pretrained network and no frozen layer. S2−S1 means the difference of the accuracy metrics for S2 and S1, and the others are the same as this meaning.
Figure 9. Receiver operating characteristic (ROC) curve. (a) Test assessment for the four classifier models. (b) Test assessment for different types of geothermal surface manifestations (GSMs) by the GoogLeNet model. AUC means the area under the curve. TLG denotes transfer learning with the pretrained GoogLeNet; SVM denotes the Support Vector Machine algorithm; DT denotes the Decision Tree algorithm; and KNN denotes the K-Nearest Neighbor algorithm. WS denotes warm spring; HS denotes hot spring; GE denotes geyser; FU denotes fumarole; MP denotes mud pot; HA denotes hydrothermal alteration; CL denotes crater lake; NG denotes none of GSMs; and AVG denotes the average of all AUC values of the eight types of GSMs.
Figure 10. Assessment metric comparison of the validation and test results with data augmentation. (a) Overall accuracy; (b) Overall F1 score. The division ratio of the JYU-GSM dataset is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets. OR (1x) denotes the original training subset; VF (2x) denotes the OR (1x) and its vertical flip (2x); HF (2x) denotes the OR (1x) and its horizontal flip (2x); HF_VF (4x) denotes the OR (1x), its vertical flip, and their horizontal flip. All models are tested and validated on the same test and validation subsets, respectively.
Figure 11. F1 score comparison of different GSMs with data augmentation. (a) Validation and (b) test. The division ratio of the JYU-GSM dataset is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets. OR (1x) denotes the original training subset; VF (2x) denotes the OR (1x) and its vertical flip (2x); HF (2x) denotes the OR (1x) and its horizontal flip (2x); HF_VF (4x) denotes the OR (1x), its vertical flip, and their horizontal flip. All models are tested and validated on the same test and validation subsets, respectively. WS is short for warm spring; HS is short for hot spring; GE is short for geyser; FU is short for fumarole; MP is short for mud pot; HA is short for hydrothermal alteration; CL is short for crater lake; and NG is short for none of the geothermal surface manifestations.
Table 1. The size variation and number of the eight types of geothermal surface manifestations.
Type | Width (Pixel) | Height (Pixel) | Number
1. Warm spring | 252~448 | 142~448 | 500
2. Hot spring | 252~448 | 206~448 | 500
3. Geyser | 210~448 | 193~448 | 500
4. Fumarole | 277~448 | 142~448 | 500
5. Mud pot | 280~448 | 150~448 | 500
6. Hydrothermal alteration | 297~448 | 191~448 | 500
7. Crater lake | 254~448 | 136~448 | 500
8. None GSM * | 280~448 | 214~448 | 500
* GSM denotes geothermal surface manifestation.
Table 2. The network architecture of GoogLeNet. Reprinted with permission from Ref. [47]. 2014, Szegedy, et al.
Type | Patch Size/Stride | Output Size | Depth | #1 × 1 | #3 × 3 Reduce * | #3 × 3 | #5 × 5 Reduce | #5 × 5 | Pool Proj | Params | Ops
Convolution | 7 × 7/2 | 112 × 112 × 64 | 1 | | | | | | | 2.7 K | 34 M
Max pool | 3 × 3/2 | 56 × 56 × 64 | 0
Convolution | 3 × 3/1 | 56 × 56 × 192 | 2 | | 64 | 192 | | | | 112 K | 360 M
Max pool | 3 × 3/2 | 28 × 28 × 192 | 0
Inception (3a) | | 28 × 28 × 256 | 2 | 64 | 96 | 128 | 16 | 32 | 32 | 159 K | 128 M
Inception (3b) | | 28 × 28 × 480 | 2 | 128 | 128 | 192 | 32 | 96 | 64 | 380 K | 304 M
Max pool | 3 × 3/2 | 14 × 14 × 480 | 0
Inception (4a) | | 14 × 14 × 512 | 2 | 192 | 96 | 208 | 16 | 48 | 64 | 364 K | 73 M
Inception (4b) | | 14 × 14 × 512 | 2 | 160 | 112 | 224 | 24 | 64 | 64 | 437 K | 88 M
Inception (4c) | | 14 × 14 × 512 | 2 | 128 | 128 | 256 | 24 | 64 | 64 | 463 K | 100 M
Inception (4d) | | 14 × 14 × 528 | 2 | 112 | 144 | 288 | 32 | 64 | 64 | 580 K | 119 M
Inception (4e) | | 14 × 14 × 832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 840 K | 170 M
Max pool | 3 × 3/2 | 7 × 7 × 832 | 0
Inception (5a) | | 7 × 7 × 832 | 2 | 256 | 160 | 320 | 32 | 128 | 128 | 1072 K | 54 M
Inception (5b) | | 7 × 7 × 1024 | 2 | 384 | 192 | 384 | 48 | 128 | 128 | 1388 K | 71 M
Avg pool | 7 × 7/1 | 1 × 1 × 1024 | 0
Dropout (40%) | | 1 × 1 × 1024 | 0
Linear | | 1 × 1 × 1000 | 1 | | | | | | | 1000 K | 1 M
Softmax | | 1 × 1 × 1000 | 0
* The “#3 × 3 reduce” and “#5 × 5 reduce” in the table indicate the number of 1 × 1 convolutions used before 3 × 3 and 5 × 5 convolution operations, respectively.
Table 3. Confusion matrix of binary classification of artificial intelligence.
Confusion Matrix | Predicted True | Predicted False
Actual positive | TP * | FN
Actual negative | FP | TN
* True positive (TP): the actual category of the sample is positive, and the result predicted by the model is also positive. True negative (TN): the actual category of the sample is negative, and the model predicts it to be negative. False positive (FP): the actual category of the sample is negative, but the model predicts it to be positive. False negative (FN): the actual category of the sample is positive, but the model predicts it as negative.
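For reference, the standard metric definitions implied by this matrix (consistent with the footnote above) are:

\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},
F_1 = \frac{2\,\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}.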
Table 4. Parameter setup for training the SVM, DT, and KNN models.
Model | SVM * | DT | KNN
BinaryLearners | 1 × 1 Fittemplate | 1 × 1 Fittemplate | 1 × 1 Fittemplate
CodingName | onevsone | onevsone | onevsone
FitPosterior | 0 | 0 | 0
Method | ECOC | ECOC | ECOC
Type | classification | classification | classification
BinaryLoss | hinge | quadratic | quadratic
Surrogate | n.p. | on | n.p.
MaxNumSplits | n.p. | 1 | n.p.
NumNeighbors | n.p. | n.p. | 5
Standardize | n.p. | n.p. | 1
* SVM denotes Support Vector Machine; DT denotes Decision Tree; and KNN denotes K-Nearest Neighbor. ECOC denotes Error-Correcting Output Codes. “n.p.” denotes no such parameter. The division ratio is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets.
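The setup of Table 4 maps onto MATLAB's fitcecoc as sketched below; the HOG feature matrices XTrain/XTest and labels YTrain are illustrative names, and the template parameters mirror the table (a hinge-loss SVM by default, a one-split tree with surrogates, and a standardized 5-NN):

tSVM = templateSVM();                                      % BinaryLoss: hinge (default)
tDT  = templateTree('Surrogate', 'on', 'MaxNumSplits', 1);
tKNN = templateKNN('NumNeighbors', 5, 'Standardize', 1);
% XTrain: one row of HOG features per image (e.g., from extractHOGFeatures);
% YTrain: categorical labels of the eight GSM classes.
mdlSVM = fitcecoc(XTrain, YTrain, 'Learners', tSVM, 'Coding', 'onevsone');
mdlDT  = fitcecoc(XTrain, YTrain, 'Learners', tDT,  'Coding', 'onevsone');
mdlKNN = fitcecoc(XTrain, YTrain, 'Learners', tKNN, 'Coding', 'onevsone');
YPred  = predict(mdlSVM, XTest);                           % evaluate on the test subset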
Table 5. Assessment results of accuracies of the GoogLeNet and three other ML models.
Model | Subset | Overall Accuracy (%) | Overall F1 Score
GoogLeNet | validation | 91.25 | 0.91
 | test | 88.25 | 0.88
SVM * | validation | 53.50 | 0.53
 | test | 49.00 | 0.49
DT | validation | 26.00 | 0.26
 | test | 26.00 | 0.26
KNN | validation | 20.25 | 0.17
 | test | 20.25 | 0.17
* SVM denotes Support Vector Machine; DT denotes Decision Tree; and KNN denotes K-Nearest Neighbor. The division ratio is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets.
Table 6. Assessment results of different training strategies for deep learning.
Strategy | Subset | OA * (%) | Overall F1 Score
S1—from scratch, initial 10 layers frozen | validation | 55.0 | 0.54
 | test | 55.3 | 0.54
S2—from scratch, no layer frozen | validation | 63.0 | 0.62
 | test | 57.5 | 0.67
S3—pretrained, initial 10 layers frozen | validation | 90.8 | 0.90
 | test | 89.5 | 0.89
S4—pretrained, no layer frozen | validation | 93.5 | 0.94
 | test | 88.8 | 0.89
* OA denotes overall accuracy. The division ratio is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets. The same parameters for training were set.
Table 7. Assessment results of computational time of the GoogLeNet and three other ML models.
Model | Ttrain * (s) | Tval (s) | FPSval | Ttest (s) | FPStest
GoogLeNet | 299.38 | 1.98 | 25.25 | 1.71 | 29.24
SVM | 121.02 | 9.22 | 5.42 | 7.71 | 6.49
DT | 78.68 | 6.88 | 7.27 | 6.75 | 7.41
KNN | 515.97 | 65.07 | 0.77 | 64.58 | 0.77
* Ttrain denotes the total computational time for training with the same parameters; Tval is the computational time for validation; and Ttest is the computational time for testing. FPSval denotes frames per second for validation. FPStest denotes frames per second for the test. SVM denotes Support Vector Machine; DT denotes Decision Tree; and KNN denotes K-Nearest Neighbor. The division ratio is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets.
Table 8. Assessment results of computational time of the GoogLeNet model using different training strategies.
Strategy | Ttrain * (s) | Ttest (s) | Tval (s)
S1—from scratch, initial 10 layers frozen | 295.06 | 1.50 | 1.89
S2—from scratch, no layer frozen | 307.73 | 1.52 | 1.97
S3—pretrained, initial 10 layers frozen | 279.20 | 1.64 | 1.98
S4—pretrained, no layer frozen | 273.20 | 1.68 | 1.84
* Ttrain denotes the total computational time for training with the same parameters; Ttest is the computational time for testing; and Tval is the computational time for validation. The division ratio is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets.
Table 9. Confusion matrix of the eight types of geothermal surface manifestations for the GoogLeNet model on the testing subset.
Actual Label \ Predicted Label | 1. WS | 2. HS | 3. GE | 4. FU | 5. MP | 6. HA | 7. CL | 8. NG | Recall (%)
1. WS * | 44 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 88.00
2. HS | 4 | 43 | 0 | 0 | 1 | 0 | 0 | 2 | 86.00
3. GE | 0 | 0 | 47 | 3 | 0 | 0 | 0 | 0 | 94.00
4. FU | 0 | 2 | 2 | 41 | 2 | 1 | 0 | 2 | 82.00
5. MP | 0 | 2 | 0 | 0 | 46 | 1 | 0 | 1 | 92.00
6. HA | 0 | 4 | 0 | 0 | 0 | 46 | 0 | 0 | 92.00
7. CL | 0 | 0 | 0 | 0 | 0 | 0 | 46 | 4 | 92.00
8. NG | 2 | 0 | 0 | 0 | 2 | 1 | 2 | 43 | 86.00
Precision (%) | 88.00 | 82.69 | 95.92 | 93.20 | 88.46 | 90.20 | 92.00 | 82.67 |
* WS is short for warm spring; HS is short for hot spring; GE is short for geyser; FU is short for fumarole; MP is short for mud pot; HA is short for hydrothermal alteration; CL is short for crater lake; and NG is short for none of the geothermal surface manifestations.
Table 10. Assessment metrics of the eight types of geothermal surface manifestations for the GoogLeNet model on the testing subset.
Type | Precision (%) | Recall (%) | Accuracy (%) | F1 Score
1. WS * | 88.00 | 88.00 | | 0.8800
2. HS | 82.69 | 86.00 | | 0.8431
3. GE | 95.92 | 94.00 | | 0.9495
4. FU | 93.20 | 82.00 | | 0.8723
5. MP | 88.46 | 92.00 | | 0.9020
6. HA | 90.20 | 92.00 | | 0.9109
7. CL | 92.00 | 92.00 | | 0.9200
8. NG | 82.67 | 86.00 | | 0.8431
Overall | 89.14 | 89.00 | 89.00 | 0.8901
* WS is short for warm spring; HS is short for hot spring; GE is short for geyser; FU is short for fumarole; MP is short for mud pot; HA is short for hydrothermal alteration; CL is short for crater lake; and NG is short for none of the geothermal surface manifestations.
Table 11. Accuracy comparison of the training results of three scenarios of data division by deep learning with the pretrained GoogLeNet model.
Scenario | Subset | Overall Accuracy (%) | Overall F1 Score
A-ratio 0.8:0.1:0.1 (400:50:50) | validation | 90.50 | 0.9047
 | test | 89.75 | 0.8975
B-ratio 0.6:0.2:0.2 (300:100:100) | validation | 88.50 | 0.8849
 | test | 88.75 | 0.8871
C-ratio 0.4:0.3:0.3 (200:150:150) | validation | 86.25 | 0.8624
 | test | 88.25 | 0.8827
Table 12. Assessment metric comparison of the cross-validation results with image transformation and data augmentation.
No. | Augmentation | Subset | Overall Accuracy (%) | Overall F1 Score
0 | OR (1x) * | validation | 88.75 ± 0.90 | 0.8881 ± 0.0089
 | | test | 88.83 ± 1.66 | 0.8881 ± 0.0167
1 | VF (1x) | validation | 88.83 ± 0.95 | 0.8886 ± 0.0099
 | | test | 87.67 ± 0.80 | 0.8761 ± 0.0081
2 | HF (1x) | validation | 76.50 ± 3.12 | 0.7627 ± 0.0318
 | | test | 79.75 ± 0.66 | 0.7961 ± 0.0061
3 | VF (2x) | validation | 89.33 ± 0.63 | 0.8937 ± 0.0059
 | | test | 89.67 ± 0.52 | 0.8960 ± 0.0054
4 | HF (2x) | validation | 86.58 ± 0.38 | 0.8659 ± 0.0040
 | | test | 89.50 ± 1.50 | 0.8944 ± 0.0158
5 | HF_VF (4x) | validation | 86.50 ± 1.39 | 0.8650 ± 0.0134
 | | test | 88.17 ± 0.88 | 0.8815 ± 0.0084
* The division ratio of the JYU-GSM dataset is 0.8:0.1:0.1 for the training (400 images), testing (50 images), and validation (50 images) subsets. OR (1x) denotes the original training subset; VF (1x) denotes the vertical flip (1x) of the OR (1x); HF (1x) denotes the horizontal flip (1x) of the OR (1x); VF (2x) denotes the OR (1x) and its vertical flip (2x); HF (2x) denotes the OR (1x) and its horizontal flip (2x); HF_VF (4x) denotes the OR (1x), its vertical flip, and their horizontal flips. All models were tested and validated on the same test and validation subsets, respectively. The overall accuracy and overall F1 score are the averages of three training runs, each reported with its standard deviation following the average value.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
