Article

Exploring Multi-Fidelity Data in Materials Science: Challenges, Applications, and Optimized Learning Strategies

1 School of Computer, Beijing Information Science and Technology University, Beijing 100101, China
2 Beijing Advanced Innovation Center for Materials Genome Engineering, Beijing Information Science and Technology University, Beijing 100101, China
3 State Key Laboratory of High-Efficiency Utilization of Coal and Green Chemical Engineering, College of Chemistry and Chemical Engineering, Ningxia University, Yinchuan 750021, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13176; https://doi.org/10.3390/app132413176
Submission received: 19 October 2023 / Revised: 6 December 2023 / Accepted: 7 December 2023 / Published: 12 December 2023

Abstract

Machine learning techniques offer tremendous potential for optimizing resource allocation in real-world problems. However, the emergence of multi-fidelity data introduces new challenges. This paper offers an overview of the definition, applications, data preprocessing methodologies, and learning approaches associated with multi-fidelity data. To validate the algorithms, we examine three widely used learning methods relevant to multi-fidelity data by designing multi-fidelity datasets that encompass various types of noise. As expected, employing multi-fidelity data learning methods yields better results than using high-fidelity data alone. Additionally, given the various types of noise inherent in datasets, the comprehensive correction strategy proves the most effective. Moreover, multi-fidelity learning methods facilitate effective decision-making by enabling the combination of datasets from various sources: they extract knowledge from lower fidelity data, improving model accuracy compared with models relying solely on high-fidelity data.

1. Introduction

The concept of multi-fidelity (MF) appears frequently in science and engineering. It describes data or models of differing precision, referred to as MF data and MF models. Generally, in MF data, large amounts of data can be acquired at low cost but are of coarse quality (low-fidelity (LF) data), while small amounts of data are available with very high accuracy (high-fidelity (HF) data). This is the so-called “cost–accuracy trade-off”. In practical applications, multi-fidelity data are not limited to two levels; datasets may span five [1,2] or even more fidelity levels. However, most recent works still focus largely on the performance of models trained on single-fidelity data [3,4,5]. These efforts face either insufficient data or poor data quality [6].
To overcome these problems, many methods are designed to exploit the multi-fidelity data efficiently, such as iterative denoising [1], information fusion [7,8], Bayesian optimization [9,10], etc. Iterative denoising enhances the model’s accuracy by iteratively training on MF data with specific permutations and applying denoising operations on specific data after each iteration. Information fusion algorithms fuse data information by learning the relationships between HF and LF data, effectively leveraging MF data. Bayesian optimization uses the correlation between HF and LF data to reduce the computational cost and improve the prediction efficiency when iteratively searching for the best value.
The systematic exposition and validation of multi-fidelity data learning methods in other fields have been extensively reviewed [11,12,13]. However, within the realm of materials science, there remains a notable absence of an overview and validation of multi-fidelity data learning methods. This paper aims to address this gap by presenting an introduction and validation of multi-fidelity data learning methods, with a specific focus on their application in materials science. The paper highlights the widespread presence of multi-fidelity data in the materials domain, emphasizing that employing multi-fidelity data learning methods in the face of such data scenarios is undoubtedly a promising choice. Additionally, the article introduces several typical multi-fidelity datasets within the materials domain, enabling interested readers to swiftly access them.
The partitioning of multi-fidelity data requires combining prior knowledge with quantitative evaluation methods. Considering both the application fields and the data acquisition methodologies, a categorization of how multi-fidelity data are generated is summarized in Table 1. MF data often appear in materials science, aerospace science, and mechanical engineering, as shown in Figure 1. In each field, there are often a variety of different calculation results for the same object, and these results may differ in accuracy and quantity. We discuss each field in turn.

1.1. Multi-Fidelity Data from Different Algorithms

In materials science, the existence of multi-fidelity data is primarily attributed to the diverse methodologies employed, ranging from experimental procedures to theoretical calculations. Experimental methods are generally more time-consuming than their theoretical counterparts but offer higher accuracy. Some works focus solely on theoretical methods; their key results are summarized in Table 2. It is essential to recognize that even within theoretical computation, fidelity varies significantly. For example, commonly used empirical potentials are generally considered low-fidelity (LF) methods compared to density functional theory (DFT), which is often regarded as high-fidelity (HF). However, even within DFT there exists a ladder of accuracy: the Perdew–Burke–Ernzerhof (PBE) functional, for instance, can be categorized as high-fidelity in some contexts and low-fidelity in others. DFT calculations of material properties often carry systematic errors [54,55,56] and random errors relative to true values. For example, it has been reported that DFT calculations tend to underestimate the band gap width by 30% to 100% [57]. Random errors typically arise from the initial trial charge density used in the DFT calculation: different trial charge densities lead to different random errors, and different self-consistency convergence criteria may also introduce them. Systematic errors, on the other hand, are often introduced when computing the Hamiltonian in DFT: the Kohn–Sham equations, used to compute the exchange–correlation potential and solve for the electronic wave functions, typically employ different approximation methods, each of which introduces different errors.
All of these errors arise during the DFT calculation process. For the same calculation target, the different errors resulting from different functionals can be viewed as discrepancies between HF and LF. These discrepancies can be corrected using multi-fidelity data learning methods.
Different algorithms also generate MF data outside materials science. In aerospace science, different computational fluid dynamics (CFD) methodologies or CFD tools (e.g., Q3D and MATRICS-V [33]) are commonly used to obtain MF data. The Reynolds-Averaged Navier–Stokes (RANS) method is commonly used to obtain HF data [30,31,32], and the potential flow model to obtain LF data [26,29]. In reference [25], RANS provides the LF data, while large-eddy simulations provide the HF data. In reference [27], URANS and Euler models provide LF and HF data, respectively. In reference [28], the doublet-lattice method and CFD-based flutter calculation provide LF and HF data, respectively. In mechanical engineering, different mathematical models (e.g., different finite element models [39,42] and surrogate models [12,40,41]) or different measuring tools (e.g., different strain sensors [43] and different wind-turbine-specific aero-servo-elastic computer simulators [44]) are commonly used to obtain MF data. For other fields, mainly biomedicine and economics, the methods of acquiring multi-fidelity data are listed in Table 1.
In addition to MF data obtained through theoretical calculations, more accurate data can be obtained through manual experiments; experimentally obtained data often serve as the highest-fidelity data [1,34,35].

1.2. Multi-Fidelity Data from Different Hyperparameters

Different hyperparameters under the same method may also lead to the generation of MF data. The hyperparameters here are generally parameters that affect the computational complexity. For example, in theoretical calculations, different mesh sizes are often used to obtain MF data, where the smaller mesh corresponds to HF and the larger mesh to LF [21,36,45]. In materials science, using the same functional but specifying different parameters (such as the plane-wave cutoff energy, k-points, and atomic relaxation) also yields different results [24]. In addition, MF data are obtained using different time steps [22,23] or different convergence conditions and simulated objects [10]. Running a calculation for different numbers of iterations is another way to generate MF data: data obtained from partial convergence [64] can be regarded as LF data, while data obtained from full convergence can be regarded as HF data. For example, in reference [48], the results of approximately six hundred iterations and one hundred iterations are regarded as HF and LF, respectively.
The acquisition method determines the fidelity of MF data. The fidelity level of data obtained in different ways must be judged by an evaluation method or prior knowledge, and the choice of evaluation method or prior knowledge in turn affects the assigned fidelity.

1.3. Multi-Fidelity Datasets under Materials Science

In materials science, machine learning is mostly used to predict material property values. Widely used datasets include the Materials Project (MP) [65] and the Open Quantum Materials Database (OQMD) [66]. These datasets are usually computed with density functional theory (DFT) or other theoretical methods and therefore deviate to varying degrees from the true property values. Moreover, for some materials, different measurement methods yield different results for the same property [67].
Multi-fidelity data are prevalent in materials science and appear extensively in existing datasets. For example, in the MP dataset there are multiple calculation methods, mostly theoretical, for a given attribute of the same substance; the attribute value calculated by a particular method can be retrieved by calling the MP API. This section introduces two large datasets containing a wide range of MF data: the band gap dataset calculated with different density functionals in MP [1,2] and the molecular ultraviolet/visible spectral attribute dataset [19] obtained from different experiments, as shown in Figure 2 and Figure 3.
As can be seen from Figure 2, the amount of data calculated with the PBE functional far exceeds that calculated with HSE, SCAN, and GLLB. Because of their large volume, PBE datasets are often used as LF data; some studies further subdivide them, assign weights of different sizes, and study MF data learning methods on that basis [68].
In the materials field, more than two datasets are sometimes used to construct MF models simultaneously. For example, in reference [1], the four datasets in Figure 2 are used as MF datasets at the same time, and the influence of different combination methods on the training results is discussed. The reader can obtain an MF band gap dataset from MP at https://doi.org/10.6084/m9.figshare.13040330 (accessed on 3 December 2023).
Figure 3 shows datasets of dyes and solvents for optical materials, obtained through different experiments, together with their quantities and storage forms (e.g., names or SMILES). The nine common datasets, from largest to smallest, are: Deep4Chem [69], ChemDataExtractor [70], DSSCDB [71], ChemFlour [72], Dye aggregation [73], NIST [74], Fluorophores.org [75], PhotochemCAD [76], and UV/Vis+ [77]. Reference [19] adopts the top five datasets for its experiments; the relevant data and code can be found in reference [19].

2. Multi-Fidelity Data Learning Methods

The key starting point of an MF data learning method is to find the relationship between data of different fidelities. Learning methods for MF data fall into two main directions [6]: (1) organizing HF and LF model information hierarchically, known as the multi-fidelity hierarchical model (MFHM) [78,79,80]; and (2) fusing HF and LF model information through surrogate model construction, known as the multi-fidelity surrogate model (MFSM) [7]. This article focuses on the MFSM. Surrogate models, also known as meta-models, response surface models, or approximation models, are simplifications of the true model. Their advantage is that the physical meaning linking model inputs to response values can be ignored, allowing one to focus directly on the data themselves for mathematical modeling [81].
MFSMs can be divided into deterministic methods (DMs) and non-deterministic methods (NDMs) according to how their parameters are estimated [6]. A DM finds its parameters by minimizing the difference between data and functions [82,83]. An NDM assumes that the function or its parameters are uncertain and uses samples to reduce that uncertainty [84]. The most popular NDM is the Bayesian framework [9,10,85], in which the posterior distribution of the model parameters depends on the likelihood and the prior distribution of the unknown parameters. A common implementation of the Bayesian framework is the Gaussian process (GP) [86,87,88], and the validation section uses a GP-based NDM. GPs are employed because of their established track record of handling multi-fidelity data effectively and delivering favorable outcomes [89,90,91]. Statistics indicate that, among the numerous methods used for constructing surrogate models, 37% utilize GPs [92].
In this paper, we use GPs for the subsequent experimental validation. The chosen task is regression, for which we deliberately introduced various types of noise into a small, low-dimensional dataset. Traditional linear regression struggles to handle MF datasets with diverse noise types [93]. A GP, being an NDM, provides estimates of predictive uncertainty, which is highly valuable when dealing with such data [94]. Moreover, references [95,96] found that NDMs tend to be more accurate than DMs. Regardless of whether an MFSM is a DM or an NDM, its correction approaches can be classified into three major categories: additive correction, multiplicative correction, and comprehensive correction. We describe each in the following sections.

2.1. Additive Correction

Additive correction (AC) [97] corrects the response value of the LF data by constructing a surrogate model of the difference between the HF data and the LF data, which is used as an approximation of the response value of the HF model. The equation can be expressed as Equation (1):
$$\hat{y}_h(x) = y_l(x) + \delta(x) \qquad (1)$$
In Equation (1), $\hat{y}_h(x)$ is the approximation of the HF response value at $x$, while $y_l(x)$ is the response value of the LF data at the same position. The term $\delta(x)$ is a surrogate model that captures the difference between the HF and LF data at $x$. The core idea is to combine the LF response value $y_l(x)$ with the difference function $\delta(x)$ to approximate the HF response value, hence the name additive correction. $\Delta$-learning [98,99] is also considered an additive correction method.
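As an illustration, Equation (1) can be realized with two Gaussian process surrogates: one fitted to the abundant LF data and one fitted to the HF−LF differences at the HF inputs. The following is a minimal sketch using scikit-learn; the true function, noise levels, and sample sizes are hypothetical stand-ins, not the paper's setup.

```python
# Additive correction (Eq. (1)) sketched with two scikit-learn GPR surrogates.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
f = lambda x: np.sin(8 * x)                   # hypothetical true response
X_l = np.linspace(0, 1, 200)[:, None]         # abundant LF inputs
X_h = np.linspace(0, 1, 20)[:, None]          # scarce HF inputs
y_l = f(X_l).ravel() + 0.3 + 0.05 * rng.standard_normal(200)  # biased, noisy LF data
y_h = f(X_h).ravel() + 0.01 * rng.standard_normal(20)         # accurate HF data

kernel = RBF(0.1) + WhiteKernel(1e-3)
lf_model = GaussianProcessRegressor(kernel=kernel).fit(X_l, y_l)   # surrogate for y_l(x)
delta = GaussianProcessRegressor(kernel=kernel).fit(
    X_h, y_h - lf_model.predict(X_h))                              # surrogate for delta(x)

def predict_hf(X):
    """Additive correction: y_hat_h(x) = y_l(x) + delta(x)."""
    return lf_model.predict(X) + delta.predict(X)
```

Here the LF data carry a constant bias, which the difference surrogate $\delta(x)$ learns and removes.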

2.2. Multiplicative Correction

Multiplicative correction (MC) [100] corrects the response value of the LF data by constructing a surrogate model of the ratio between the HF data and the LF data. The equation can be expressed as Equation (2):
$$\hat{y}_h(x) = \rho(x)\, y_l(x) \qquad (2)$$
In Equation (2), $\hat{y}_h(x)$ is the approximation of the HF response value at $x$, while $y_l(x)$ is the response value of the LF data at the same position. The key element is the surrogate model $\rho(x)$, which models the ratio between the HF and LF data; the product of the LF response value $y_l(x)$ and the ratio function $\rho(x)$ approximates the HF response value.
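Analogously, Equation (2) can be sketched by fitting a surrogate to the pointwise HF/LF ratio. The true function and noise are again hypothetical; the function is kept strictly positive so that the ratio denominator stays away from zero.

```python
# Multiplicative correction (Eq. (2)): a GPR surrogate for the HF/LF ratio.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)
f = lambda x: np.sin(8 * x) + 2.0             # hypothetical true response, kept > 0
X_l = np.linspace(0, 1, 200)[:, None]
X_h = np.linspace(0, 1, 20)[:, None]
y_l = 0.8 * f(X_l).ravel() + 0.02 * rng.standard_normal(200)  # scaled-down LF data
y_h = f(X_h).ravel() + 0.01 * rng.standard_normal(20)         # accurate HF data

kernel = RBF(0.1) + WhiteKernel(1e-3)
lf_model = GaussianProcessRegressor(kernel=kernel).fit(X_l, y_l)
# LF predictions stay well away from zero here, so the ratio is well defined.
ratio = y_h / lf_model.predict(X_h)
rho = GaussianProcessRegressor(kernel=kernel).fit(X_h, ratio)     # surrogate for rho(x)

def predict_hf(X):
    """Multiplicative correction: y_hat_h(x) = rho(x) * y_l(x)."""
    return rho.predict(X) * lf_model.predict(X)
```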

2.3. Comprehensive Correction

When the HF response value cannot be approximated by a simple additive correction or multiplicative correction, it is better to consider using a comprehensive correction (CC), which integrates additive correction and multiplicative correction. Generally, it can be expressed as Equation (3):
$$\hat{y}_h(x) = \rho(x)\, y_l(x) + \delta(x) \qquad (3)$$
where $\rho(x)$ and $\delta(x)$ are the multiplicative and additive correction surrogate models, respectively, and $\rho(x)$ is usually replaced by a constant $\rho$ [9,90,95,101]. The replacement equation is as follows:
$$\hat{y}_h(x) = \rho\, y_l(x) + \delta(x) \qquad (4)$$
In the validation section of this paper, Gaussian process regression (GPR) [102,103] is employed as the surrogate model. GPR is based on Bayesian principles: it characterizes model uncertainty by constructing a prior distribution and updates this distribution with the observed data to obtain a posterior distribution. Specifically, GPR uses a kernel function to compute correlations between data points and constructs a covariance matrix from them. The procedure is as follows:
Assume a dataset $(X, y)$, where $X$ and $y$ are the input features and their corresponding labels, and a new point $x_*$ whose response we want to predict with GPR. GPR constructs a covariance matrix $K$ describing the relationships among all the data (including the white noise):
$$K = \begin{pmatrix} K_N + \sigma_n^2 I & K_{N*} \\ K_{*N} & K_{**} \end{pmatrix} \qquad (5)$$
where $K_N$ is the covariance matrix of the training data $X$; $K_{N*}$ and $K_{*N}$ are the covariance matrices between the training points $X$ and the point to be predicted; $K_{**}$ is the covariance matrix of the point to be predicted; $\sigma_n^2$ is the variance of the white noise added during modeling; and $I$ is the identity matrix. Each element of the covariance matrix is computed with a covariance function; here we employ the Gaussian kernel (also known as the radial basis function):
$$K_{ij} = k(x_i, x_j) = \sigma_f^2 \exp\!\left(-\frac{1}{2\sigma_l^2}\,\lVert x_i - x_j \rVert^2\right) + \delta_{ij}\,\sigma_n^2 \qquad (6)$$
where $\sigma_f$ and $\sigma_l$ are hyperparameters controlling the characteristics of the covariance function; $\delta_{ij}$ ensures that only diagonal elements ($i = j$) are affected by the noise term, reflecting that observation errors at different data points are uncorrelated; and $\lVert x_i - x_j \rVert$ is the Euclidean distance between two data points. Once the hyperparameters have been estimated by maximizing the marginal likelihood, the mean $\mu_*$ and variance $\Sigma_*$ of the prediction are obtained by conditioning the joint Gaussian distribution on the observed data:
$$\mu_* = K_{*N}\,\bigl(K_N + \sigma_n^2 I\bigr)^{-1} y \qquad (7)$$
$$\Sigma_* = K_{**} - K_{*N}\,\bigl(K_N + \sigma_n^2 I\bigr)^{-1} K_{N*} + \sigma_n^2 I \qquad (8)$$
In practical applications, the prediction is often represented by the mean $\mu_*$.
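The prediction equations above can be implemented directly in NumPy. In this sketch the hyperparameters $\sigma_f$, $\sigma_l$, and $\sigma_n$ are fixed by hand for clarity; in practice they would be estimated by maximizing the marginal likelihood.

```python
# From-scratch sketch of the GPR prediction equations (Eqs. (5)-(8))
# with fixed, illustrative hyperparameters.
import numpy as np

def rbf(A, B, sigma_f=1.0, sigma_l=0.2):
    """Gaussian (RBF) kernel matrix between the rows of A and B (noise-free part of Eq. (6))."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * np.maximum(d2, 0.0) / sigma_l**2)

def gpr_predict(X, y, X_star, sigma_n=0.1):
    """Posterior mean (Eq. (7)) and covariance (Eq. (8)) at the query points X_star."""
    K_N = rbf(X, X) + sigma_n**2 * np.eye(len(X))     # noisy training covariance
    K_sN = rbf(X_star, X)                             # K_{*N}
    K_ss = rbf(X_star, X_star)                        # K_{**}
    K_inv = np.linalg.inv(K_N)
    mu = K_sN @ K_inv @ y                                                   # Eq. (7)
    cov = K_ss - K_sN @ K_inv @ K_sN.T + sigma_n**2 * np.eye(len(X_star))   # Eq. (8)
    return mu, cov
```

On a densely sampled smooth function, the posterior mean interpolates the data closely while the posterior variance stays positive, as Equations (7) and (8) dictate.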
In the correction methods used in this paper’s validation, GPR is employed as the surrogate model. Specifically, for the comprehensive correction, a constant $\rho$ replaces the multiplicative correction term:
$$\hat{y}_h(x) = \rho\, \hat{y}_l(x) + \delta(x) \qquad (9)$$
where $\hat{y}_l(x)$ and $\delta(x)$ are surrogate models for the LF data and the discrepancy data, respectively. The value of $\rho$ is obtained by minimizing the following:
$$\min:\ \lVert \delta(x_h) - d_h \rVert^2 \qquad (10)$$
$$d_h = \rho\, \hat{y}_l(x_h) - y_h \qquad (11)$$
The solution for $\rho$ can be determined by least squares, minimizing Equation (10). Since Equation (10) is a quadratic function of $\rho$, the minimizing value is found by taking the derivative with respect to $\rho$ and setting it to zero.
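Concretely, if $\rho$ is chosen to minimize the squared residual $\lVert \rho\,\hat{y}_l(x_h) - y_h \rVert^2$, setting the derivative with respect to $\rho$ to zero yields the closed form $\rho = \langle \hat{y}_l, y_h \rangle / \langle \hat{y}_l, \hat{y}_l \rangle$. A short NumPy helper can compute this; it is a sketch of the least-squares step, and the paper's exact estimation code may differ.

```python
# Closed-form least-squares scale factor rho for the comprehensive correction:
# d/d(rho) ||rho * y_l_hat - y_h||^2 = 0  =>  rho = <y_l_hat, y_h> / <y_l_hat, y_l_hat>.
import numpy as np

def fit_rho(y_l_hat, y_h):
    """Least-squares scale between LF surrogate predictions and HF labels."""
    return float(np.dot(y_l_hat, y_h) / np.dot(y_l_hat, y_l_hat))
```

This is the same answer `np.linalg.lstsq` returns for the one-parameter problem, but without the general-purpose machinery.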
MF data learning methods have their own advantages and disadvantages. For example, when calculating $\rho(x)$ in the multiplicative correction, if the denominator $y_l(x)$ is zero, then $\rho(x)$ has no solution. Additive correction compensates for this deficiency, but, unsurprisingly, it is not always better than multiplicative correction, as shown in references [104,105]. The comprehensive correction method compensates for the shortcomings of both to a certain extent, but a major drawback is that it can only handle two fidelity levels whose data are linearly related, whereas in some cases [8,106] the relationship between LF and HF data is nonlinear. For this case, reference [21] decomposed the relationship between HF and LF data into linear and nonlinear parts and solved it using a neural network.
Next, we evaluate the performance of these three correction methods based on GPR.

3. Performance Validation of Existing Learning Algorithms

This section validates the three MF data learning methods described above. We artificially construct MF data and apply the additive, multiplicative, and comprehensive correction methods, respectively, to datasets containing additive noise, multiplicative noise, and mixed noise (i.e., both additive and multiplicative noise). We describe both the MF data acquisition and processing and the validation results.

3.1. Multi-Fidelity Data Acquisition and Processing

In this validation, MF datasets are constructed artificially. We assume an ideal function, Equation (12), to compute the true values:
$$Y_{true}(x) = \frac{p_1 \sin\!\left(p_2 x^2 + p_3\right)}{p_4 x^2 + p_5} + \frac{p_6}{p_7 x^3 + p_8} + p_9 x + p_{10} \qquad (12)$$
where $p_1$–$p_{10}$ are random parameters. We conducted a total of one hundred tests to validate our approach. For brevity, we present the details of one run, with the parameters in Equation (12) set to $P_{sample}$, to typify the behavior observed across the one hundred repetitions. The parameters selected are:
$$P_{sample} = [1, 5, 4, 5, 1, 2, 1, 0, 4, 2] \qquad (13)$$
The true value function corresponding to $P_{sample}$ is shown in Figure 4. We selected one of the runs for display in Figure 5, which includes the distribution of HF and LF data points, the true values, the HF data fitting model, and the different correction methods under different noises.
Within a certain range, this function exhibits wave-like behavior; in reality, we can assume the curve is obtained from a plane wave. Based on this assumption, we add additive noise, multiplicative noise, and mixed noise to the function for the different MF data learning methods. The LF input values for the validation were two hundred equidistant points from zero to one, denoted $X_l$. The HF input values, contained within the LF inputs, were twenty equidistant points between zero and one, denoted $X_h$. The response values of the HF and LF data are denoted $Y_h$ and $Y_l$, respectively:
$$Y_h = E_h^{\times}\, Y_{true} + E_h^{+} \qquad (14)$$
$$Y_l = E_l^{\times}\, Y_{true} + E_l^{+} \qquad (15)$$
where $E_h^{\times}$ and $E_l^{\times}$ denote the multiplicative noise for the HF and LF data, respectively, and $E_h^{+}$ and $E_l^{+}$ the corresponding additive noise. The additive noise in the HF data is relatively small, while that in the LF data is relatively large; the multiplicative noise in the HF data is close to one, while that in the LF data is far from one. The above is the construction formula for the MF data used in the comprehensive correction method. In Equations (14) and (15), setting $E_h^{\times}$ and $E_l^{\times}$ to one generates datasets containing only additive noise, and setting $E_h^{+}$ and $E_l^{+}$ to zero generates datasets with only multiplicative noise. During validation, for the datasets with additive noise, random numbers in [0, 1] are generated and multiplied by 0.4 and 0.8 to obtain $E_h^{+}$ and $E_l^{+}$, respectively. For multiplicative noise, numbers are drawn from [0.9, 1.1] for $E_h^{\times}$ and from [0.8, 1.3] for $E_l^{\times}$.
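The construction of Equations (14) and (15), with the noise ranges stated above, can be sketched as follows; `y_true` stands for any scalar true-value function such as Equation (12), and the helper name is ours, not the paper's.

```python
# Build one synthetic MF dataset following Eqs. (14)-(15) and the stated noise ranges.
import numpy as np

def make_mf_data(y_true, seed=0):
    """Return (X_h, Y_h) and (X_l, Y_l) with mixed multiplicative + additive noise."""
    rng = np.random.default_rng(seed)
    X_l = np.linspace(0, 1, 200)                 # 200 equidistant LF inputs
    X_h = X_l[::10]                              # 20 HF inputs contained in X_l
    E_h_mul = rng.uniform(0.9, 1.1, X_h.size)    # HF multiplicative noise, close to one
    E_l_mul = rng.uniform(0.8, 1.3, X_l.size)    # LF multiplicative noise, far from one
    E_h_add = 0.4 * rng.uniform(0, 1, X_h.size)  # HF additive noise, small
    E_l_add = 0.8 * rng.uniform(0, 1, X_l.size)  # LF additive noise, large
    Y_h = E_h_mul * y_true(X_h) + E_h_add        # Eq. (14)
    Y_l = E_l_mul * y_true(X_l) + E_l_add        # Eq. (15)
    return (X_h, Y_h), (X_l, Y_l)
```

Setting the multiplicative factors to one (or the additive terms to zero) recovers the additive-only (or multiplicative-only) datasets.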
The HF and LF datasets with the different added noises are randomly shuffled to form the training set, while the test set inputs are an additional twenty points from zero to one not included in the HF and LF datasets. The mean square error (MSE) is used for evaluation, and the experiment is repeated one hundred times with different random parameter arrays $P$ to calculate the mean MSE between each MF data learning method and the true values, and between the high-fidelity data model (HFDM) and the true values.
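A compact sketch of this protocol compares an HF-only model (HFDM) against the additive correction over a handful of random repeats. For brevity it uses additive noise only, far fewer repeats than the paper's one hundred, and no cross-validated hyperparameter search, so the numbers are only indicative.

```python
# Repeated-trial evaluation sketch: HFDM vs. additive-correction MF model.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.metrics import mean_squared_error

def one_trial(seed):
    rng = np.random.default_rng(seed)
    f = lambda x: np.sin(8 * x) + 2.0                    # stand-in for Y_true
    X_l = np.linspace(0, 1, 200)[:, None]
    X_h = X_l[::10]                                      # 20 HF inputs inside X_l
    y_l = f(X_l).ravel() + 0.8 * rng.uniform(0, 1, 200)  # LF: large additive noise
    y_h = f(X_h).ravel() + 0.4 * rng.uniform(0, 1, 20)   # HF: small additive noise
    X_t = rng.uniform(0, 1, (20, 1))                     # held-out test inputs

    kernel = RBF(0.1) + WhiteKernel(1e-2)
    hfdm = GaussianProcessRegressor(kernel=kernel).fit(X_h, y_h)
    lf = GaussianProcessRegressor(kernel=kernel).fit(X_l, y_l)
    delta = GaussianProcessRegressor(kernel=kernel).fit(X_h, y_h - lf.predict(X_h))
    y_ac = lf.predict(X_t) + delta.predict(X_t)          # additive correction
    return (mean_squared_error(f(X_t).ravel(), hfdm.predict(X_t)),
            mean_squared_error(f(X_t).ravel(), y_ac))

mses = np.array([one_trial(s) for s in range(5)])
mean_mse_hfdm, mean_mse_ac = mses.mean(axis=0)
```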

3.2. Validation Results

From Figure 5, the results obtained with the MF data learning methods in the range 0.55–0.75 are closer to the true values than those of the high-fidelity data model (HFDM). This is because the amount of HF data is small, resulting in a low-accuracy fitted model. The MF data learning methods, by contrast, use HF and LF data simultaneously: HF behavior at unknown locations can be approximated from the known LF data, so the fitting results are comparatively good.
The experiments were repeated one hundred times with different random parameter arrays $P$. In each model construction, 5-fold cross-validation [107,108] was used to find the model with the best parameters, and training and testing were carried out on that basis. The average MSE between each MF data learning method and the true values, as well as between the HFDM and the true values, was calculated; the results are presented in Table 3. For MF datasets with arbitrary noise, the comprehensive correction (CC) has the lowest MSE and the HFDM the highest. For MF datasets with only additive noise, the MSE of the additive correction (AC) is lower than that of the multiplicative correction (MC); for MF datasets with only multiplicative noise, the MSE of MC is lower than that of AC.
The experimental results show that the multi-fidelity data learning methods outperform modeling with high-fidelity data alone. Among the three correction methods, the comprehensive correction, which builds on both additive and multiplicative correction, performs best. We also evaluated with RMSE, MAE, $R^2$, and other metrics, finding that the multiplicative correction outperforms the additive correction only on datasets with multiplicative noise; on datasets with other noise types it does not outperform the other correction methods, whereas the comprehensive correction consistently yields the best results across all noise types. This further validates the advantage of the comprehensive correction method across the various noise datasets. In summary, MF data learning methods can enhance model accuracy when HF data are limited and LF data are abundant. When the noise in a multi-fidelity dataset is unknown, the comprehensive correction method obtains better results; the choice of a specific correction method largely depends on the MF dataset. Finally, we also applied two statistical tests, Shapiro–Wilk [109] and Levene [110], to the mean error (ME) of the four methods under different noises. Readers can find all the relevant code and results via the data availability statement of this paper.

4. Outlook and Limitations

In this paper, we focus on the application of multi-fidelity (MF) data in materials science, including methods for acquiring MF data, datasets in previous works, and machine learning algorithms. Additionally, we applied different machine learning methods to handle simulated MF data with different types of noise (additive, multiplicative, and mixed) as validation. Our results indicate that it is helpful to introduce a larger amount of lower fidelity data to work together with expensive higher fidelity data.
In an era where materials science is increasingly data driven, the transformation from a single-fidelity-data-driven approach to a multi-fidelity-data-driven paradigm seems inevitable. As part of this ongoing evolution, we propose the following outlook and limitations for future research on multi-fidelity data-driven methods:
  • Classic methods, such as the Gaussian process regression demonstrated in this study, will continue to play a pivotal role. While such methods proved effective here, their computational demands may limit their performance on larger datasets or higher-dimensional problems. For researchers, choosing the appropriate model and clearly understanding the algorithm and its principles remain indispensable.
  • Alongside hardware advancements, newer methods are surfacing that offer innovative ways to address challenges. Techniques such as large language models (LLMs) warrant close attention, as they could redefine how we handle multi-fidelity problems. At the same time, LLMs and similarly complex technologies may introduce challenges related to interpretability, computational resources, and data requirements.
  • Active learning strategies, especially when integrated with both experimental and theoretical multi-fidelity data, hold the potential for groundbreaking contributions. Machine learning models can determine not only material structures or features but also which levels of data fidelity are most informative for exploration. However, a pertinent issue is that active learning strategies necessitate high-quality and representative initial datasets. If the initial data lack quality or do not aptly represent the studied domain, it may impact the effectiveness of learning. Acquiring suitable multi-fidelity datasets is, thus, a critical concern for researchers.

Author Contributions

Validation, Z.W.; investigation, Z.W. and H.C.; resources, X.L., T.Y. and Y.H.; visualization, Z.W.; supervision, X.L. and T.Y.; funding acquisition, X.L., T.Y. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

The authors extend their sincere appreciation to the National Natural Science Foundation of China for supporting this work through project numbers 22203008, 22272009, and 22002008.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data generated in this study, along with the corresponding code and results, are available in our GitHub repository at https://github.com/1152041831/MFSM_experiment (accessed on 3 December 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, X.; De Breuck, P.P.; Wang, L.; Rignanese, G.M. A simple denoising approach to exploit multi-fidelity data for machine learning materials properties. NPJ Comput. Mater. 2022, 8, 233. [Google Scholar] [CrossRef]
  2. Chen, C.; Zuo, Y.; Ye, W.; Li, X.; Ong, S.P. Learning properties of ordered and disordered materials from multi-fidelity data. Nat. Comput. Sci. 2021, 1, 46–53. [Google Scholar] [CrossRef]
  3. Jarin, S.; Yuan, Y.; Zhang, M.; Hu, M.; Rana, M.; Wang, S.; Knibbe, R. Predicting the crystal structure and lattice parameters of the perovskite materials via different machine learning models based on basic atom properties. Crystals 2022, 12, 1570. [Google Scholar] [CrossRef]
  4. Zhao, X.; Liu, D.; Yan, X. Diameter Prediction of Silicon Ingots in the Czochralski Process Based on a Hybrid Deep Learning Model. Crystals 2022, 13, 36. [Google Scholar] [CrossRef]
  5. Gao, P.; Liu, Z.; Zhang, J.; Wang, J.A.; Henkelman, G. A Fast, Low-Cost and Simple Method for Predicting Atomic/Inter-Atomic Properties by Combining a Low Dimensional Deep Learning Model with a Fragment Based Graph Convolutional Network. Crystals 2022, 12, 1740. [Google Scholar] [CrossRef]
  6. Fernández-Godino, M.G.; Park, C.; Kim, N.H.; Haftka, R.T. Review of multi-fidelity models. arXiv 2016, arXiv:1609.07196. [Google Scholar]
  7. Forrester, A.I.; Sóbester, A.; Keane, A.J. Multi-fidelity optimization via surrogate modelling. Proc. R. Soc. A 2007, 463, 3251–3269. [Google Scholar] [CrossRef]
  8. Perdikaris, P.; Raissi, M.; Damianou, A.; Lawrence, N.D.; Karniadakis, G.E. Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proc. R. Soc. A 2017, 473, 20160751. [Google Scholar] [CrossRef]
  9. Tran, A.; Tranchida, J.; Wildey, T.; Thompson, A.P. Multi-fidelity machine-learning with uncertainty quantification and Bayesian optimization for materials design: Application to ternary random alloys. J. Chem. Phys. 2020, 153, 074705. [Google Scholar] [CrossRef]
  10. Li, J.; Yang, A.; Tian, C.; Ye, L.; Chen, B. Multi-fidelity Bayesian algorithm for antenna optimization. J. Syst. Eng. Electron. 2022, 33, 1119–1126. [Google Scholar]
  11. Brevault, L.; Balesdent, M.; Hebbal, A. Overview of Gaussian process based multi-fidelity techniques with variable relationship between fidelities, application to aerospace systems. Aerospace Sci. Technol. 2020, 107, 106339. [Google Scholar] [CrossRef]
  12. Romor, F.; Tezzele, M.; Mrosek, M.; Othmer, C.; Rozza, G. Multi-fidelity data fusion through parameter space reduction with applications to automotive engineering. Int. J. Numer. Methods Eng. 2023, 124, 5293–5311. [Google Scholar] [CrossRef]
  13. Absi, G.N.; Mahadevan, S. Multi-fidelity approach to dynamics model calibration. Mech. Syst. Signal Process. 2016, 68, 189–206. [Google Scholar] [CrossRef]
  14. Pilania, G.; Gubernatis, J.E.; Lookman, T. Multi-fidelity machine learning models for accurate bandgap predictions of solids. Comput. Mater. Sci. 2017, 129, 156–163. [Google Scholar] [CrossRef]
  15. Patra, A.; Batra, R.; Chandrasekaran, A.; Kim, C.; Huan, T.D.; Ramprasad, R. A multi-fidelity information-fusion approach to machine learn and predict polymer bandgap. Comput. Mater. Sci. 2020, 172, 109286. [Google Scholar] [CrossRef]
  16. Polak, M.P.; Jacobs, R.; Mannodi-Kanakkithodi, A.; Chan, M.K.; Morgan, D. Machine learning for impurity charge-state transition levels in semiconductors from elemental properties using multi-fidelity datasets. J. Chem. Phys. 2022, 156, 114110. [Google Scholar] [CrossRef] [PubMed]
  17. Egorova, O.; Hafizi, R.; Woods, D.C.; Day, G.M. Multifidelity statistical machine learning for molecular crystal structure prediction. J. Phys. Chem. A 2020, 124, 8065–8078. [Google Scholar] [CrossRef] [PubMed]
  18. Mannodi-Kanakkithodi, A.; Toriyama, M.Y.; Sen, F.G.; Davis, M.J.; Klie, R.F.; Chan, M.K. Machine-learned impurity level prediction for semiconductors: The example of Cd-based chalcogenides. NPJ Comput. Mater. 2020, 6, 39. [Google Scholar] [CrossRef]
  19. Greenman, K.P.; Green, W.H.; Gómez-Bombarelli, R. Multi-fidelity prediction of molecular optical peaks with deep learning. Chem. Sci. 2022, 13, 1152–1162. [Google Scholar] [CrossRef]
  20. Khatamsaz, D.; Molkeri, A.; Couperthwaite, R.; James, J.; Arróyave, R.; Srivastava, A.; Allaire, D. Adaptive active subspace-based efficient multifidelity materials design. Mater. Des. 2021, 209, 110001. [Google Scholar] [CrossRef]
  21. Sun, G.; Li, G.; Stone, M.; Li, Q. A two-stage multi-fidelity optimization procedure for honeycomb-type cellular materials. Comput. Mater. Sci. 2010, 49, 500–511. [Google Scholar] [CrossRef]
  22. Islam, M.; Thakur, M.S.H.; Mojumder, S.; Hasan, M.N. Extraction of material properties through multi-fidelity deep learning from molecular dynamics simulation. Comput. Mater. Sci. 2021, 188, 110187. [Google Scholar] [CrossRef]
  23. Razi, M.; Narayan, A.; Kirby, R.; Bedrov, D. Fast predictive models based on multi-fidelity sampling of properties in molecular dynamics simulations. Comput. Mater. Sci. 2018, 152, 125–133. [Google Scholar] [CrossRef]
  24. Batra, R.; Pilania, G.; Uberuaga, B.P.; Ramprasad, R. Multifidelity information fusion with machine learning: A case study of dopant formation energies in hafnia. ACS Appl. Mater. Interfaces 2019, 11, 24906–24918. [Google Scholar] [CrossRef] [PubMed]
  25. Lamberti, G.; Gorlé, C. A multi-fidelity machine learning framework to predict wind loads on buildings. J. Wind Eng. Ind. Aerodyn. 2021, 214, 104647. [Google Scholar] [CrossRef]
  26. Nagawkar, J.; Leifsson, L. Multifidelity aerodynamic flow field prediction using random forest-based machine learning. Aerospace Sci. Technol. 2022, 123, 107449. [Google Scholar] [CrossRef]
  27. Kou, J.; Zhang, W. Multi-fidelity modeling framework for nonlinear unsteady aerodynamics of airfoils. Appl. Math. Model. 2019, 76, 832–855. [Google Scholar] [CrossRef]
  28. Thelen, A.S.; Leifsson, L.T.; Beran, P.S. Aeroelastic flutter prediction using multifidelity modeling of the generalized aerodynamic influence coefficients. AIAA J. 2020, 58, 4764–4780. [Google Scholar] [CrossRef]
  29. Thelen, A.S.; Leifsson, L.T.; Beran, P.S. Multifidelity flutter prediction using regression cokriging with adaptive sampling. J. Fluids Struct. 2020, 97, 103081. [Google Scholar] [CrossRef]
  30. Singh, D.; Antoniadis, A.F.; Tsoutsanis, P.; Shin, H.S.; Tsourdos, A.; Mathekga, S.; Jenkins, K.W. A multi-fidelity approach for aerodynamic performance computations of formation flight. Aerospace 2018, 5, 66. [Google Scholar] [CrossRef]
  31. Ariyarit, A.; Kanazaki, M. Multi-fidelity multi-objective efficient global optimization applied to airfoil design problems. Appl. Sci. 2017, 7, 1318. [Google Scholar] [CrossRef]
  32. Huang, L.; Gao, Z.; Zhang, D. Research on multi-fidelity aerodynamic optimization methods. Chin. J. Aeronaut. 2013, 26, 279–286. [Google Scholar] [CrossRef]
  33. Elham, A. Adjoint quasi-three-dimensional aerodynamic solver for multi-fidelity wing aerodynamic shape optimization. Aerospace Sci. Technol. 2015, 41, 241–249. [Google Scholar] [CrossRef]
  34. Li, K.; Kou, J.; Zhang, W. Deep learning for multifidelity aerodynamic distribution modeling from experimental and simulation data. AIAA J. 2022, 60, 4413–4427. [Google Scholar] [CrossRef]
  35. Ryou, G.; Tal, E.; Karaman, S. Multi-fidelity black-box optimization for time-optimal quadrotor maneuvers. Int. J. Robot. Res. 2021, 40, 1352–1369. [Google Scholar] [CrossRef]
  36. Liu, J.; Bian, X.; Xu, G.; Li, S.; Wang, J. Optimal Design of Nozzle Based on Multi-fidelity Surrogate Model. Adv. Aeronaut. Sci. Eng. 2022, 13, 29–39. [Google Scholar] [CrossRef]
  37. Brooks, C.J.; Forrester, A.; Keane, A.; Shahpar, S. Multi-fidelity design optimisation of a transonic compressor rotor. In Proceedings of the 9th European Conference Turbomachinery Fluid Dynamics and Thermodynamics, Istanbul, Turkey, 21–25 March 2011; pp. 1–10. [Google Scholar]
  38. Shah, H.; Hosder, S.; Koziel, S.; Tesfahunegn, Y.A.; Leifsson, L. Multi-fidelity robust aerodynamic design optimization under mixed uncertainty. Aerospace Sci. Technol. 2015, 45, 17–29. [Google Scholar] [CrossRef]
  39. Lai, X.; He, X.; Wang, S.; Wang, X.; Sun, W.; Song, X. Building a Lightweight Digital Twin of a Crane Boom for Structural Safety Monitoring Based on a Multifidelity Surrogate Model. J. Mech. Des. 2022, 144, 064502. [Google Scholar] [CrossRef]
  40. Shi, R.; Liu, L.; Long, T.; Wu, Y.; Gary Wang, G. Multi-fidelity modeling and adaptive co-kriging-based optimization for all-electric geostationary orbit satellite systems. J. Mech. Des. 2020, 142, 021404. [Google Scholar] [CrossRef]
  41. Jacobs, J.P.; Koziel, S. Cost-effective global surrogate modeling of planar microwave filters using multi-fidelity bayesian support vector regression. Int. J. RF Microw. Comput.-Aided Eng. 2014, 24, 11–17. [Google Scholar] [CrossRef]
  42. Kim, H.S.; Koc, M.; Ni, J. A hybrid multi-fidelity approach to the optimal design of warm forming processes using a knowledge-based artificial neural network. Int. J. Mach. Tools Manuf. 2007, 47, 211–222. [Google Scholar] [CrossRef]
  43. Jin, S.S.; Kim, S.T.; Park, Y.H. Combining point and distributed strain sensor for complementary data-fusion: A multi-fidelity approach. Mech. Syst. Signal Process. 2021, 157, 107725. [Google Scholar] [CrossRef]
  44. Abdallah, I.; Lataniotis, C.; Sudret, B. Hierarchical Kriging for multi-fidelity aero-servo-elastic simulators—Application to extreme loads on wind turbines. arXiv 2017, arXiv:1709.07637. [Google Scholar] [CrossRef]
  45. Bu, H.; Yang, Y.; Song, L.; Li, J. Improving the Film Cooling Performance of a Turbine Endwall With Multi-Fidelity Modeling Considering Conjugate Heat Transfer. J. Turbomach. 2022, 144, 011011. [Google Scholar] [CrossRef]
  46. Mell, L.; Rey, V.; Schoefs, F. Multifidelity adaptive kriging metamodel based on discretization error bounds. Int. J. Numer. Methods Eng. 2020, 121, 4566–4583. [Google Scholar] [CrossRef]
  47. Koziel, S.; Pietrenko-Dabrowska, A. Accelerated gradient-based optimization of antenna structures using multifidelity simulations and convergence-based model management scheme. IEEE Trans. Antennas Propag. 2021, 69, 8778–8789. [Google Scholar] [CrossRef]
  48. Palar, P.S.; Shimoyama, K. Multi-Fidelity Uncertainty Analysis in CFD Using Hierarchical Kriging. In Proceedings of the 35th AIAA Applied Aerodynamics Conference, Denver, CO, USA, 5–9 June 2017; p. 3261. [Google Scholar]
  49. Parussini, L.; Venturi, D.; Perdikaris, P.; Karniadakis, G.E. Multi-fidelity Gaussian process regression for prediction of random fields. J. Comput. Phys. 2017, 336, 36–50. [Google Scholar] [CrossRef]
  50. Qiu, Y.; Song, J.; Liu, Z. A simulation optimisation on the hierarchical health care delivery system patient flow based on multi-fidelity models. Int. J. Prod. Res. 2016, 54, 6478–6493. [Google Scholar] [CrossRef]
  51. Sajjadinia, S.S.; Carpentieri, B.; Shriram, D.; Holzapfel, G.A. Multi-fidelity surrogate modeling through hybrid machine learning for biomechanical and finite element analysis of soft tissues. Comput. Biol. Med. 2022, 148, 105699. [Google Scholar] [CrossRef]
  52. Biehler, J.; Gee, M.W.; Wall, W.A. Towards efficient uncertainty quantification in complex and large-scale biomechanical problems based on a Bayesian multi-fidelity scheme. Biomech. Model. Mechanobiol. 2015, 14, 489–513. [Google Scholar] [CrossRef]
  53. Panda, K.; King, R.; Maack, J.; Satkauskas, I.; Potter, K. Visualization of Multi-Fidelity Approximations of Stochastic Economic Dispatch. In Proceedings of the Twelfth ACM International Conference on Future Energy Systems, Virtual Event, 28 June–2 July 2021; pp. 372–376. [Google Scholar]
  54. Perdew, J.P.; Levy, M. Physical content of the exact Kohn-Sham orbital energies: Band gaps and derivative discontinuities. Phys. Rev. Lett. 1983, 51, 1884–1887. [Google Scholar] [CrossRef]
  55. Hautier, G.; Ong, S.P.; Jain, A.; Moore, C.J.; Ceder, G. Accuracy of density functional theory in predicting formation energies of ternary oxides from binary oxides and its implication on phase stability. Phys. Rev. B 2012, 85, 155208. [Google Scholar] [CrossRef]
  56. Bartel, C.J.; Weimer, A.W.; Lany, S.; Musgrave, C.B.; Holder, A.M. The role of decomposition reactions in assessing first-principles predictions of solid stability. NPJ Comput. Mater. 2019, 5, 4. [Google Scholar] [CrossRef]
  57. Morales-García, Á.; Valero, R.; Illas, F. An empirical, yet practical way to predict the band gap in solids by using density functional band structure calculations. J. Phys. Chem. C 2017, 121, 18862–18866. [Google Scholar] [CrossRef]
  58. Kohn, W.; Sham, L.J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 1965, 140, A1133–A1138. [Google Scholar] [CrossRef]
  59. Perdew, J.P.; Burke, K.; Ernzerhof, M. Generalized gradient approximation made simple. Phys. Rev. Lett. 1996, 77, 3865–3868. [Google Scholar] [CrossRef]
  60. Adamo, C.; Barone, V. Toward chemical accuracy in the computation of NMR shieldings: The PBE0 model. Chem. Phys. Lett. 1998, 298, 113–119. [Google Scholar] [CrossRef]
  61. Heyd, J.; Scuseria, G.E.; Ernzerhof, M. Hybrid functionals based on a screened Coulomb potential. J. Chem. Phys. 2003, 118, 8207–8215. [Google Scholar] [CrossRef]
  62. Jie, J.; Weng, M.; Li, S.; Chen, D.; Li, S.; Xiao, W.; Zheng, J.; Pan, F.; Wang, L. A new MaterialGo database and its comparison with other high-throughput electronic structure databases for their predicted energy band gaps. Sci. China Technol. Sci. 2019, 62, 1423–1430. [Google Scholar] [CrossRef]
  63. Thompson, A.P.; Swiler, L.P.; Trott, C.R.; Foiles, S.M.; Tucker, G.J. Spectral neighbor analysis method for automated generation of quantum-accurate interatomic potentials. J. Comput. Phys. 2015, 285, 316–330. [Google Scholar] [CrossRef]
  64. Forrester, A.I.; Bressloff, N.W.; Keane, A.J. Optimization using surrogate models and partially converged computational fluid dynamics simulations. Proc. R. Soc. A 2006, 462, 2177–2204. [Google Scholar] [CrossRef]
  65. Jain, A.; Ong, S.P.; Hautier, G.; Chen, W.; Richards, W.D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002. [Google Scholar] [CrossRef]
  66. Kirklin, S.; Saal, J.E.; Meredig, B.; Thompson, A.; Doak, J.W.; Aykol, M.; Rühl, S.; Wolverton, C. The Open Quantum Materials Database (OQMD): Assessing the accuracy of DFT formation energies. NPJ Comput. Mater. 2015, 1, 15010. [Google Scholar] [CrossRef]
  67. Hong, J. The Band Gap Problem: The State of the Art of First-Principles Electronic Band Structure Theory. Prog. Chem. 2012, 24, 910–927. [Google Scholar]
  68. Xiaotong, L.; Ziming, W.; Jiahua, O.; Tao, Y. A Quantitative Noise Method to Evaluate Machine Learning Algorithm on Multi-Fidelity Data. J. Chin. Ceram. Soc. 2023, 51, 405–410. [Google Scholar]
  69. Joung, J.F.; Han, M.; Jeong, M.; Park, S. Experimental database of optical properties of organic compounds. Sci. Data 2020, 7, 295. [Google Scholar] [CrossRef] [PubMed]
  70. Beard, E.J.; Sivaraman, G.; Vázquez-Mayagoitia, Á.; Vishwanath, V.; Cole, J.M. Comparative dataset of experimental and computational attributes of UV/vis absorption spectra. Sci. Data 2019, 6, 307. [Google Scholar] [CrossRef] [PubMed]
  71. Venkatraman, V.; Raju, R.; Oikonomopoulos, S.P.; Alsberg, B.K. The dye-sensitized solar cell database. J. Cheminf. 2018, 10, 18. [Google Scholar] [CrossRef]
  72. Ju, C.W.; Bai, H.; Li, B.; Liu, R. Machine learning enables highly accurate predictions of photophysical properties of organic fluorescent materials: Emission wavelengths and quantum yields. J. Chem. Inf. Model. 2021, 61, 1053–1065. [Google Scholar] [CrossRef]
  73. Venkatraman, V.; Kallidanthiyil Chellappan, L. An open access data set highlighting aggregation of dyes on metal oxides. Data 2020, 5, 45. [Google Scholar] [CrossRef]
  74. Talrose, V.; Yermakov, A.N.; Usov, A.A.; Goncharova, A.A.; Leskin, A.N.; Messineva, N.A.; Trusova, N.V.; Efimkina, M.V. NIST Chemistry WebBook. 2022. Available online: https://webbook.nist.gov/chemistry/ (accessed on 3 December 2023).
  75. Mayr, T. Fluorophores.org. Available online: http://www.fluorophores.tugraz.at/substance/ (accessed on 3 December 2023).
  76. Taniguchi, M.; Lindsey, J.S. Database of absorption and fluorescence spectra of >300 common compounds for use in photochem CAD. Photochem. Photobiol. 2018, 94, 290–327. [Google Scholar] [CrossRef] [PubMed]
  77. Noelle, A.; Vandaele, A.C.; Martin-Torres, J.; Yuan, C.; Rajasekhar, B.N.; Fahr, A.; Hartmann, G.K.; Lary, D.; Lee, Y.P.; Limão-Vieira, P.; et al. UV/Vis+ photochemistry database: Structure, content and applications. J. Quant. Spectrosc. Radiat. Transf. 2020, 253, 107056. [Google Scholar] [CrossRef] [PubMed]
  78. Lazzara, D.; Haimes, R.; Willcox, K. Multifidelity Geometry and Analysis in Aircraft Conceptual Design. In Proceedings of the 19th AIAA Computational Fluid Dynamics, San Antonio, TX, USA, 22–25 June 2009; p. 3806. [Google Scholar] [CrossRef]
  79. Joly, M.M.; Verstraete, T.; Paniagua, G. Integrated multifidelity, multidisciplinary evolutionary design optimization of counterrotating compressors. Integr. Comput.-Aided Eng. 2014, 21, 249–261. [Google Scholar] [CrossRef]
  80. Durantin, C.; Rouxel, J.; Désidéri, J.A.; Glière, A. Multifidelity surrogate modeling based on radial basis functions. Struct. Multidiscip. Optim. 2017, 56, 1061–1075. [Google Scholar] [CrossRef]
  81. Myers, R.H.; Montgomery, D.C.; Anderson-Cook, C.M. Response Surface Methodology: Process and Product Optimization Using Designed Experiments; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar]
  82. Vitali, R.; Haftka, R.T.; Sankar, B.V. Multi-fidelity design of stiffened composite panel with a crack. Struct. Multidiscip. Optim. 2002, 23, 347–356. [Google Scholar] [CrossRef]
  83. Goel, T.; Hafkta, R.T.; Shyy, W. Comparing error estimation measures for polynomial and kriging approximation of noise-free functions. Struct. Multidiscip. Optim. 2009, 38, 429–442. [Google Scholar] [CrossRef]
  84. Le Gratiet, L.; Cannamela, C. Cokriging-based sequential design strategies using fast cross-validation techniques for multi-fidelity computer codes. Technometrics 2015, 57, 418–427. [Google Scholar] [CrossRef]
  85. Jacobs, J.P.; Koziel, S.; Ogurtsov, S. Computationally efficient multi-fidelity Bayesian support vector regression modeling of planar antenna input characteristics. IEEE Trans. Antennas Propag. 2012, 61, 980–984. [Google Scholar] [CrossRef]
  86. Kennedy, M.C.; O’Hagan, A. Bayesian calibration of computer models. J. R. Stat. Soc. B 2001, 63, 425–464. [Google Scholar] [CrossRef]
  87. Schulz, E.; Speekenbrink, M.; Krause, A. A tutorial on Gaussian process regression: Modelling, exploring, and exploiting functions. J. Math. Psychol. 2018, 85, 1–16. [Google Scholar] [CrossRef]
  88. MacKay, D.J. Introduction to Gaussian processes. NATO ASI Ser. F Comput. Syst. Sci. 1998, 168, 133–166. [Google Scholar]
  89. Takeno, S.; Tsukada, Y.; Fukuoka, H.; Koyama, T.; Shiga, M.; Karasuyama, M. Cost-effective search for lower-error region in material parameter space using multifidelity Gaussian process modeling. Phys. Rev. Mater. 2020, 4, 083802. [Google Scholar] [CrossRef]
  90. Perdikaris, P.; Venturi, D.; Royset, J.O.; Karniadakis, G.E. Multi-fidelity modelling via recursive co-kriging and Gaussian–Markov random fields. Proc. R. Soc. A 2015, 471, 20150018. [Google Scholar] [CrossRef] [PubMed]
  91. Deringer, V.L.; Bartók, A.P.; Bernstein, N.; Wilkins, D.M.; Ceriotti, M.; Csányi, G. Gaussian process regression for materials and molecules. Chem. Rev. 2021, 121, 10073–10141. [Google Scholar] [CrossRef] [PubMed]
  92. Shi, M.; Lv, L.; Sun, W.; Song, X. A multi-fidelity surrogate model based on support vector regression. Struct. Multidiscip. Optim. 2020, 61, 2363–2375. [Google Scholar] [CrossRef]
  93. Granichin, O. Linear regression and filtering under nonstandard assumptions (Arbitrary noise). IEEE Trans. Autom. Control 2004, 49, 1830–1837. [Google Scholar] [CrossRef]
  94. Noack, M.M.; Doerk, G.S.; Li, R.; Streit, J.K.; Vaia, R.A.; Yager, K.G.; Fukuto, M. Autonomous materials discovery driven by Gaussian process regression with inhomogeneous measurement noise and anisotropic kernels. Sci. Rep. 2020, 10, 17663. [Google Scholar] [CrossRef] [PubMed]
  95. Keane, A.J. Cokriging for robust design optimization. AIAA J. 2012, 50, 2351–2364. [Google Scholar] [CrossRef]
  96. Park, C.; Haftka, R.T.; Kim, N.H. Remarks on multi-fidelity surrogates. Struct. Multidiscip. Optim. 2017, 55, 1029–1050. [Google Scholar] [CrossRef]
  97. Lewis, R.M.; Nash, S.G. Model problems for the multigrid optimization of systems governed by differential equations. SIAM J. Sci. Comput. 2005, 26, 1811–1837. [Google Scholar] [CrossRef]
  98. Huang, B.; Symonds, N.O.; Lilienfeld, O.A.v. Handbook of Materials Modeling: Methods: Theory and Modeling; Andreoni, W., Yip, S., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 1–27. [Google Scholar]
  99. Ramakrishnan, R.; Dral, P.O.; Rupp, M.; Von Lilienfeld, O.A. Big data meets quantum chemistry approximations: The Δ-machine learning approach. J. Chem. Theory Comput. 2015, 11, 2087–2096. [Google Scholar] [CrossRef] [PubMed]
  100. Alexandrov, N.M.; Dennis, J.E., Jr.; Lewis, R.M.; Torczon, V. A trust-region framework for managing the use of approximation models in optimization. Struct. Optim. 1998, 15, 16–23. [Google Scholar] [CrossRef]
  101. Zhang, Y.; Kim, N.H.; Park, C.; Haftka, R.T. Multifidelity surrogate based on single linear regression. AIAA J. 2018, 56, 4944–4952. [Google Scholar] [CrossRef]
  102. Williams, C.; Rasmussen, C. Gaussian processes for regression. Adv. Neural Inf. Process. Syst. 1995, 8, 514–520. [Google Scholar]
  103. Seeger, M. Gaussian processes for machine learning. Int. J. Neural Syst. 2004, 14, 69–106. [Google Scholar] [CrossRef]
  104. Gano, S.; Sanders, B.; Renaud, J. Variable Fidelity Optimization Using a Kriging Based Scaling Function. In Proceedings of the 10th AIAA/ISSMO MDAO Conference, Albany, NY, USA, 30 August–1 September 2004; p. 4460. [Google Scholar] [CrossRef]
  105. Gano, S.E.; Renaud, J.E.; Martin, J.D.; Simpson, T.W. Update strategies for kriging models used in variable fidelity optimization. Struct. Multidisc. Optim. 2006, 32, 287–298. [Google Scholar] [CrossRef]
  106. Babaee, H.; Perdikaris, P.; Chryssostomidis, C.; Karniadakis, G. Multi-fidelity modelling of mixed convection based on experimental correlations and numerical simulations. J. Fluid Mech. 2016, 809, 895–917. [Google Scholar] [CrossRef]
  107. Rao, R.B.; Fung, G.; Rosales, R. On the dangers of cross-validation. An experimental evaluation. In Proceedings of the 2008 SIAM International Conference on Data Mining, Atlanta, Georgia, 24–26 April 2008; pp. 588–596. [Google Scholar] [CrossRef]
  108. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the IJCAI, Montreal, QC, Canada, 20–25 August 1995; Volume 14, pp. 1137–1145. [Google Scholar]
  109. Shapiro, S.S.; Wilk, M.B. An analysis of variance test for normality (complete samples). Biometrika 1965, 52, 591–611. [Google Scholar] [CrossRef]
  110. Levene, H. Robust tests for equality of variances. Contrib. Probab. Stat. 1960, 278–292. [Google Scholar]
Figure 1. Multi-fidelity data in different fields.
Figure 2. Distribution of the band gap obtained by different functionals in MP. Four functionals were included: PBE (52,348 samples), HSE (6030 samples), GLLB (2290 samples), and SCAN (472 samples). PBE accounted for the largest proportion of 85.6%, followed by HSE (9.9%), GLLB (3.7%), and SCAN (0.8%).
Figure 3. The existing datasets of the experimental ultraviolet/visible spectroscopic properties. There are nine datasets in the figure, and the numbers are as follows in descending order: Deep4Chem (20,236 samples), ChemDataExtractor (8467 samples), DSSCDB (5178 samples), ChemFlour (4386 samples), Dye aggregation (4043 samples), NIST (2306 samples), Fluorophores.org (955 samples), PhotochemCAD (552 samples), and UV/Vis+ (112 samples).
Figure 4. All true value functions. (a) One hundred true value functions used in the performance validation; the bolded curve is the function with the parameters in Equation (13). (b) Enlarged plot of the highlighted true value function.
Figure 5. The results of one validation among one hundred tests with the parameters in Equation (13). (a,c,e) show the distributions of the multi-fidelity datasets under additive, multiplicative, and mixed noise, respectively, where X_l denotes the low-fidelity data (blue dots) and X_h the high-fidelity data (red triangles). (b,d,f) show the curves of the three correction methods, the true values, and the high-fidelity model under the corresponding noise: the true values are blue dashed lines; HFDM denotes the high-fidelity data fitting model (fitted with GPR, red dotted line); and AC, MC, and CC denote the additive, multiplicative, and comprehensive correction methods (solid orange, green, and yellow lines, respectively). The rightmost panel in each row is the corresponding enlarged view.
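The noisy multi-fidelity datasets used in the validation can be mimicked with a small generator; the function `make_mf_dataset`, its noise scales, and the toy target below are our own illustrative assumptions, not the exact recipe behind Equation (13):

```python
import numpy as np

rng = np.random.default_rng(42)

def make_mf_dataset(f, n_lf=50, n_hf=10, noise="additive", scale=0.3):
    """Toy multi-fidelity dataset: HF points are clean f(x); LF points carry
    additive, multiplicative, or mixed noise. Illustrative construction only."""
    x_lf = rng.uniform(0.0, 1.0, n_lf)
    x_hf = rng.uniform(0.0, 1.0, n_hf)
    y = f(x_lf)
    if noise == "additive":          # y_lf = f(x) + eps
        y_lf = y + rng.normal(0.0, scale, n_lf)
    elif noise == "multiplicative":  # y_lf = f(x) * (1 + eps)
        y_lf = y * (1.0 + rng.normal(0.0, scale, n_lf))
    elif noise == "mixed":           # both effects at once
        y_lf = (y * (1.0 + rng.normal(0.0, scale, n_lf))
                + rng.normal(0.0, scale, n_lf))
    else:
        raise ValueError(f"unknown noise type: {noise}")
    return (x_lf, y_lf), (x_hf, f(x_hf))

truth = lambda x: np.sin(8.0 * x) * x
(x_l, y_l), (x_h, y_h) = make_mf_dataset(truth, noise="mixed")
```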
Table 1. Different acquisition methods of MF data in different fields, including their respective references and citation counts (as of 3 December 2023).

| Field | Fidelity Source | Acquisition Method | References (Citations) |
|---|---|---|---|
| Materials Science | Different algorithms | functional | [14,15,16,17,18] (273, 60, 4, 43, 25) |
| | | experiment | [19] (18) |
| | | functional and experiment | [1,2] (1, 71) |
| | | physical models | [20] (14) |
| | | functional and physical models | [9] (55) |
| | Different hyperparameters | mesh size | [21] (145) |
| | | time step | [22,23] (21, 23) |
| | | different hyperparameters | [24] (35) |
| Aerospace Science | Different algorithms | CFD methodology | [25,26,27,28,29,30,31,32] (52, 19, 12, 40, 3, 6, 5, 20) |
| | | CFD tool | [33] (16) |
| | | CFD and experiment | [34] (14) |
| | | multicopter dynamics and experiment | [35] (35) |
| | Different hyperparameters | mesh size | [36,37,38] (1, 60, 55) |
| Mechanical Engineering | Different algorithms | mathematical models | [12,39,40,41,42] (17, 5, 40, 5, 35) |
| | | measurement tools | [43,44] (43, 11) |
| | Different hyperparameters | mesh size | [45,46,47] (2, 1, 13) |
| | | convergence condition and simulated objects | [10] (1) |
| | | iteration limit | [48] (25) |
| Others | Different algorithms | mathematical models | [49] (104) |
| | | mathematical models and simulation tools | [50] (27) |
| | Different hyperparameters | main or simplified numerical model | [51] (1) |
| | | mesh size | [52] (66) |
| | | physical constraints | [53] (1) |
Table 2. Distribution of high/low fidelity (HF/LF) data acquisition methods in the materials field.

| Reference | LDA [58] | PBE [59] | PBE0 [60] | HSE [61,62] | Potential [63] |
|---|---|---|---|---|---|
| [9] | | HF | | | LF |
| [14,15] | | LF | | HF | |
| [16] | LF | LF | | HF | |
| [17] | | LF | HF | | |
| [18] | | LF | | HF | |
Table 3. Average MSE of different methods on datasets with different noises; the lowest values are highlighted in bold.

| Method | Additive Noise | Multiplicative Noise | Mixed Noise |
|---|---|---|---|
| HFDM | 2.545 | 2.531 | 2.389 |
| AC | 0.065 | 0.753 | 0.501 |
| MC | 0.608 | 0.609 | 1.535 |
| CC | **0.057** | **0.436** | **0.425** |
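The four methods compared in Table 3 can be sketched roughly as follows: HFDM fits only the high-fidelity data; AC learns an additive discrepancy δ(x); MC learns a multiplicative factor ρ(x); and CC combines a scalar scaling with an additive residual. This is a minimal illustration using scikit-learn GPR on a hypothetical Forrester-style pair of fidelities, not the paper's exact implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical pair of fidelities (Forrester-style): LF is a cheap, biased
# surrogate of HF. These functions are illustrative, not the paper's dataset.
f_hf = lambda x: (6 * x - 2) ** 2 * np.sin(12 * x - 4)
f_lf = lambda x: 0.5 * f_hf(x) + 10 * (x - 0.5) - 5

X_lf = np.linspace(0, 1, 21).reshape(-1, 1)  # plentiful low-fidelity samples
X_hf = np.linspace(0, 1, 5).reshape(-1, 1)   # scarce high-fidelity samples
y_lf, y_hf = f_lf(X_lf).ravel(), f_hf(X_hf).ravel()

# HFDM: a model fitted on the high-fidelity data alone.
hfdm = GaussianProcessRegressor(normalize_y=True).fit(X_hf, y_hf)

# A low-fidelity base model shared by all three correction strategies.
lf_model = GaussianProcessRegressor(normalize_y=True).fit(X_lf, y_lf)

# AC (additive correction):        y_hf(x) ~ y_lf(x) + delta(x)
delta = GaussianProcessRegressor(normalize_y=True).fit(
    X_hf, y_hf - lf_model.predict(X_hf))
ac = lambda x: lf_model.predict(x) + delta.predict(x)

# MC (multiplicative correction):  y_hf(x) ~ rho(x) * y_lf(x)
rho = GaussianProcessRegressor(normalize_y=True).fit(
    X_hf, y_hf / lf_model.predict(X_hf))
mc = lambda x: rho.predict(x) * lf_model.predict(x)

# CC (comprehensive correction):   y_hf(x) ~ rho0 * y_lf(x) + delta(x),
# with a scalar rho0 from least squares and a GPR on the remaining residual.
rho0 = np.polyfit(lf_model.predict(X_hf), y_hf, 1)[0]
resid = GaussianProcessRegressor(normalize_y=True).fit(
    X_hf, y_hf - rho0 * lf_model.predict(X_hf))
cc = lambda x: rho0 * lf_model.predict(x) + resid.predict(x)
```

Because CC can absorb both an offset and a scale mismatch between fidelities, it is the natural choice when the noise type is unknown or mixed, consistent with the pattern in Table 3.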

