Article

Learning Effective Good Variables from Physical Data

Department of Energy, Politecnico di Torino, C.so Duca degli Abruzzi 24, 10129 Torino, Italy
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2024, 6(3), 1597-1618; https://doi.org/10.3390/make6030077
Submission received: 6 May 2024 / Revised: 1 July 2024 / Accepted: 9 July 2024 / Published: 12 July 2024

Abstract

We assume that a sufficiently large database is available, where a physical property of interest and a number of associated ruling primitive variables or observables are stored. We introduce and test two machine learning approaches to discover possible groups or combinations of primitive variables, regardless of data origin, be it numerical or experimental: the first approach is based on regression models, whereas the second is based on classification models. The variable group (here referred to as the new effective good variable) can be considered successfully found when the physical property of interest is characterized by the following effective invariant behavior: in the first method, invariance of the group implies invariance of the property up to a given accuracy; in the other method, upon partition of the physical property values into two or more classes, invariance of the group implies invariance of the class. For the sake of illustration, the two methods are successfully applied to two popular empirical correlations describing the convective heat transfer phenomenon and to Newton’s law of universal gravitation.


1. Introduction

Theoretical modeling and numerical simulations have become invaluable tools for the current scientific and technological advancement across various fields [1]. In essence, models make use of a mathematical description of the physical laws by establishing the relationships between physical variables. The latter variables have the key role of describing, in a complete and non-redundant manner, a system of interest. As such, their correct identification is never a trivial task, especially in systems with little prior knowledge [2,3,4].
Often, in order to effectively model complex systems, it is favorable to search for a more convenient description with a reduced number of effective variables [5,6]. In other words, the system behavior is not described by the directly accessible physical variables, but rather by some of their groups or combinations. In this respect, between the late 19th and early 20th centuries, dimensional analysis and the Buckingham π theorem were introduced [7,8,9]. In the latter approach, the primitive physical variables are mixed so as to create a smaller number of dimensionless groups which are effectively relevant.
Over the years, sophisticated methodologies have also been developed based on machine learning, data mining, and other data-driven approaches [10,11,12]. Other approaches aim at the discovery of symbolic expressions accurately matching data derived from an unknown function. This problem has been tackled with a number of methods [13,14], including sparse regression [15,16,17,18], genetic algorithms [19,20,21], and physics-inspired algorithms like AI Feynman, introduced by Udrescu and Tegmark [22]. Inspired by the latter, some authors of this work presented a multi-objective optimization procedure for reducing the set of composition-based material descriptors by optimally mixing them in power combination form. This resulted in improved classification performance, as demonstrated in a case study focused on superconductors [23]. Moreover, the same procedure was applied by Bonke et al. [24] to identify an effective reduced set of variables in a micelle-based photocatalytic system for solar fuel production. Specifically, the algorithm allowed five physical primitive variables to be analytically mixed into two synthetic features for the optimal binary separation of the experimental samples according to their performances. The practical benefit of such an approach lies in its capability to replace an experimental sample achieving high performance with an alternative combination of its elemental components. This offers the possibility of reducing the most expensive components, and re-balancing the others, at no cost to the overall performance. To our knowledge, developing a robust method for identifying symmetries in descriptors, and thereby determining analytically mixed features that govern a specific phenomenon, remains an unresolved issue [2,22]. Within this framework, the scientific question we aim to address is the same as that of interpretability algorithms like SHAP [25], i.e., identifying relevant features of a trained ML model. However, SHAP works in the original feature space, even when aggregations of such features would explain the phenomenon of interest more effectively. Other recent works deal with feature selection/removal by means of genetic algorithms (but without mixing) [26], or with the identification of the minimal intrinsic variables (but not analytically) [3]. Also, Udrescu and Tegmark [22] consider only a few possible symmetries in the data.
In this work, we present a general and automated methodology, suitable for regression tasks, able to identify groups and/or sets of groups of variables in the power form $x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p}$. We show its effectiveness on popular thermo-fluid dynamics correlations (i.e., Dittus–Boelter and Gnielinski). Furthermore, we also demonstrate that the approach can be easily extended to a more general functional form, proving its ability on Newton’s law of universal gravitation. Also, we employ the optimal feature mixing procedure of ref. [23] for classification tasks, showing its successful application on the former two problems of this study. We refer to the resulting variable groups/sets as effective good variables, since either the examined property of interest or the class effectively depends on groups or sets of features rather than on individual features considered separately. Thus, such effective good variables comply with the definition adopted by Chen et al. [2], as a “complete and non-redundant description of the relevant system”. An overview of the two proposed methodologies is depicted in Figure 1.
The paper is organized in sections as follows. Section 2 presents the methodology for searching good variables in regression models and describes the implementation of classification models for optimal variable mixing towards class separation. This section is further subdivided into specific techniques, including the consideration of single and multiple invariant groups in power form, as well as a broader exploration into non-power forms. Section 3 presents numerical examples and discussions related to our study. We detail the procedure for generating the three datasets employed in this study, based on the three functional forms analyzed here (i.e., Dittus–Boelter correlation, Gnielinski correlation, and Newton’s law of universal gravitation). Here, we also provide a comprehensive evaluation of our approach. Finally, in Section 4, we draw conclusions providing an overview of the main contributions. We point out that, since this work is methodological, it deals only with numerically generated data. However, those methodologies are blessed by generality and, as such, are agnostic to the data origin.

2. Methods

2.1. Problem Statement

Within the context of analyzing complex physical phenomena, discovering effective combinations or groups of primitive variables that characterize a particular physical property of interest is highly desirable. Indeed, this deeper insight not only enhances understanding per se, but also offers practical benefits for optimization purposes, as already pointed out above. Herein, we introduce a methodology to detect invariant groups/sets of variables for regression tasks. In some cases, invariant groups may not be identifiable for regression tasks. Therefore, we also outline a method for identifying similar invariant groups for classification tasks, including the formulation of an analytical classifier.

2.2. Searching for Good Variables by Regression Models

2.2.1. Single Invariant Group in Power Form

Let $f(\mathbf{x}) = f(x_1, \dots, x_n)$ denote a function depending on $n$ physical variables $x_1, \dots, x_n$. If $f(\mathbf{x})$ is invariant with respect to a group of $p$ variables in the form $(x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p})$, with $\alpha_1, \alpha_2, \dots, \alpha_p \in \mathbb{R}$, then, when this group is kept at a constant value $\bar{c}$, $f(\mathbf{x})$ does not change, regardless of the value of each component $x_i, x_j, \dots, x_m$. Stated differently, the above invariance condition requires
$$\alpha_1 \ln(x_i) + \alpha_2 \ln(x_j) + \cdots + \alpha_p \ln(x_m) = c \qquad (1)$$
(with $c = \ln(\bar{c})$), which upon differentiation yields
$$\alpha_1 \frac{\mathrm{d}x_i}{x_i} + \alpha_2 \frac{\mathrm{d}x_j}{x_j} + \cdots + \alpha_p \frac{\mathrm{d}x_m}{x_m} = 0. \qquad (2)$$
We can thus construct the $(1 \times p)$ matrix $B$ at a generic point $\bar{\mathbf{x}}_0 = (x_{i,0}, x_{j,0}, \dots, x_{m,0})$ as
$$B = \left[ \frac{\alpha_1}{x_{i,0}}, \frac{\alpha_2}{x_{j,0}}, \dots, \frac{\alpha_p}{x_{m,0}} \right]. \qquad (3)$$
Let $K$ be a matrix whose columns form an orthonormal basis of the null space (or kernel) of $B$. Hence, the condition of invariance of $f(\mathbf{x})$ with respect to the group $(x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p})$ can be recast into an orthogonality condition, in the configuration space, between the normalized gradient of $f(\mathbf{x})$ and each column of $K$, namely
$$\nabla \tilde{f}\,\big|_{\bar{\mathbf{x}}_0} \cdot K = 0, \qquad (4)$$
where
$$\nabla \tilde{f} = \frac{\nabla f}{\lVert \nabla f \rVert}. \qquad (5)$$
The coefficients $\alpha_1, \alpha_2, \dots, \alpha_p$ can be conveniently normalized, thus yielding the following algebraic system:
$$\begin{cases} \nabla \tilde{f}\,\big|_{\bar{\mathbf{x}}_0} \cdot K = 0 \\ \alpha_1^2 + \alpha_2^2 + \cdots + \alpha_p^2 = 1. \end{cases} \qquad (6)$$
If the non-linear system in Equation (6) is satisfied for the same exponents $(\bar{\alpha}_1, \bar{\alpha}_2, \dots, \bar{\alpha}_p)$ over the entire domain of the features $(x_i, x_j, \dots, x_m)$, the group $x_i^{\bar{\alpha}_1} x_j^{\bar{\alpha}_2} \cdots x_m^{\bar{\alpha}_p}$ represents an intrinsic variable and $f(\mathbf{x})$ is invariant with respect to that group.
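For concreteness, the orthogonality condition above can be checked numerically. The following minimal sketch (not taken from the authors' code) uses a toy function that is invariant with respect to the product $x_1 x_2$, a finite-difference gradient in place of automatic differentiation, and SciPy's `null_space` to build $K$; the residual of Equation (4), restricted to the group variables, is then expected to be close to zero.

```python
import numpy as np
from scipy.linalg import null_space

# Toy function (not one of the paper's datasets): f depends on x1 and x2
# only through the product x1*x2, i.e. the group has normalized exponents
# (0.707, 0.707).
def f(x):
    return (x[0] * x[1])**0.8 * x[2]**0.4

def grad_f(x, h=1e-6):
    """Central finite differences stand in for automatic differentiation."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[i] += h
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2.0 * h)
    return g

x0 = np.array([1.3, 0.7, 2.1])            # generic evaluation point
alpha = np.array([0.707, 0.707])          # candidate normalized exponents

B = np.array([[alpha[0] / x0[0], alpha[1] / x0[1]]])   # Equation (3)
K = null_space(B)                         # orthonormal basis of ker(B)

g = grad_f(x0)
g_tilde = g / np.linalg.norm(g)
print(g_tilde[:2] @ K)                    # residual of Equation (4): ~0
```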

2.2.2. Multiple Concurrent Invariant Groups in Power Form

Clearly, the above procedure can be extended to a function $f(\mathbf{x})$ that is invariant with respect to a higher number of feature groups, with the groups possibly sharing some of the primitive physical variables. Without loss of generality, and for the sake of simplicity, we limit this generalized description to sets consisting of two concurrent groups.
Let $f(\mathbf{x}) = f(x_1, \dots, x_n) = f(x_1, \dots, x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p}, x_j^{\beta_1} x_k^{\beta_2} \cdots x_r^{\beta_q}, \dots, x_n)$ denote a function where two groups share a generic primitive variable $x_j$. In this case, the above procedure applied to the individual groups $x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p}$ and $x_j^{\beta_1} x_k^{\beta_2} \cdots x_r^{\beta_q}$ independently is not suitable any longer and requires a generalization as discussed below.
To investigate the invariance with respect to both groups, we now construct the $(2 \times l)$ matrix $B$:
$$B = \begin{bmatrix} \dfrac{\alpha_1}{x_{i,0}} & \dfrac{\alpha_2}{x_{j,0}} & 0 & \cdots & \dfrac{\alpha_p}{x_{m,0}} & \cdots & 0 \\ 0 & \dfrac{\beta_1}{x_{j,0}} & \dfrac{\beta_2}{x_{k,0}} & \cdots & 0 & \cdots & \dfrac{\beta_q}{x_{r,0}} \end{bmatrix}, \qquad (7)$$
with $\max(p, q) \leq l \leq p + q$, and with $K$ being a matrix whose columns represent an orthonormal basis of the null space of $B$.
As performed above, the condition of invariance requires that the normalized gradient of f ( x ) is orthogonal to each column of the kernel matrix K (see also Equation (4)).
Adding the normalization conditions on the coefficients $\alpha_1, \alpha_2, \dots, \alpha_p$ and $\beta_1, \beta_2, \dots, \beta_q$, we obtain
$$\begin{cases} \nabla \tilde{f}\,\big|_{\bar{\mathbf{x}}_0} \cdot K = 0 \\ \alpha_1^2 + \alpha_2^2 + \cdots + \alpha_p^2 = 1 \\ \beta_1^2 + \beta_2^2 + \cdots + \beta_q^2 = 1 \end{cases} \qquad (8)$$
with the partial derivatives being evaluated at $\bar{\mathbf{x}}_0$. If the non-linear system in Equation (8) is satisfied for the same exponents $(\bar{\alpha}_1, \dots, \bar{\alpha}_p, \bar{\beta}_1, \dots, \bar{\beta}_q)$ over the entire feature domain $(x_i, x_j, \dots, x_r)$, the two groups $x_i^{\bar{\alpha}_1} x_j^{\bar{\alpha}_2} \cdots x_m^{\bar{\alpha}_p}$ and $x_j^{\bar{\beta}_1} x_k^{\bar{\beta}_2} \cdots x_r^{\bar{\beta}_q}$ are intrinsic variables, as the function $f(\mathbf{x})$ is invariant with respect to both groups.
It is worth stressing that more general cases with three or more concurrent invariant groups will imply additional rows for the matrix B . Furthermore, in general, the non-linear system (8) is not necessarily closed (see Section 2.2.4).
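As a minimal illustration (hypothetical values, not from the paper), the joint matrix $B$ for two groups sharing one variable can be assembled explicitly and its kernel computed; the shared variable simply contributes a non-zero entry in both rows of the same column.

```python
import numpy as np
from scipy.linalg import null_space

# Hypothetical case: groups (x_i**a1 * x_j**a2) and (x_j**b1 * x_k**b2)
# share the variable x_j, so l = 3 columns (ordering: x_i, x_j, x_k).
x0 = np.array([1.5, 0.8, 2.0])            # (x_i0, x_j0, x_k0)
a1, a2 = 0.707, 0.707                     # exponents of the first group
b1, b2 = 0.707, -0.707                    # exponents of the second group

B = np.array([
    [a1 / x0[0], a2 / x0[1], 0.0],        # row of the first group
    [0.0,        b1 / x0[1], b2 / x0[2]], # row of the second group
])
K = null_space(B)
print(K.shape)                            # (3, 1): a single kernel direction
```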

2.2.3. Further Generalization to Non Power Forms

The invariant group/set identification procedure, introduced for groups in the power form, can be generalized to other functional relationships. For the sake of illustration, in the following, we restrict to a single invariant group.
Let $f(\mathbf{x}) = f(x_1, \dots, x_n)$ denote a function invariant with respect to a group involving $p$ variables according to a generic functional dependence $g(x_i, x_j, \dots, x_m)$, such that $f(\mathbf{x}) = f(x_1, \dots, g(x_i, x_j, \dots, x_m), \dots, x_n)$. When such a group is kept at a constant $\bar{c}$, even varying its components $x_i, x_j, \dots, x_m$ separately, $f(\mathbf{x})$ does not change. This yields
$$g(x_i, x_j, \dots, x_m) = \bar{c}, \qquad (9)$$
which translates into
$$\frac{\partial g}{\partial x_i}\, \mathrm{d}x_i + \frac{\partial g}{\partial x_j}\, \mathrm{d}x_j + \cdots + \frac{\partial g}{\partial x_m}\, \mathrm{d}x_m = 0. \qquad (10)$$
We can thus construct the $(1 \times p)$ matrix $B$ at a generic point $\bar{\mathbf{x}}_0 = (x_{i,0}, x_{j,0}, \dots, x_{m,0})$ as
$$B = \left[ \frac{\partial g}{\partial x_i}, \frac{\partial g}{\partial x_j}, \dots, \frac{\partial g}{\partial x_m} \right]. \qquad (11)$$
Let K be a matrix whose columns represent an orthonormal basis of the null space of B . Applying the invariance condition of Equation (4) leads to a non-linear system analogous to Equation (6). Notably, the procedure illustrated here can be further extended to sets of groups in any functional form, following a reasoning similar to that applied in Section 2.2.2.

2.2.4. Regression Model and Procedure Implementation

In this study, we assume that a database of physical data is available. Each sample in our database consists of $n$ features $(x_1, \dots, x_n)$ corresponding to a target quantity $f(\mathbf{x})$. We further assume that the target quantity possibly depends on some invariant groups expressed, for instance, in the power form. We thus try to detect such invariance following the above procedure.
As already described, this requires the evaluation of the gradient $\nabla f(\mathbf{x})$. To this end, $f(\mathbf{x})$ can be conveniently approximated with a Deep Neural Network (DNN), allowing the computation of that gradient by means of automatic differentiation. The choice of a DNN (instead of a simpler approximator) is also beneficial to the method’s versatility, making it suitable for the generalized case of non-power forms. All of the DNNs of this study are trained and validated on 85% of the corresponding database (of which 85% is used for training and the remaining 15% for validation) and tested on the remaining 15%. The same DNN structure, reported in Supplementary Note S2, is used for all the examples of this study. Upon network training and validation, automatic differentiation is utilized to calculate the gradient of the function at a specific point $\mathbf{x}_0$ within its domain. Here, the variables in the examined group are randomly chosen within their domains, while the values of the remaining variables are held constant at their averages over the original database.
The function’s gradient is computed and substituted into Equation (6) or (8), depending on the specific case. Subsequently, the system is solved using a least-squares optimization method based on the Trust Region Reflective algorithm [27], which is also suitable for the under-determined (non-closed) systems that may arise during the identification of sets of groups.
This strategy is repeated with 20 different values of $\mathbf{x}_0$ and iterated multiple times, each time updating the initial guess to the mean of the exponents $(\bar{\alpha}_1, \bar{\alpha}_2, \dots, \bar{\alpha}_p)$ over the 20 solutions found in the previous iteration.
An overview of the methodology for identifying invariant groups/sets is depicted in Figure 2.
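A minimal sketch of this pipeline is given below. It is not the authors' implementation (the actual DNN architecture is in Supplementary Note S2 and the full code is in the linked repository): here a small PyTorch MLP stands in for the trained surrogate, SciPy's `least_squares` with `method="trf"` provides the Trust Region Reflective solver, and the candidate group, domain bounds, and iteration counts are illustrative.

```python
import numpy as np
import torch
from scipy.linalg import null_space
from scipy.optimize import least_squares

# Stand-in surrogate for f(x); assumed already trained on the dataset.
model = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)

def normalized_gradient(x0):
    """Gradient of the surrogate at x0 via automatic differentiation."""
    x = torch.tensor(x0, dtype=torch.float32, requires_grad=True)
    model(x).backward()
    g = x.grad.numpy()
    return g / np.linalg.norm(g)

def residuals(alpha, x0, idx):
    """Residuals of Equation (6): orthogonality to ker(B) plus normalization."""
    B = np.atleast_2d(alpha / x0[idx])        # Equation (3) at x0
    K = null_space(B)
    ortho = normalized_gradient(x0)[idx] @ K  # one residual per kernel column
    return np.append(ortho, np.sum(alpha**2) - 1.0)

idx = np.array([0, 1])                        # candidate group, e.g. (u, d)
guess = np.full(len(idx), 1.0 / np.sqrt(len(idx)))
for _ in range(5):                            # outer iterations
    solutions = []
    for _ in range(20):                       # 20 random evaluation points x0
        x0 = np.random.uniform(0.5, 1.5, size=4)   # illustrative bounds only
        fit = least_squares(residuals, guess, args=(x0, idx), method="trf")
        solutions.append(fit.x)
    guess = np.mean(solutions, axis=0)        # updated initial guess
print(guess)
```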

2.3. Searching for Good Variables by Classification Models

The methodology introduced above, based on regression models, is able to identify groups and/or group sets of physical variables upon which a generic function f ( x ) depends. If successful, this approach efficiently reduces the number of such primitive variables by optimizing their combination.
However, a similar goal can be achieved by means of an entirely different approach based on classification models, and it has been initially proposed by Trezza and Chiavazzo in [23]. The latter methodology, briefly reviewed in the following, enables the optimal mixing of primitive features by performing classification and subsequent optimal class separation of the available data samples.
Let $x_1, \dots, x_n$ denote $n$ features, and let $(\tilde{x}_1, \dots, \tilde{x}_n)$ be the corresponding dimensionless quantities
$$\tilde{x}_i = \frac{x_i - x_{i,\min}}{x_{i,\max} - x_{i,\min}} + 1, \qquad (12)$$
normalized by construction within the range $[1, 2]$ to avoid singularities in the procedure below, where $x_{i,\min}$ and $x_{i,\max}$ represent the minimum and the maximum observed values of the $i$th feature over the training set, respectively. It is possible to synthetically create a set of $m \leq n$ mixed features $(y_1, \dots, y_m)$ as
$$y_j = \prod_{i=1}^{n} \tilde{x}_i^{\alpha_{ij}}, \qquad (13)$$
with $\{\alpha_{ij}\}$ being an $(n \times m)$ matrix estimated by means of a multi-objective optimization algorithm, as described below. Finally, the new variables $y_j$ can be normalized within the interval $[0, 1]$ according to
$$\tilde{y}_j = \frac{y_j - y_{j,\min}}{y_{j,\max} - y_{j,\min}}. \qquad (14)$$
The main idea is that the matrix α i j shall be selected following an optimization criterion attempting the largest possible separation between two (or more) different classes partitioning the values of a physical property of interest. Clearly, this can be accomplished by maximizing a certain distance between the classes. However, a multi-objective optimization procedure could also be pursued.
For instance, in a binary classification, the matrix α i j in Equation (13) can be chosen to lie on the Pareto front while simultaneously pursuing: (i) maximization of a carefully selected distance between the two classes; (ii) minimization of a norm of the covariance matrix of the first class distribution; and (iii) minimization of a norm of the covariance matrix of the second class distribution.
The main rationale behind minimizing a norm of the covariance matrix of each class distribution (along with maximizing a distance between classes) is the aim of obtaining smooth distribution functions that can be analytically best fitted. Following this idea, below we pursue multi-objective optimization for all the examined cases and provide some optimal solutions from the Pareto fronts.
In this study, we adopt genetic algorithms for the optimization, and the Bhattacharyya distance between the histograms of the two equally binned classes [28,29] is the class distance to be maximized during the multi-objective optimization. However, as shown in ref. [23], other options for the class distance may be considered, such as the Wasserstein distance [30] or the average number of nearest neighbors of one class lying within a fixed radius of each sample of the other class. Furthermore, herein we utilize the power-form variable mixing outlined in Equation (13). Nonetheless, alternative grouping options (e.g., linear mixing, as employed in ref. [23]) are also possible, with the overall methodology remaining unchanged. Also, the procedure can be further generalized to a number of classes greater than 2.
In this case, the genetic algorithm aims at simultaneously maximizing the pairwise distances between the classes [23]. For the remaining two objectives, in one-dimensional cases, a numerical estimate of the standard deviation of the binned data in the two classes is computed; in two-dimensional (or higher) cases, the determinant of the covariance matrix can be utilized. From the practical standpoint, considering a dataset of physical samples characterized by features $x_1, \dots, x_n$ and their corresponding response $f(\mathbf{x})$, such a dataset is partitioned into two classes based on a carefully chosen threshold for $f(\mathbf{x})$. After the aforementioned pre-processing steps, a genetic algorithm optimization is employed to identify the Pareto front. This optimization concurrently seeks to minimize the variances of the two classes (in one dimension) or the determinants of the covariances of the two classes (in two or more dimensions), while also maximizing the Bhattacharyya distance between the classes. A summary of this procedure is illustrated in Figure 3.
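A minimal sketch of the objective evaluation for the binary, one-dimensional case is shown below (hypothetical helper names; the genetic-algorithm wrapper, e.g. NSGA-II from a library such as pymoo, is assumed and omitted). It reproduces the normalization of Equation (12), the power mixing of Equation (13) with $m = 1$, the rescaling of Equation (14), and the three objectives described above.

```python
import numpy as np

def normalize_features(X, X_min, X_max):
    """Map each primitive feature into [1, 2], as in Equation (12)."""
    return (X - X_min) / (X_max - X_min) + 1.0

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two equally binned histograms."""
    bc = np.sum(np.sqrt(p * q))
    return -np.log(max(bc, 1e-12))

def objectives(alpha, X_tilde, labels, bins=30):
    """Return (-distance, var of class 0, var of class 1); all minimized."""
    y = np.prod(X_tilde**alpha, axis=1)            # Equation (13), m = 1
    y = (y - y.min()) / (y.max() - y.min())        # Equation (14)
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(y[labels == 0], bins=edges)
    q, _ = np.histogram(y[labels == 1], bins=edges)
    p, q = p / p.sum(), q / q.sum()
    return np.array([-bhattacharyya_distance(p, q),
                     np.var(y[labels == 0]),
                     np.var(y[labels == 1])])

# A multi-objective genetic algorithm would evolve `alpha` (one exponent per
# primitive feature) and return the Pareto front of these three objectives.
```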

3. Numerical Examples and Discussion

3.1. Datasets Creation

We first create three datasets from physical models, to be used to train and test Machine Learning (ML) tools aiming to predict physical properties of interest. More specifically, the first two datasets are generated using popular thermo-fluid dynamic correlations (i.e., the Dittus–Boelter and Gnielinski correlations), with the target quantity being the Nusselt number. Those datasets are created starting from the physical properties of 16 real liquids at ambient temperature and pressure [31], and evaluating the kinematic viscosity $\nu$ and the thermal diffusivity $\kappa$ of each liquid from tabulated data (see Supplementary Note S1). Each fluid accounts for 500 value combinations of the flow speed $u$, the hydraulic diameter $d$, and the friction factor $f$ (the latter is only required for the Gnielinski correlation). The values of these three variables are randomly chosen in the ranges $[0.1, 1]$, $[0.01, 0.1]$, and $[0.02, 0.09]$, respectively. This leads to a total of 8000 samples for each dataset. The Nusselt number is then calculated according to the corresponding equations. Finally, emulating what is typically experienced in experimental measurements, noise is added on top of the correct target values, by sampling 8000 points $\eta_i$ from a Gaussian distribution with mean $\mu = 0$ and standard deviation $\sigma = 0.1$; the target value of each entry $\mathrm{Nu}_i$ is then multiplied by $(1 + \eta_i)$. The third dataset is created on the basis of Newton’s law of universal gravitation. It is constructed in a similar fashion, by generating random values within the range $[10^{16}, 10^{18}]$ for the masses $m_1$ and $m_2$, and within the ranges $[-10^{12}, -10^{10}]$ and $[10^{10}, 10^{12}]$ for the spatial coordinates $x_1, y_1, z_1$ and $x_2, y_2, z_2$, respectively. Here, the target value is the gravitational force $F_g$, which is computed with some noise added following the same procedure as above.
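A minimal sketch of the Dittus–Boelter dataset generation is given below; the two fluid property pairs are placeholders standing in for the 16 tabulated liquids of Supplementary Note S1, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder (nu, kappa) pairs; the paper uses 16 real liquids at ambient
# temperature and pressure (Supplementary Note S1).
fluids = {"water": (1.0e-6, 1.4e-7), "ethanol": (1.5e-6, 8.0e-8)}

rows = []
for nu, kappa in fluids.values():
    n = 500                                    # combinations per fluid
    u = rng.uniform(0.1, 1.0, n)               # flow speed
    d = rng.uniform(0.01, 0.1, n)              # hydraulic diameter
    Nu = 0.023 * (u * d / nu)**0.8 * (nu / kappa)**0.4   # Dittus-Boelter, Equation (15)
    eta = rng.normal(0.0, 0.1, n)              # multiplicative Gaussian noise
    rows.append(np.column_stack([u, d, np.full(n, nu), np.full(n, kappa),
                                 Nu * (1.0 + eta)]))

dataset = np.vstack(rows)   # columns: u, d, nu, kappa, noised Nu
```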

3.2. Dittus–Boelter Equation

The Dittus–Boelter correlation for a fluid undergoing heating is expressed by the following equation:
$$\mathrm{Nu} = 0.023\, \mathrm{Re}_D^{4/5}\, \mathrm{Pr}^{0.4}, \qquad (15)$$
where $\mathrm{Re}_D = \frac{u d}{\nu}$ represents the Reynolds number and $\mathrm{Pr} = \frac{\nu}{\kappa}$ denotes the Prandtl number. The equation can be rewritten by substituting the dimensionless quantities $\mathrm{Re}$ and $\mathrm{Pr}$, thus obtaining
$$\mathrm{Nu} = 0.023 \left( \frac{u d}{\nu} \right)^{4/5} \left( \frac{\nu}{\kappa} \right)^{0.4} = 0.023\, u^{0.8} d^{0.8} \nu^{-0.4} \kappa^{-0.4}. \qquad (16)$$
It is easy to verify that the objective value is invariant with respect to all the possible combinations of the input features $u, d, \nu, \kappa$, either as couples or as triplets.
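As a quick numerical illustration of this invariance (a check we add here, assuming fixed fluid properties), different $(u, d)$ pairs sharing the same product $u \cdot d$ yield the same Nusselt number:

```python
nu, kappa = 1.0e-6, 1.4e-7          # fixed fluid properties (placeholder values)

def nusselt(u, d):
    """Dittus-Boelter correlation, Equation (15)."""
    return 0.023 * (u * d / nu)**0.8 * (nu / kappa)**0.4

# All three pairs share u*d = 0.02, so they return identical values.
print(nusselt(0.5, 0.04), nusselt(0.2, 0.1), nusselt(1.0, 0.02))
```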

3.2.1. Use of Regression Models

We attempt to recover the symmetries of the Nusselt number with respect to binary and ternary groups in the forms $x_i^{\alpha_1} x_j^{\alpha_2}$ and $x_i^{\alpha_1} x_j^{\alpha_2} x_k^{\alpha_3}$, relying only on noised data and adopting the methodology illustrated in Section 2.2. As outlined above, this requires the evaluation of the gradient of the noised Nusselt number with respect to the input features, namely $\nabla \overline{\mathrm{Nu}}(u, d, \nu, \kappa)$, where for each sample $i$, $\overline{\mathrm{Nu}}_i = \mathrm{Nu}_i (1 + \eta_i)$ (see Section 2 for details). This can be conveniently computed by automatic differentiation over a DNN model approximating $\overline{\mathrm{Nu}}(u, d, \nu, \kappa)$. As input features of the DNN, the four variables $u, d, \nu, \kappa$ are used. The dataset is thus divided into three parts: (i) a training set, (ii) a validation set used to detect potential overfitting, and (iii) a testing set for the comprehensive evaluation of the model performance. Figure 4a showcases the model predictions over the testing set, while in Figure 4b, the corresponding loss across epochs is depicted.
Notably, the model is highly predictive, with a coefficient of determination $R^2 = 0.963$ over the testing set and no evidence of overfitting. The model is thus fed to the optimization algorithm. For each of the ten expected invariant groups (six binary and four ternary), the algorithm estimates the normalized exponents $\alpha_1, \alpha_2, \alpha_3$ 20 times. Table 1 shows the true normalized exponents $\alpha_1, \alpha_2, \alpha_3$, together with the means $\mu_{\alpha_1}, \mu_{\alpha_2}, \mu_{\alpha_3}$ and the standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}, \sigma_{\alpha_3}$ over those 20 evaluations. Our findings indicate that the method correctly identifies invariant groups in both couple and triplet forms. Furthermore, it demonstrates the ability to estimate the normalized exponents $\alpha_1, \alpha_2, \alpha_3$ of the individual primitive variables within each group, with relative percent errors not exceeding 11.0% for couples and 16.6% for triplets. We notice that the normalized exponents are obtained up to the sign: for instance, the exponents determined for the pair $(u, d)$ are negative instead of the positive values expected from the Dittus–Boelter equation.
In Table 1, we label as found all the groups with standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}, \sigma_{\alpha_3} \leq 0.2$, since low values of $\sigma$ imply that the results of the 20 evaluations are consistent. We also label as reliable all the groups where none of the evaluated exponents approaches 0 or $\pm 1$, since such cases would correspond to trivial solutions. Remarkably, all six binary groups and all four ternary groups of input features are found to comply with the condition of invariance, and the results for all of them are reliable.

3.2.2. Use of Classification Models

In a second attempt to reduce the number of input variables, we investigate the possible existence of symmetries for classification, aiming at the construction of $m$ mixed features in the form $y_j = \prod_{i=1}^{n} \tilde{x}_i^{\alpha_{ij}}$, where $(\tilde{x}_1, \dots, \tilde{x}_n)$ are the primitive variables properly normalized within the interval $[1, 2]$ and $\{\alpha_{ij}\}$ denotes an $(n \times m)$ matrix optimally estimated as detailed in Section 2.3. Such features are finally normalized as $\tilde{y}_j$ within the interval $[0, 1]$. In the case of the Dittus–Boelter correlation, we set class 1 for samples with $\overline{\mathrm{Nu}} < 395$ and class 2 for samples with $\overline{\mathrm{Nu}} \geq 395$. We thus create a single mixed feature ($m = 1$) according to the above procedure. Figure 5a reports the Probability Density Function (PDF) binning of the training set data over the two classes (a higher value of the PDF means a higher number of items in the corresponding bin) against the normalized flow velocity $u_{\mathrm{norm}}$, while Figure 5b shows the same PDFs (i.e., class distributions) against the normalized mixed feature $\tilde{y}_1$. It is noteworthy that, when represented against the primitive feature, the two classes exhibit considerable overlap, whereas the same two classes appear well separated when plotted against the mixed feature. As a result, it is particularly convenient to attempt an analytical best fitting of the two distributions depicted in Figure 5b, approximated by a Generalized Extreme Value (GEV) distribution with density
$$g(\tilde{y}_1) = \frac{1}{\sigma} \left[ 1 + \omega \left( \frac{\tilde{y}_1 - \mu}{\sigma} \right) \right]^{-\frac{\omega + 1}{\omega}} \exp\left\{ - \left[ 1 + \omega \left( \frac{\tilde{y}_1 - \mu}{\sigma} \right) \right]^{-1/\omega} \right\}. \qquad (17)$$
The fitting is performed by means of the SciPy Python package [32]. The specific computed GEV distribution for samples with $\overline{\mathrm{Nu}} < 395$ has factors $\mu = 0.338$, $\sigma = 0.137$, and $\omega = 0.342$, whereas the GEV distribution for samples with $\overline{\mathrm{Nu}} \geq 395$ has factors $\mu = 0.675$, $\sigma = 0.077$, and $\omega = 0.074$. Figure 5c shows the PDFs over the binned data of the testing set, reported against the same mixed feature $\tilde{y}_1$, along with the GEV fittings computed on the training set. Notably, the classes are still well separated, with a good agreement between the GEV distributions and the testing set densities.
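For reference, such a fit can be reproduced with SciPy's `genextreme` distribution (a sketch with placeholder data; note that SciPy parametrizes the shape as $c = -\omega$ with respect to Equation (17)):

```python
import numpy as np
from scipy.stats import genextreme

# Placeholder samples standing in for the normalized mixed-feature values of
# the two training classes (in the paper these come from Equation (13)).
y1_class1 = np.random.default_rng(1).beta(2.0, 5.0, 4000)
y1_class2 = np.random.default_rng(2).beta(5.0, 2.0, 4000)

for y in (y1_class1, y1_class2):
    c, mu, sigma = genextreme.fit(y)          # maximum-likelihood GEV fit
    print(f"mu = {mu:.3f}, sigma = {sigma:.3f}, omega = {-c:.3f}")
```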
Furthermore, we make an attempt at the creation of two mixed normalized features $\tilde{y}_1, \tilde{y}_2$ ($m = 2$) with the same classes. Specifically, Figure 6a shows the two-dimensional PDF binning of the training set data over the two classes, against the normalized flow velocity $u_{\mathrm{norm}}$ and the hydraulic diameter $d_{\mathrm{norm}}$. Figure 6b shows the same PDFs against the normalized mixed features $\tilde{y}_1, \tilde{y}_2$, constructed according to Equation (13) by power combination of the four primitive variables and choosing the point of the Pareto front with the least overlapping of the two distributions. Similarly to the one-dimensional case, the classes appear well separated when plotted against the mixed features, whereas there is a wide overlap when plotted against the primitive variables. Figure 6c shows the PDFs over the binned data of the testing set reported against the same mixed features. The two classes still appear well separated.
Finally, we aim at the construction of $m = 1$ mixed feature for separating the data samples into three classes: class 0 for $\overline{\mathrm{Nu}} < 197.5$, class 1 for $197.5 \leq \overline{\mathrm{Nu}} < 395$, and class 2 for $\overline{\mathrm{Nu}} \geq 395$. The multi-objective optimization is performed aiming at concurrently maximizing the pairwise distances between the classes. Figure 7 shows the separation into classes reported against the normalized flow velocity $u_{\mathrm{norm}}$ and the new mixed feature $\tilde{y}_1$: in line with the previous cases, the substantial overlap observed when representing classes against the primitive feature is significantly reduced when using the optimized mixed feature. The best-fitting GEV distributions for the three classes are computed according to Equation (17) over the training set samples. Specifically, the GEV associated with Class 0 has factors $\mu = 0.284$, $\sigma = 0.103$, $\omega = 0.494$; the GEV for Class 1 has factors $\mu = 0.475$, $\sigma = 0.067$, $\omega = 0.193$; and the GEV for Class 2 has factors $\mu = 0.684$, $\sigma = 0.080$, $\omega = 0.103$. When plotting the PDFs over the binned data of the testing set against the mixed feature, the classes appear well separated, and there is good agreement between the GEV distributions and the testing set densities.

3.3. Gnielinski Correlation

The Gnielinski correlation for turbulent flow in tubes is expressed by the equation
$$\mathrm{Nu} = \frac{(f/8)\,(\mathrm{Re}_D - 1000)\,\mathrm{Pr}}{1 + 12.7\,(f/8)^{1/2}\,(\mathrm{Pr}^{2/3} - 1)}, \qquad (18)$$
where $\mathrm{Re}_D = \frac{u d}{\nu}$ represents the Reynolds number, $\mathrm{Pr} = \frac{\nu}{\kappa}$ is the Prandtl number, and $f$ denotes the friction factor. The equation can be reformulated by substituting the dimensionless quantities $\mathrm{Re}$ and $\mathrm{Pr}$, resulting in
$$\mathrm{Nu} = \frac{(f/8) \left( \dfrac{u d}{\nu} - 1000 \right) \dfrac{\nu}{\kappa}}{1 + 12.7\,(f/8)^{1/2} \left[ \left( \dfrac{\nu}{\kappa} \right)^{2/3} - 1 \right]}. \qquad (19)$$
In this case, the Nusselt number is invariant only with respect to the combination of flow velocity and hydraulic diameter $(u, d)$, with real exponents $(1, 1)$ and normalized exponents $(0.707, 0.707)$.

3.3.1. Use of Regression Models

Here, we first attempt the detection of possible symmetries of the noised Nusselt number $\overline{\mathrm{Nu}}$, adopting the methodology proposed in Section 2.2.2. We obtain access to the gradient $\nabla \overline{\mathrm{Nu}}(u, d, \nu, \kappa, f)$ by means of automatic differentiation over a DNN approximating $\overline{\mathrm{Nu}}(u, d, \nu, \kappa, f)$. Figure 8a shows the model predictions over the testing set, while in Figure 8b the corresponding loss across epochs is depicted. The model is highly predictive, with a coefficient of determination $R^2 = 0.978$ over the testing set, and no evidence of overfitting is observed. The trained model is thus fed to the optimization algorithm, looking for binary groups in the form $x_i^{\alpha_1} x_j^{\alpha_2}$. The average estimates $\mu_{\alpha_1}, \mu_{\alpha_2}$ of the exponents $\alpha_1, \alpha_2$, together with their standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}$, are reported in Table 2.
Specifically, the procedure correctly identifies the expected group $(u, d)$, whereas for most other groups, the results correspond either to the trivial solution (i.e., the absolute value of one of the exponents is close to 1 and the other is close to 0), or at least one of the evaluated values presents too high a variance (i.e., $\sigma_{\alpha_i} > 0.2$). Remarkably, for the remaining cases (namely $(u, \nu)$, $(u, f)$, $(d, \nu)$, $(d, f)$, $(\nu, \kappa)$, $(\nu, f)$), the procedure identifies groups complying with the invariance condition, albeit not explicitly expected from the Gnielinski correlation. We thus concluded that the suggested procedure has detected a local invariance, and this can be numerically verified as follows (here we limit ourselves to groups of two variables only). First, the features not appearing in the group are kept constant. Second, a random sample $i$ of the dataset is considered for the evaluation of the quantity $\tilde{c} = x_{1,i}^{\alpha_1} x_{2,i}^{\alpha_2}$. Third, a vector $\bar{x}_1$ of evenly spaced points in the domain of the first feature is created. The array corresponding to the second feature is then obtained as $\bar{x}_2 = (\tilde{c}\, \bar{x}_1^{-\alpha_1})^{1/\alpha_2}$. A new dataset is thus constructed, with the variables outside the group being constant and the variables in the group being replaced by $\bar{x}_1, \bar{x}_2$. For all those samples, the response value $f(\mathbf{x})$ is computed. The local invariance is demonstrated when $f(\mathbf{x})$ is approximately constant over the whole newly constructed dataset. Further details can be found in Supplementary Note S3.
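A minimal sketch of this verification step is reported below (hypothetical function and argument names; `f_model` stands for either the analytical correlation or the trained surrogate):

```python
import numpy as np

def check_local_invariance(f_model, X, i1, i2, a1, a2, n_points=50, rtol=0.05):
    """Check local invariance of f_model w.r.t. the group x_i1**a1 * x_i2**a2.

    Out-of-group variables are frozen at the values of a random sample, the
    group value c_tilde is fixed, and f_model is evaluated along the curve
    x2 = (c_tilde / x1**a1)**(1/a2)."""
    rng = np.random.default_rng(0)
    sample = X[rng.integers(len(X))].copy()       # random reference sample
    c_tilde = sample[i1]**a1 * sample[i2]**a2     # group value to preserve

    x1 = np.linspace(X[:, i1].min(), X[:, i1].max(), n_points)
    x2 = (c_tilde / x1**a1)**(1.0 / a2)

    X_new = np.tile(sample, (n_points, 1))        # out-of-group vars constant
    X_new[:, i1], X_new[:, i2] = x1, x2

    f_vals = f_model(X_new)
    spread = (f_vals.max() - f_vals.min()) / abs(f_vals.mean())
    return spread < rtol, spread
```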
Furthermore, armed with the same DNN regression model, we try to detect invariance of the Nusselt number with respect to sets of groups in the forms $x_i^{\alpha_1} x_j^{\alpha_2} x_k^{\alpha_3}$ and $x_k^{\beta_1} x_l^{\beta_2}$, employing the procedure presented in Section 2.2.2. Table 2 also reports the results for selected sets of two feature couples, including sets appearing explicitly in the Gnielinski correlation—i.e., $[(u, \nu), (\nu, \kappa)]$ and $[(d, \nu), (\nu, \kappa)]$—and for selected sets of features comprising a triplet and a couple—e.g., $[(u, d, \nu), (\nu, \kappa)]$, which is extremely relevant as it comprises the Reynolds and Prandtl numbers. For all of the mentioned sets, the procedure correctly identifies the exponents that make the groups compliant with the condition of invariance. Indeed, the normalized exponents obtained with the procedure for the sets $[(u, \nu), (\nu, \kappa)]$ and $[(d, \nu), (\nu, \kappa)]$ are $[(0.730, 0.680), (0.656, 0.755)]$ and $[(0.695, 0.718), (0.687, 0.727)]$, respectively, whereas the analytical solution is $[(0.707, 0.707), (0.707, 0.707)]$ for both cases. The normalized exponents obtained for the set $[(u, d, \nu), (\nu, \kappa)]$ turn out to be $[(0.602, 0.552, 0.570), (0.698, 0.716)]$, whereas the analytical solution corresponds to $[(0.577, 0.577, 0.577), (0.707, 0.707)]$.
Table 2 also reports some identified sets that locally comply with the condition of invariance (see Supplementary Note S3 for more details), together with some randomly selected sets for which the procedure succeeds at finding the invariance property, but for which the results are close to the trivial solution, meaning that at least one of the evaluated exponents approaches either 0 or $\pm 1$ and is thus not reliable. Following the same approach as above, Table 2 labels as found the groups with low standard deviations, and as reliable all the groups in which none of the evaluated exponents approaches 0 or $\pm 1$. Remarkably, the analytically evident couple $(u, d)$, the sets of two couples $[(u, \nu), (\nu, \kappa)]$ and $[(d, \nu), (\nu, \kappa)]$, and the set comprising a triplet and a couple $[(u, d, \nu), (\nu, \kappa)]$ are all identified. Furthermore, eight additional couples are found that locally comply with the condition of invariance, six of them being reliable, and only one couple is not found. In the case of the selected sets of two feature couples, five more sets locally comply with the condition of invariance, with only two of them not being reliable. Finally, the two additional sets comprising a triplet and a couple are reliably found to represent a local invariance.

3.3.2. Use of Classification Models

In a second attempt at reducing the number of input variables, we aim at the construction of mixed optimized features, such as power combinations of the normalized primitive variables for classification, according to Equation (13). In the case of the Gnielinski correlation, we set class 1 for samples with $\overline{\mathrm{Nu}} < 500$ and class 2 for samples with $\overline{\mathrm{Nu}} \geq 500$. Once the optimization routine finds the Pareto front, the mixed features are created using the point of the Pareto front with the least overlapping of the two classes according to the Bhattacharyya distance. Figure 9a reports the PDF binning of the training set data over the two classes against the normalized flow velocity $u_{\mathrm{norm}}$, while Figure 9b shows the same PDFs against the normalized mixed feature $\tilde{y}_1$; again, when represented against the primitive variable, the two classes exhibit significant overlap, whereas they appear distinctly separated when plotted against the mixed feature. Two best-fitting GEV distributions are computed for the two classes according to Equation (17): for samples in class 1, the factors are $\mu = 0.276$, $\sigma = 0.112$, $\omega = 0.223$; for samples in class 2, they are $\mu = 0.510$, $\sigma = 0.091$, $\omega = 0.000$. Figure 9c shows the PDFs over the binned data of the testing set, reported against the same mixed feature $\tilde{y}_1$, along with the GEV fittings computed on the training set. Notably, the classes are still well separated, with a good agreement between the GEV distributions and the testing set densities.
Furthermore, we attempt to create two mixed normalized features $\tilde{y}_1, \tilde{y}_2$ with the same classes. Figure 10a shows the two-dimensional PDF binning of the training set data over the two classes against the normalized flow velocity $u_{\mathrm{norm}}$ and friction factor $f_{\mathrm{norm}}$. Figure 10b shows the same PDFs against the mixed features $\tilde{y}_1, \tilde{y}_2$, constructed according to Equation (13) by power combination of the five relevant features and choosing the point of the Pareto front with the least overlapping of the classes. Similarly to the one-dimensional case, the classes again appear well separated when plotted against the mixed features, whereas there is a wide overlap when plotted against the primitive variables. Figure 10c shows the PDFs over the binned data of the testing set, reported against the same mixed features $\tilde{y}_1, \tilde{y}_2$; the two classes still appear well separated.
Finally, we aim at the construction of $m = 1$ mixed feature for separating the data samples into three classes: class 0 for $\overline{\mathrm{Nu}} < 400$, class 1 for $400 \leq \overline{\mathrm{Nu}} < 900$, and class 2 for $\overline{\mathrm{Nu}} \geq 900$. The multi-objective optimization is performed aiming at concurrently maximizing the pairwise distances between the classes. Figure 11 shows the separation into classes reported against the normalized flow velocity $u_{\mathrm{norm}}$ and the new mixed feature $\tilde{y}_1$: in line with the previous cases, the considerable overlap observed when representing classes against the primitive feature is significantly reduced when using the optimized mixed feature. The best-fitting GEV distributions for the three classes are computed according to Equation (17) over the training set samples; specifically, the GEV associated with Class 0 has factors $\mu = 0.245$, $\sigma = 0.083$, $\omega = 0.346$, the GEV for Class 1 has factors $\mu = 0.430$, $\sigma = 0.053$, $\omega = 0.143$, and the GEV for Class 2 has factors $\mu = 0.599$, $\sigma = 0.074$, $\omega = 0.017$.
When plotting the PDFs over the binned data of the testing set against the mixed feature, the classes appear well separated, and there is good agreement between the GEV distributions and the testing set densities.

3.4. Newton’s Law of Universal Gravitation

The magnitude of the interaction force $F_g$ between two objects of masses $m_1$ and $m_2$ is expressed via Newton’s law of universal gravitation as
$$F_g = G\, \frac{m_1 m_2}{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2}, \qquad (20)$$
where $G$ is the gravitational constant, and $(x_1, y_1, z_1)$ and $(x_2, y_2, z_2)$ are the coordinates of the centers of mass of the two objects.
In this case, our aim is to identify the invariance of the noised $\overline{F}_g$ with respect to the groups $(x_1 - x_2)$, $(y_1 - y_2)$, and $(z_1 - z_2)$. To this end, we employ the procedure presented in Section 2.2.3 for the identification of groups of general form, focusing specifically on groups with functional dependence $\alpha_1 x_i + \alpha_2 x_j$. As usual, we obtain access to the gradient $\nabla \overline{F}_g(x_1, x_2, y_1, y_2, z_1, z_2, m_1, m_2)$ by means of automatic differentiation over a DNN approximating $\overline{F}_g(x_1, x_2, y_1, y_2, z_1, z_2, m_1, m_2)$. Figure 12a shows the model predictions over the testing set, while in Figure 12b, the corresponding loss across epochs is depicted; specifically, the model is highly predictive, with a coefficient of determination $R^2 = 0.962$, and no evidence of overfitting is observed. The trained model is thus fed to the optimization algorithm. The average estimates $\mu_{\alpha_1}, \mu_{\alpha_2}$ of the coefficients $\alpha_1, \alpha_2$, together with their standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}$, are reported in Table 3. Specifically, the procedure correctly identifies the groups involving $(x_1, x_2)$, $(y_1, y_2)$, and $(z_1, z_2)$, with estimates $\mu_{\alpha_1}, \mu_{\alpha_2}$ of $(0.679, 0.721)$, $(0.711, 0.702)$, and $(0.683, 0.722)$, respectively, compared to the true normalized coefficients $(0.707, 0.707)$ for all cases. Table 3 flags as found the groups with low standard deviations, and as reliable all the groups in which none of the evaluated coefficients approaches 0 or $\pm 1$. The other variable couples show a high variance over the corresponding average estimates, meaning that no further invariant group is identified.
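The linear-group condition can be illustrated with a quick check we add here (illustrative values only): for the candidate group $\alpha_1 x_1 + \alpha_2 x_2$ with $(\alpha_1, \alpha_2) = (0.707, -0.707)$, the matrix $B = [\alpha_1, \alpha_2]$ of Equation (11) has a kernel spanned by $(\alpha_2, -\alpha_1)$, and the gradient of $F_g$ is indeed orthogonal to it, since $F_g$ depends on $x_1$ and $x_2$ only through their difference.

```python
import numpy as np

G = 6.674e-11
m1 = m2 = 1.0e17

def Fg(x1, x2):
    # 1-D slice of Equation (20): only the x-coordinates vary here; the fixed
    # term 1.0e22 stands for (y1 - y2)**2 + (z1 - z2)**2.
    return G * m1 * m2 / ((x1 - x2)**2 + 1.0e22)

alpha = np.array([0.707, -0.707])            # candidate linear group x1 - x2
k = np.array([alpha[1], -alpha[0]])          # spans ker(B), with B = [alpha1, alpha2]
k = k / np.linalg.norm(k)

# Finite-difference gradient of Fg at a generic point (x1, x2)
x1, x2, h = 3.0e11, 1.0e11, 1.0e5
grad = np.array([
    (Fg(x1 + h, x2) - Fg(x1 - h, x2)) / (2 * h),
    (Fg(x1, x2 + h) - Fg(x1, x2 - h)) / (2 * h),
])
print(grad / np.linalg.norm(grad) @ k)       # ~0: orthogonality of Equation (4)
```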

4. Conclusions and Final Remarks

In this study, we have implemented two innovative methodologies for searching for optimal variables to describe physical data, making use of both regression and classification models applied to the data. Specifically, both methodologies leverage datasets well suited for machine learning, with the goal of predicting a property of interest.
In particular, the methodology introduced for regression tasks has the ambition of finding groups of variables that are valid over the whole parameter space spanned by the available data. However, due to various factors such as noise in the data, this method might not converge, even if the group exists. Conversely, the methodology introduced for classification tasks gives up on this ambition in favor of the greater robustness of an optimization framework, aiming only to find possible separations in the parameter space. This can be useful, for instance, to find optimal areas of the parameter space with the highest values of a properly defined objective function of interest, even if the group does not globally govern such an objective function. Due to their different objectives, the two methods generally do not produce the same groups.
More precisely, the procedure based on the regression model introduced here enables the identification of invariant groups and/or sets of variable groups with respect to which the property of interest is invariant. We have demonstrated its effectiveness on noised data generated by three well known functional relationships: the Dittus–Boelter correlation, the Gnielinski correlation, and Newton’s law of universal gravitation. It is noteworthy that we did not apply any particular noise reduction technique to the three datasets; instead, our methods directly handle the noisy data as-is, demonstrating their robustness. Also, we acknowledge that direct measurement of the Nusselt number would not be feasible in real world scenarios, and that the calculation method of the examined properties of interest (i.e., Nusselt number and gravitational force) is already established. However, our selection of such examples is only aimed at showcasing the effectiveness of our methodologies in sufficiently simple cases. For the former two cases, the procedure has accurately detected groups/sets in power form, while for the latter case, a generalized algorithm successfully identified groups with a linear form. It is worth stressing that, in all the examples, our procedure did not end up with any false negative, i.e., whenever a group exists, it is correctly identified. Interestingly, the methodology is potentially applicable to any functional form, as illustrated in Section 2.2.3. Additionally, the effectiveness of identifying a group in a situation where that group is already known—like in this study—depends on a tolerance threshold set between the identified group and the actual one. In real world scenarios, the identified groups serve as candidates of mixed variables, whose validity has to be verified a posteriori.
The procedure based on classification of the physical property values allows for the determination of an appropriate set of exponents to combine all the primitive variables in power form, thus constructing mixed features optimized for the classification task. Specifically, we have shown its effectiveness on the Dittus–Boelter and Gnielinski correlations. Indeed, the methodology effectively enables the separation of classes even with just one mixed feature, whereas a single primitive variable fails to achieve class separation. Furthermore, we also provide examples with one and two mixed variables, together with separation into two or three classes.
It is worth stressing that the methodologies presented in this study are blessed by generality and, as such, are not restricted to the selected case studies and are agnostic in terms of data origin, be it numerical or experimental. Therefore, potential applications can be envisioned in various other fields in the future. In particular, we notice that the identification of effective good variables is not only interesting per se, to possibly gain a deeper insight into a physical system, but can also be practically advantageous when designing experiments. Indeed, the methodology to detect groups/sets in regression facilitates efficient group/set-level adjustments rather than individually fine-tuning the variables within the groups/sets themselves.
Moreover, the classification procedure can enable the reduction of the numerous original primitive variables to a minimal set of optimized variables concerning a specific physical property of interest. Notably, combinations of variables yielding identical mixed feature values exhibit similar performance in terms of the property being classified. As a result, it is possible to find alternative combinations of primitive variables without compromising the overall performance of a given system under study. From a more practical standpoint, the methodologies described here can help generalize optimal system conditions, thus helping to decrease the most resource-intensive components while properly re-balancing the others. As an example, these general methodologies may thus hold the potential to save resources, such as costly reagents (e.g., Bonke et al. [24] for solar fuel production) or expensive materials, as in perovskite solar cell optimization [33]. An additional advantage associated with the correct identification of a reduced set of ruling variables is their possible use for driving sequential learning or Bayesian optimization processes [34,35].
Finally, we believe that an interesting development of the presented methodologies, to be pursued in the future, lies in the handling of systems ruled not only by numerical parameters, but also by categorical ones.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/make6030077/s1, Supplementary Note S1: Properties of liquids; Supplementary Note S2: DNN structure; Supplementary Note S3: Local invariance; Supplementary Note S4: Mixed optimized variables for classification.

Author Contributions

Conceptualization, E.C.; methodology, E.C. and G.T.; software, E.C., G.T. and G.B.; validation, G.B. and G.T.; formal analysis, E.C., G.T. and G.B.; investigation, E.C., G.T. and G.B.; resources, E.C.; data curation, G.B. and G.T.; writing—original draft preparation, G.B.; writing—review and editing, G.B., G.T. and E.C.; visualization, G.B. and G.T.; supervision, E.C.; project administration, E.C.; funding acquisition, E.C. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge funding under the National Recovery and Resilience Plan (NRRP), Mission 4 Component 2 Investment 1.3—Call for tender No. 1561 of 11.10.2022 of Ministero dell’Università e della Ricerca (MUR); funded by the European Union—NextGenerationEU.

Data Availability Statement

Processed datasets and trained models of this study are publicly available on Zenodo at DOI: 10.5281/zenodo.10406490. The codes used to obtain the results of this study are publicly available on GitHub at https://github.com/giuliobarl/GoodPhysVariables (accessed on 11 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ML	Machine Learning
DNN	Deep Neural Network
MAE	Mean Absolute Error
RMSE	Root Mean Squared Error
PDF	Probability Density Function
GEV	Generalized Extreme Value

References

  1. Rappaz, M.; Bellet, M.; Deville, M.O.; Snyder, R. Numerical Modeling in Materials Science and Engineering; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  2. Chen, B.; Huang, K.; Raghupathi, S.; Chandratreya, I.; Du, Q.; Lipson, H. Automated discovery of fundamental variables hidden in experimental data. Nat. Comput. Sci. 2022, 2, 433–442. [Google Scholar] [CrossRef] [PubMed]
  3. Floryan, D.; Graham, M.D. Data-driven discovery of intrinsic dynamics. Nat. Mach. Intell. 2022, 4, 1113–1120. [Google Scholar] [CrossRef]
  4. Eva, B.; Ried, K.; Müller, T.; Briegel, H.J. How a Minimal Learning Agent can Infer the Existence of Unobserved Variables in a Complex Environment. Minds Mach. 2023, 33, 185–219. [Google Scholar] [CrossRef] [PubMed]
  5. Chiavazzo, E. Approximation of slow and fast dynamics in multiscale dynamical systems by the linearized Relaxation Redistribution Method. J. Comput. Phys. 2012, 231, 1751–1765. [Google Scholar] [CrossRef]
  6. Chiavazzo, E.; Karlin, I.V. Quasi-equilibrium grid algorithm: Geometric construction for model reduction. J. Comput. Phys. 2008, 227, 5535–5560. [Google Scholar] [CrossRef]
  7. Rayleigh, L., VIII. On the question of the stability of the flow of fluids. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1892, 34, 59–70. [Google Scholar] [CrossRef]
  8. Buckingham, E. On physically similar systems; illustrations of the use of dimensional equations. Phys. Rev. 1914, 4, 345. [Google Scholar] [CrossRef]
  9. Curtis, W.; Logan, J.D.; Parker, W. Dimensional analysis and the pi theorem. Linear Algebra Its Appl. 1982, 47, 117–126. [Google Scholar] [CrossRef]
  10. Chiavazzo, E.; Covino, R.; Coifman, R.R.; Gear, C.W.; Georgiou, A.S.; Hummer, G.; Kevrekidis, I.G. Intrinsic map dynamics exploration for uncharted effective free-energy landscapes. Proc. Natl. Acad. Sci. USA 2017, 114, E5494–E5503. [Google Scholar] [CrossRef]
  11. Chiavazzo, E.; Gear, C.W.; Dsilva, C.J.; Rabin, N.; Kevrekidis, I.G. Reduced models in chemical kinetics via nonlinear data-mining. Processes 2014, 2, 112–140. [Google Scholar] [CrossRef]
  12. Lin, K.K.; Lu, F. Data-driven model reduction, Wiener projections, and the Koopman-Mori-Zwanzig formalism. J. Comput. Phys. 2021, 424, 109864. [Google Scholar] [CrossRef]
  13. McRee, R.K. Symbolic regression using nearest neighbor indexing. In Proceedings of the 12th Annual Conference Companion on Genetic and Evolutionary Computation, New York, NY, USA, 7–11 July 2010; pp. 1983–1990. [Google Scholar]
  14. Stijven, S.; Minnebo, W.; Vladislavleva, K. Separating the wheat from the chaff: On feature selection and feature importance in regression random forests and symbolic regression. In Proceedings of the 13th Annual Conference Companion on Genetic and Evolutionary Computation, Dublin, Ireland, 12–16 July 2011; pp. 623–630. [Google Scholar]
  15. McConaghy, T. FFX: Fast, scalable, deterministic symbolic regression technology. In Genetic Programming Theory and Practice IX; Springer: New York, NY, USA, 2011; pp. 235–260. [Google Scholar]
  16. Arnaldo, I.; O’Reilly, U.M.; Veeramachaneni, K. Building predictive models via feature synthesis. In Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, New York, NY, USA, 11–15 July 2015; pp. 983–990. [Google Scholar]
  17. Brunton, S.L.; Proctor, J.L.; Kutz, J.N. Discovering governing equations from data by sparse identification of nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA 2016, 113, 3932–3937. [Google Scholar] [CrossRef]
  18. Quade, M.; Abel, M.; Nathan Kutz, J.; Brunton, S.L. Sparse identification of nonlinear dynamics for rapid model recovery. Chaos Interdiscip. J. Nonlinear Sci. 2018, 28, 063116. [Google Scholar] [CrossRef]
  19. Searson, D.P.; Leahy, D.E.; Willis, M.J. GPTIPS: An open source genetic programming toolbox for multigene symbolic regression. In Proceedings of the International Multiconference of Engineers and Computer Scientists, Hong Kong, China, 17–19 March 2010; Citeseer. Volume 1, pp. 77–80. [Google Scholar]
  20. Dubčáková, R. Eureqa: Software Review. 2011. Available online: https://www.researchgate.net/publication/220286070_Eureqa_software_review (accessed on 5 May 2024).
  21. Schmidt, M.; Lipson, H. Distilling free-form natural laws from experimental data. Science 2009, 324, 81–85. [Google Scholar] [CrossRef]
  22. Udrescu, S.M.; Tegmark, M. AI Feynman: A physics-inspired method for symbolic regression. Sci. Adv. 2020, 6, eaay2631. [Google Scholar] [CrossRef] [PubMed]
  23. Trezza, G.; Chiavazzo, E. Leveraging composition-based energy material descriptors for machine learning models. Mater. Today Commun. 2023, 36, 106579. [Google Scholar] [CrossRef]
  24. Bonke, S.A.; Trezza, G.; Bergamasco, L.; Song, H.; Rodríguez-Jiménez, S.; Hammarström, L.; Chiavazzo, E.; Reisner, E. Multi-Variable Multi-Metric Optimization of Self-Assembled Photocatalytic CO2 Reduction Performance Using Machine Learning Algorithms. J. Am. Chem. Soc. 2024, 146, 15648–15658. [Google Scholar] [CrossRef]
  25. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774. [Google Scholar]
  26. Al-Helali, B.; Chen, Q.; Xue, B.; Zhang, M. Genetic Programming for Feature Selection Based on Feature Removal Impact in High-Dimensional Symbolic Regression. IEEE Trans. Emerg. Top. Comput. Intell. 2024, 8, 2269–2282. [Google Scholar] [CrossRef]
  27. Branch, M.A.; Coleman, T.F.; Li, Y. A subspace, interior, and conjugate gradient method for large-scale bound-constrained minimization problems. SIAM J. Sci. Comput. 1999, 21, 1–23. [Google Scholar] [CrossRef]
  28. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distribution. Bull. Calcutta Math. Soc. 1943, 35, 99–110. [Google Scholar]
  29. Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhyā Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
  30. Villani, C. Optimal Transport: Old and New; Springer: Berlin/Heidelberg, Germany, 2009; Volume 338. [Google Scholar]
  31. Lide, D.R.; Kehiaian, H.V. CRC Handbook of Thermophysical and Thermochemical Data; CRC Press: Boca Raton, FL, USA, 2020. [Google Scholar]
  32. Virtanen, P.; Gommers, R.; Oliphant, T.E.; Haberland, M.; Reddy, T.; Cournapeau, D.; Burovski, E.; Peterson, P.; Weckesser, W.; Bright, J.; et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods 2020, 17, 261–272. [Google Scholar] [CrossRef] [PubMed]
  33. Tailor, N.K.; Abdi-Jalebi, M.; Gupta, V.; Hu, H.; Dar, M.I.; Li, G.; Satapathi, S. Recent progress in morphology optimization in perovskite solar cell. J. Mater. Chem. A 2020, 8, 21356–21386. [Google Scholar] [CrossRef]
  34. Huan, X.; Marzouk, Y.M. Simulation-based optimal Bayesian experimental design for nonlinear systems. J. Comput. Phys. 2013, 232, 288–317. [Google Scholar] [CrossRef]
  35. Motoyama, Y.; Tamura, R.; Yoshimi, K.; Terayama, K.; Ueno, T.; Tsuda, K. Bayesian optimization package: PHYSBO. Comput. Phys. Commun. 2022, 278, 108405. [Google Scholar] [CrossRef]
Figure 1. Overview of the protocol used to detect possible symmetries of a target property of interest with respect to its input variables, utilizing only data and ignoring the analytical functional dependence. Two distinct methodologies are presented: the former for identifying, in regression tasks, invariant groups in the form $x_i^{\alpha_1} x_j^{\alpha_2} \cdots x_m^{\alpha_p}$, among others; the latter for identifying, in classification tasks, one or several mixed features as power combinations of the input variables to achieve an optimal class separation.
Figure 2. Overview of the procedure for identifying invariant groups/sets. A regression model is trained on the physical data and used to compute the gradient of the objective function at a point $\mathbf{x}_0$. The matrix $B$ is constructed according to the functional structure of the investigated group/set, and its kernel $K$ is computed. Finally, the condition of invariance between gradient and kernel is coupled with the normalization conditions on the coefficients. If the resulting non-linear system is satisfied by the same coefficients over the whole domain of $f(\mathbf{x})$, the group/set is an intrinsic variable and $f(\mathbf{x})$ is invariant with respect to it.
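For illustration, a minimal sketch of the pointwise check summarized in Figure 2 is given below, assuming a trained regressor `model` with a scikit-learn-style `predict` method and a candidate two-variable power group $x_i^{\alpha_1} x_j^{\alpha_2}$; the function name, the finite-difference gradient, and the restriction to a pair are assumptions made here for brevity, standing in for the more general $B$-matrix/kernel construction of the caption. If $f$ depends on $x_i$ and $x_j$ only through the group, then $\alpha_1 : \alpha_2 = x_i\,\partial f/\partial x_i : x_j\,\partial f/\partial x_j$ at every point, so the normalized estimates should be nearly constant over the domain: a low spread signals a found group, and averages far from 0 and 1 a reliable one.

```python
import numpy as np

def estimate_pair_exponents(model, X, i, j, eps=1e-4):
    """Pointwise exponent estimates for a candidate power group x_i^a1 * x_j^a2.

    model : trained regressor exposing predict() on 2D arrays (assumed interface).
    X     : sample points at which the invariance condition is probed.
    """
    estimates = []
    for x in X:
        # Central finite-difference gradient of the model along x_i and x_j.
        grad = np.zeros(2)
        for k, idx in enumerate((i, j)):
            xp, xm = x.copy(), x.copy()
            xp[idx] += eps
            xm[idx] -= eps
            grad[k] = (model.predict(xp[None, :])[0]
                       - model.predict(xm[None, :])[0]) / (2 * eps)
        a = np.array([x[i] * grad[0], x[j] * grad[1]])  # proportional to (a1, a2)
        a = a / np.linalg.norm(a)                       # normalization condition
        if estimates and np.dot(a, estimates[0]) < 0:   # fix the arbitrary overall sign
            a = -a
        estimates.append(a)
    estimates = np.array(estimates)
    return estimates.mean(axis=0), estimates.std(axis=0)
```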
Figure 3. Overview of the procedure to identify optimal mixed variables for class separation. Threshold values are chosen to divide the physical data into classes. A Pareto optimization is performed to construct a reduced set of synthetic features that maximizes the Bhattacharyya distance between the classes while simultaneously (i) minimizing the variances of the class distributions in the one-dimensional case, or (ii) minimizing the determinants of the covariance matrices of the class distributions in the multi-dimensional case.
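As a rough one-dimensional illustration of the separation objective in Figure 3, the snippet below scores a candidate mixed feature $y = \prod_k x_k^{p_k}$ by the Bhattacharyya distance between the two class populations under a Gaussian approximation [28,29]. The function names and the closed-form Gaussian expression are assumptions made for illustration; they are not necessarily the estimator used here, and the full procedure additionally acts on the class variances as a second objective.

```python
import numpy as np

def mixed_feature(X, p):
    """Power-combination feature y = prod_k x_k^{p_k}; assumes strictly positive inputs."""
    return np.prod(X ** p, axis=1)

def bhattacharyya_gaussian(a, b):
    """Bhattacharyya distance between two samples, each approximated as a Gaussian."""
    m1, m2 = a.mean(), b.mean()
    v1, v2 = a.var(), b.var()
    return 0.25 * (m1 - m2) ** 2 / (v1 + v2) + 0.5 * np.log((v1 + v2) / (2.0 * np.sqrt(v1 * v2)))

def separation_score(X, labels, p):
    """Larger values mean better-separated classes along the mixed feature."""
    y = mixed_feature(X, p)
    return bhattacharyya_gaussian(y[labels == 0], y[labels == 1])
```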
Figure 4. Results of the DNN regression model for the noised Nusselt number $\overline{\mathrm{Nu}}$ in the Dittus–Boelter correlation. (a) Predictions over the testing set, and (b) corresponding loss curves for the DNN model. Model performances are shown in terms of the coefficient of determination $R^2$, mean absolute error (MAE), and root mean squared error (RMSE).
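The metrics quoted in the caption are standard regression scores; assuming test-set arrays `y_true` and `y_pred` and the scikit-learn helpers (a tooling assumption, not a statement about the exact pipeline used here), they can be computed as follows.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

def regression_report(y_true, y_pred):
    """Coefficient of determination, mean absolute error, and root mean squared error."""
    r2 = r2_score(y_true, y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # RMSE = sqrt(MSE)
    return r2, mae, rmse
```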
Figure 5. One-dimensional example of binary classification on the Dittus–Boelter correlation. (a) PDFs over binned data of the training set for the two classes ($\overline{\mathrm{Nu}} < 395$ and $\overline{\mathrm{Nu}} \geq 395$) plotted against the normalized flow velocity. (b) PDFs over binned data of the training set for the two classes plotted against the mixed feature $\tilde{y}_1$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the two classes according to the Bhattacharyya distance, along with a GEV analytical fitting of the two binnings. (c) PDFs over binned data of the testing set for the two classes plotted against the same mixed feature $\tilde{y}_1$, together with the same GEV fittings as in subfigure (b). The mixed variable $\tilde{y}_1$ shown here refers exclusively to this Dittus–Boelter one-dimensional optimization.
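The GEV fittings mentioned in panels (b) and (c) can be reproduced in spirit with SciPy [32]; a minimal sketch is shown below, where the use of `scipy.stats.genextreme` and the variable `y_class` (the mixed-feature values of one class) are assumptions for illustration rather than the exact procedure adopted here.

```python
import numpy as np
from scipy.stats import genextreme

def gev_pdf_for_class(y_class, n_grid=200):
    """Fit a generalized extreme value distribution to one class and evaluate its PDF."""
    shape, loc, scale = genextreme.fit(y_class)  # maximum-likelihood GEV fit
    grid = np.linspace(y_class.min(), y_class.max(), n_grid)
    return grid, genextreme.pdf(grid, shape, loc=loc, scale=scale)
```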
Figure 6. Two-dimensional example of binary classification on the Dittus–Boelter correlation. (a) PDFs over binned data of the training set for the two classes ($\overline{\mathrm{Nu}} < 395$ and $\overline{\mathrm{Nu}} \geq 395$) plotted against the normalized flow velocity $u_{\mathrm{norm}}$ and the normalized hydraulic diameter $d_{\mathrm{norm}}$. (b) PDFs over binned data of the training set for the two classes plotted against the mixed features $\tilde{y}_1, \tilde{y}_2$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the two classes according to the Bhattacharyya distance. (c) PDFs over binned data of the testing set for the two classes plotted against the same mixed features $\tilde{y}_1, \tilde{y}_2$. The mixed variables $\tilde{y}_1, \tilde{y}_2$ shown here refer exclusively to this Dittus–Boelter two-dimensional optimization.
Figure 7. One-dimensional example of ternary classification on the Dittus–Boelter correlation. (a) PDFs over binned data of the training set for the three classes ($\overline{\mathrm{Nu}} < 197.5$, $197.5 \leq \overline{\mathrm{Nu}} < 395$, and $\overline{\mathrm{Nu}} \geq 395$) plotted against the normalized flow velocity $u_{\mathrm{norm}}$. (b) PDFs over binned data of the training set for the three classes plotted against the mixed feature $\tilde{y}_1$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the three classes according to the Bhattacharyya distance, along with a GEV analytical fitting of the three binnings. (c) PDFs over binned data of the testing set for the three classes plotted against the same mixed feature $\tilde{y}_1$, together with the same GEV fittings as in subfigure (b). The mixed variable $\tilde{y}_1$ shown here refers exclusively to this Dittus–Boelter one-dimensional optimization.
Figure 8. Results of the DNN regression model for the noised Nusselt number $\overline{\mathrm{Nu}}$ in the Gnielinski correlation. (a) Predictions over the testing set, and (b) corresponding loss curves for the DNN model. Model performances are shown in terms of the coefficient of determination $R^2$, mean absolute error (MAE), and root mean squared error (RMSE).
Figure 9. One-dimensional example of binary classification on the Gnielinski correlation. (a) PDFs over binned data of the training set for the two classes ($\overline{\mathrm{Nu}} < 500$ and $\overline{\mathrm{Nu}} \geq 500$) plotted against the normalized flow velocity. (b) PDFs over binned data of the training set for the two classes plotted against the mixed feature $\tilde{y}_1$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the two classes according to the Bhattacharyya distance, along with a GEV analytical fitting of the two binnings. (c) PDFs over binned data of the testing set for the two classes plotted against the same mixed feature $\tilde{y}_1$, together with the same GEV fittings as in subfigure (b). The mixed variable $\tilde{y}_1$ shown here refers exclusively to this Gnielinski one-dimensional optimization.
Figure 10. Two-dimensional example of binary classification on the Gnielinski correlation. (a) PDFs over binned data of the training set for the two classes ($\overline{\mathrm{Nu}} < 500$ and $\overline{\mathrm{Nu}} \geq 500$) plotted against the normalized flow velocity $u_{\mathrm{norm}}$ and the friction factor $f$. (b) PDFs over binned data of the training set for the two classes plotted against the mixed features $\tilde{y}_1, \tilde{y}_2$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the two classes according to the Bhattacharyya distance. (c) PDFs over binned data of the testing set for the two classes plotted against the same mixed features $\tilde{y}_1, \tilde{y}_2$. The mixed variables $\tilde{y}_1, \tilde{y}_2$ shown here refer exclusively to this Gnielinski two-dimensional optimization.
Figure 11. One-dimensional example of ternary classification on the Gnielinski correlation. (a) PDFs over binned data of the training set for the three classes ($\overline{\mathrm{Nu}} < 400$, $400 \leq \overline{\mathrm{Nu}} < 900$, and $\overline{\mathrm{Nu}} \geq 900$) plotted against the normalized flow velocity $u_{\mathrm{norm}}$. (b) PDFs over binned data of the training set for the three classes plotted against the mixed feature $\tilde{y}_1$, constructed according to Equation (13) by choosing the point of the Pareto front with the least overlap of the three classes according to the Bhattacharyya distance, along with a GEV analytical fitting of the three binnings. (c) PDFs over binned data of the testing set for the three classes plotted against the same mixed feature $\tilde{y}_1$, together with the same GEV fittings as in subfigure (b). The mixed variable $\tilde{y}_1$ shown here refers exclusively to this Gnielinski one-dimensional optimization.
Figure 12. Results of the DNN regression model for the noised gravitational force $F_g$. (a) Predictions over the testing set, and (b) corresponding loss curves for the DNN model. Model performances are shown in terms of the coefficient of determination $R^2$, mean absolute error (MAE), and root mean squared error (RMSE).
Table 1. Normalized exponents $\alpha_1, \alpha_2, \alpha_3$ for the Dittus–Boelter correlation, together with their average estimates over 20 evaluations, $\mu_{\alpha_1}, \mu_{\alpha_2}, \mu_{\alpha_3}$, and the corresponding standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}, \sigma_{\alpha_3}$. A group is marked as found when the standard deviations are low, and as reliable when the averages are far from 0 and 1.
Group | α1 | α2 | α3 | μα1 | σα1 | μα2 | σα2 | μα3 | σα3 | Found | Reliable
(u, d) | 0.71 | 0.71 | - | −0.71 | 0.05 | −0.70 | 0.05 | - | - | yes | yes
(u, ν) | 0.89 | −0.45 | - | 0.91 | 0.04 | −0.42 | 0.07 | - | - | yes | yes
(u, κ) | 0.89 | −0.45 | - | 0.89 | 0.01 | −0.45 | 0.03 | - | - | yes | yes
(d, ν) | 0.89 | −0.45 | - | 0.90 | 0.04 | −0.42 | 0.11 | - | - | yes | yes
(d, κ) | 0.89 | −0.45 | - | 0.87 | 0.02 | −0.50 | 0.04 | - | - | yes | yes
(ν, κ) | −0.71 | −0.71 | - | −0.67 | 0.10 | −0.73 | 0.10 | - | - | yes | yes
(u, d, ν) | 0.67 | 0.67 | −0.33 | 0.68 | 0.03 | 0.65 | 0.04 | −0.33 | 0.07 | yes | yes
(u, d, κ) | 0.67 | 0.67 | −0.33 | 0.66 | 0.03 | 0.64 | 0.03 | −0.37 | 0.03 | yes | yes
(u, ν, κ) | 0.82 | −0.41 | −0.41 | 0.87 | 0.04 | −0.37 | 0.06 | −0.34 | 0.10 | yes | yes
(d, ν, κ) | 0.82 | −0.41 | −0.41 | 0.83 | 0.06 | −0.35 | 0.11 | −0.41 | 0.05 | yes | yes
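The theoretical columns of Table 1 follow from the power-law structure of the correlation; assuming the standard Dittus–Boelter form $\mathrm{Nu} = 0.023\,\mathrm{Re}^{0.8}\,\mathrm{Pr}^{0.4}$ with $\mathrm{Re} = u d/\nu$ and $\mathrm{Pr} = \rho c_p \nu/\kappa$, the Nusselt number scales as

$$\mathrm{Nu} \propto u^{0.8}\, d^{0.8}\, \nu^{-0.4}\, \kappa^{-0.4}\, (\rho c_p)^{0.4},$$

so that, for instance, the $(u,\nu)$ pair carries exponents $(0.8, -0.4)$, which normalize to $(0.8,-0.4)/\lVert(0.8,-0.4)\rVert \approx (0.89,-0.45)$, and the $(u,d,\nu)$ triplet carries $(0.8, 0.8, -0.4)/1.2 \approx (0.67, 0.67, -0.33)$, consistent with the recovered averages.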
Table 2. Normalized exponents $\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2$ for the Gnielinski correlation, together with their average estimates over 20 evaluations, $\mu_{\alpha_1}, \mu_{\alpha_2}, \mu_{\alpha_3}, \mu_{\beta_1}, \mu_{\beta_2}$, and the corresponding standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}, \sigma_{\alpha_3}, \sigma_{\beta_1}, \sigma_{\beta_2}$. A group/set is marked as found when the standard deviations are low, and as reliable when the averages are far from 0 and 1.
Group/Set | α1 | α2 | α3 | β1 | β2 | μα1 | σα1 | μα2 | σα2 | μα3 | σα3 | μβ1 | σβ1 | μβ2 | σβ2 | Found | Reliable
(u, d) | 0.707 | 0.707 | - | - | - | −0.673 | 0.067 | −0.734 | 0.056 | - | - | - | - | - | - | yes | yes
(u, ν) | - | - | - | - | - | 0.801 | 0.022 | −0.597 | 0.030 | - | - | - | - | - | - | yes | yes
(u, κ) | - | - | - | - | - | 0.949 | 0.010 | −0.313 | 0.032 | - | - | - | - | - | - | yes | no
(u, f) | - | - | - | - | - | −0.867 | 0.065 | −0.468 | 0.161 | - | - | - | - | - | - | yes | yes
(d, ν) | - | - | - | - | - | 0.807 | 0.042 | −0.587 | 0.055 | - | - | - | - | - | - | yes | yes
(d, κ) | - | - | - | - | - | 0.948 | 0.015 | −0.314 | 0.049 | - | - | - | - | - | - | yes | no
(d, f) | - | - | - | - | - | 0.868 | 0.068 | 0.479 | 0.111 | - | - | - | - | - | - | yes | yes
(ν, κ) | - | - | - | - | - | −0.918 | 0.035 | −0.389 | 0.076 | - | - | - | - | - | - | yes | no
(ν, f) | - | - | - | - | - | 0.756 | 0.052 | −0.650 | 0.056 | - | - | - | - | - | - | yes | yes
(κ, f) | - | - | - | - | - | 0.467 | 0.033 | −0.883 | 0.017 | - | - | - | - | - | - | yes | yes
[(u, ν), (ν, κ)] | 0.707 | −0.707 | - | 0.707 | −0.707 | 0.730 | 0.047 | −0.680 | 0.054 | - | - | −0.656 | 0.011 | 0.755 | 0.010 | yes | yes
[(d, ν), (ν, κ)] | 0.707 | −0.707 | - | 0.707 | −0.707 | 0.695 | 0.028 | −0.718 | 0.027 | - | - | −0.687 | 0.008 | 0.727 | 0.007 | yes | yes
[(u, d), (u, ν)] | - | - | - | - | - | 0.313 | 0.000 | −0.950 | 0.000 | - | - | −0.950 | 0.001 | 0.313 | 0.002 | yes | no
[(u, d), (d, f)] | - | - | - | - | - | −0.576 | 0.056 | −0.814 | 0.043 | - | - | −0.593 | 0.017 | 0.805 | 0.013 | yes | yes
[(u, ν), (ν, f)] | - | - | - | - | - | 0.872 | 0.045 | −0.482 | 0.074 | - | - | −0.224 | 0.058 | 0.973 | 0.011 | yes | no
[(u, f), (f, d)] | - | - | - | - | - | −0.665 | 0.052 | −0.744 | 0.044 | - | - | −0.503 | 0.036 | 0.863 | 0.020 | yes | yes
[(d, f), (f, ν)] | - | - | - | - | - | 0.860 | 0.016 | −0.510 | 0.026 | - | - | −0.868 | 0.036 | 0.491 | 0.063 | yes | yes
[(u, d, ν), (ν, κ)] | 0.577 | 0.577 | −0.577 | 0.707 | −0.707 | 0.602 | 0.043 | 0.552 | 0.065 | −0.570 | 0.053 | 0.698 | 0.011 | −0.716 | 0.011 | yes | yes
[(u, d, f), (f, ν)] | - | - | - | - | - | 0.580 | 0.002 | 0.660 | 0.014 | −0.476 | 0.023 | 0.817 | 0.004 | −0.576 | 0.006 | yes | yes
[(u, κ, f), (κ, ν)] | - | - | - | - | - | 0.492 | 0.012 | 0.703 | 0.017 | −0.512 | 0.026 | 0.625 | 0.012 | −0.780 | 0.009 | yes | yes
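Only the combinations that enter the correlation through a single dimensionless group carry theoretical values in Table 2, since the Gnielinski correlation is not a pure power law of its inputs. Assuming the usual definitions $\mathrm{Re} = u d/\nu$ and $\mathrm{Pr} = \rho c_p \nu/\kappa$, and that $u$ and $d$ enter only through $\mathrm{Re}$, the reference exponents follow from normalization:

$$(\alpha_1,\alpha_2)_{(u,d)} = \frac{(1,1)}{\lVert(1,1)\rVert} \approx (0.707,\,0.707), \qquad (\alpha_1,\alpha_2,\alpha_3)_{(u,d,\nu)} = \frac{(1,1,-1)}{\lVert(1,1,-1)\rVert} \approx (0.577,\,0.577,\,-0.577),$$

with $(0.707, -0.707)$ obtained analogously for the $(u,\nu)$ and $(\nu,\kappa)$ members of the two-group sets associated with $\mathrm{Re}$ and $\mathrm{Pr}$.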
Table 3. Normalized exponents $\alpha_1, \alpha_2$ for Newton's law of universal gravitation, together with their average estimates over 20 evaluations, $\mu_{\alpha_1}, \mu_{\alpha_2}$, and the corresponding standard deviations $\sigma_{\alpha_1}, \sigma_{\alpha_2}$. A group is marked as found when the standard deviations are low, and as reliable when the averages are far from 0 and 1.
Group | α1 | α2 | μα1 | σα1 | μα2 | σα2 | Found | Reliable
(x1, x2) | 0.707 | −0.707 | −0.679 | 0.111 | 0.721 | 0.083 | yes | yes
(y1, y2) | 0.707 | −0.707 | −0.711 | 0.023 | 0.702 | 0.023 | yes | yes
(z1, z2) | 0.707 | −0.707 | −0.683 | 0.030 | 0.722 | 0.044 | yes | yes
(m1, m2) | - | - | 0.105 | 0.556 | 0.759 | 0.322 | no | -
(m1, x1) | - | - | 0.000 | 0.000 | 1.000 | 0.000 | no | -
(x1, y1) | - | - | −0.604 | 0.369 | 0.335 | 0.622 | no | -
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
