#### *5.2. Computationally Intensive Function Parallelization*

5.2.1. Parallelization of Generative Algorithm for PDE Discovery

The first experiment is devoted to the parallelization of the atomic models' computation, using the discovery of partial differential equations as an example. As shown in Figure 4, parallelization of the evolutionary algorithm does not always yield a significant speed improvement. In cases where the atomic models are computationally expensive, it is expedient to reduce the computation at every node as much as possible.

The experiment [42] was dedicated to the selection of an optimal method of handling the computational grid domain. It had previously been shown that the conventional approach, in which the entire domain is processed at once, is able to correctly discover the governing equation. However, as the size of the domain increases, the calculations may take longer. In this case, parallelization of the evolutionary algorithm gives no speed-up on a given configuration of computational resources, since the computation of the fitness function of a single gene consumes the whole computational capacity.

To solve this issue, we have proposed a method of dividing the domain into a set of spatial subdomains to reduce the computational complexity of a single gene. For each of these subdomains, the structure of the model in the form of a differential equation is discovered, and the results are compared and combined if the equation structures are similar, i.e., they differ only in insignificant coefficient variations or in terms of higher orders of smallness. The main algorithm is executed for the subdomains in parallel, which is possible because the domains are processed in isolation: no connections between subdomains are examined until the final structure of the subdomains' models is obtained.
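The isolated processing of subdomains can be sketched as follows. This is a minimal illustration, not the actual implementation: `discover_equation` is a hypothetical stand-in for the equation-discovery routine, and here it only reports the subdomain shape.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def split_domain(field: np.ndarray, fractions: int) -> list:
    """Split the spatial axes (axes 1 and 2) of a (t, x, y) field
    into fractions x fractions equally sized subdomains."""
    return [block
            for row in np.array_split(field, fractions, axis=1)
            for block in np.array_split(row, fractions, axis=2)]


def discover_equation(subfield: np.ndarray) -> dict:
    """Hypothetical placeholder for the equation-discovery algorithm
    applied to a single subdomain in isolation."""
    return {"shape": subfield.shape}


if __name__ == "__main__":
    field = np.random.rand(80, 80, 80)             # (t, x, y) data grid
    subdomains = split_domain(field, fractions=2)  # 4 spatial subdomains
    # Subdomains are processed independently; the discovered structures
    # are compared and merged only after all of them are obtained.
    with ProcessPoolExecutor() as pool:
        structures = list(pool.map(discover_equation, subdomains))
    print(len(structures))
```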

The experiments analyzing the algorithm's performance were conducted on synthetic data: by ensuring the presence of a single governing equation, we exclude the issue of multiple underlying processes, described by different equations, existing in different parts of the studied domain. We selected a solution of the wave equation with two spatial dimensions, Equation (22), on a square area, which was processed first as one domain and then divided into subdomains.

$$\frac{\partial^2 u}{\partial t^2} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}. \tag{22}$$
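An analytic solution such as u = sin(x) sin(y) cos(√2 t) satisfies Equation (22) and can serve as synthetic input. The sketch below builds such a grid and verifies the equation with finite differences; it is an illustration of how the test data can be constructed, not the actual experimental pipeline.

```python
import numpy as np

# Analytic solution u = sin(x) sin(y) cos(sqrt(2) t) of the 2D wave
# equation u_tt = u_xx + u_yy, sampled on an 80x80x80 (t, x, y) grid.
t = np.linspace(0, np.pi, 80)
x = np.linspace(0, np.pi, 80)
y = np.linspace(0, np.pi, 80)
T, X, Y = np.meshgrid(t, x, y, indexing="ij")
u = np.sin(X) * np.sin(Y) * np.cos(np.sqrt(2.0) * T)

# Check the equation with second-order finite differences; the
# residual on interior points is limited by the truncation error.
dt, dx, dy = t[1] - t[0], x[1] - x[0], y[1] - y[0]
u_tt = np.gradient(np.gradient(u, dt, axis=0), dt, axis=0)
u_xx = np.gradient(np.gradient(u, dx, axis=1), dx, axis=1)
u_yy = np.gradient(np.gradient(u, dy, axis=2), dy, axis=2)
residual = np.abs(u_tt - u_xx - u_yy)[2:-2, 2:-2, 2:-2]
print(residual.max())  # small, bounded by the finite-difference error
```

Noise of a controlled magnitude can then be added to `u` to reproduce the "low but non-negligible" noise setting discussed below.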

However, this division has its downsides: smaller domains contain less data; therefore, disturbances (noise) at individual points have a higher impact on the results. Furthermore, in realistic scenarios, the risk of deriving an equation that describes only a local process increases as the domain size decreases. A Pareto front, indicating the trade-off between equation discrepancy and time efficiency, can be used to find a parsimonious setup for the experiment. On noiseless data (assuming the derivatives are calculated without numerical error), even the data from a single point correctly represent the equation. Therefore, the experiments must be conducted on data with low but non-negligible noise levels.

We conducted the experiments with the partition of data (Figure 11) containing 80 × 80 × 80 values, divided along the spatial axes into fractions from the set {1, 10}. The experiments comprised 10 independent runs for each setup: the size of the input data (the number of subdomains into which the domain was divided) and the sparsity constant, which affects the number of terms in the equation.

**Figure 11.** The results of the experiments on the divided domains. (**a**) Evaluations of the discovered equation quality for different division fractions along each axis (a 2× division splits the domain into 4 square parts); (**b**) domain processing time (relative to the processing of the entire domain) versus the number of subdomains.

The results of the test, presented in Figure 11, give insight into the consequences of processing the domain by parts. It can be noticed that as the data are split into smaller portions, the quality of the discovered equations decreases due to "overfitting" to the local noise. However, in this case, due to higher numerical errors near the boundaries of the studied domain, the base equation derived from the full data has errors of its own. By dividing the area into smaller subdomains, we allow some of the equations to be trained on data with lower numerical errors and, therefore, to have higher quality. The results presented in Figure 11b were obtained only for the iterations of the evolutionary equation discovery algorithm and do not reflect the time differences in other stages, such as preprocessing or further modeling of the process.

We can conclude that the technique of separating the domain into smaller parts and processing them individually can be beneficial both for achieving speed-up via parallelization of the calculations and for avoiding equations derived from high-error zones. In this case, such errors were primarily numerical, but in realistic applications they can be attributed to faulty measurements or to the prevalence of a different process in a local area.

5.2.2. Reducing the Computational Complexity of Composite Models

To execute the next set of experiments, we used the Fedot framework to build composite ML models for classification and regression problems. Several open datasets were used as benchmarks, allowing us to analyze the efficiency of generative design in various situations.

To improve the performance of model building (this issue was noted in Issue 2), different approaches can be applied. First of all, caching techniques can be used. The cache can be represented as a dictionary whose key is the topological description of the model's position in the graph and whose value is a fitted model. Moreover, the fitted data preprocessor can be saved in the cache together with the model. The common structure of the cache is represented in Figure 12.
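The dictionary-based cache described above can be sketched as follows. The class and all names are illustrative assumptions, not the actual Fedot API; the key combines the model type with its topological position in the composite graph, and the value stores the fitted model together with its fitted preprocessor.

```python
class FitCache:
    """Illustrative sketch of a fitted-model cache (not the Fedot API)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(model_type: str, node_path: tuple) -> tuple:
        # node_path is an assumed topological descriptor of the node's
        # position in the graph, e.g. the chain of ancestor model types.
        return (model_type, node_path)

    def get(self, model_type, node_path):
        # Returns (fitted_model, preprocessor) on a hit, None on a miss.
        return self._store.get(self._key(model_type, node_path))

    def put(self, model_type, node_path, fitted_model, preprocessor):
        self._store[self._key(model_type, node_path)] = (fitted_model,
                                                         preprocessor)


cache = FitCache()
cache.put("xgboost", ("scaling", "pca"), fitted_model="model-object",
          preprocessor="preproc-object")
print(cache.get("xgboost", ("scaling", "pca")))  # cache hit
```

A shared cache follows the same structure, with `_store` replaced by a storage accessible to all parallel workers.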

The results of the experiments with different implementations of the cache are presented in Figure 13.

**Figure 13.** The total number of model fit requests and the actually executed fits (cache misses) for the shared and local cache.

The local cache reduces the number of model fits by up to five times compared to the non-cached variant. The shared cache implementation is twice as effective as the local cache.

The parallelization of composite model building, fitting, and application also makes it possible to decrease the time devoted to the design stage. It can be achieved in different ways. First of all, the fitting and application of the atomic ML models can be parallelized using the features of the underlying framework (e.g., Scikit-learn, Keras, TensorFlow, etc. [43]), since the atomic models can be very complex. However, this approach is more effective in shared-memory systems, and it is hard to scale to distributed environments. Moreover, not all models can be efficiently parallelized in this way.

Then, the evolutionary algorithm that builds the composite model can itself be parallelized, since the fitness function for each individual can be calculated independently. To conduct the experiment, the classification benchmark based on the credit scoring problem (https://github.com/nccr-itmo/FEDOT/blob/master/cases/credit\_scoring\_problem.py) was used. The parameters of the evolutionary algorithm are the same as described at the beginning of the section.
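This population-level parallelism can be sketched as follows. The fitness stand-in below is a toy assumption; in the experiment it corresponds to fitting the composite model and scoring it with ROC AUC.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_fitness(individual: dict) -> float:
    """Toy stand-in for fitting a composite model and computing its
    fitness; the real evaluation returns e.g. the ROC AUC score."""
    depth = len(individual["nodes"])
    return 1.0 / (1.0 + depth)


def assess_population(population, n_workers=4):
    # Each individual's fitness is independent of the others, so the
    # whole population can be assessed in parallel worker processes.
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(evaluate_fitness, population))


if __name__ == "__main__":
    population = [{"nodes": ["logit"]},
                  {"nodes": ["scaling", "logit"]},
                  {"nodes": ["scaling", "pca", "xgboost"]}]
    print(assess_population(population))
```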

The obtained values of the fitness function for the classification problem are presented in Figure 14.

**Figure 14.** (**a**) The best achieved fitness value for different computational configurations (represented as different numbers of parallel threads) used to evaluate the evolutionary algorithm on the classification benchmark. The boxplots are built for 10 independent runs. (**b**) The Pareto frontier (blue) obtained for the classification benchmark in the "execution time-model quality" subspace. The red points represent dominated individuals.

The effectiveness of the evolutionary algorithm parallelization depends on the variance of the composite models' fitting time within the population. This matters because a new population cannot be formed until all individuals from the previous one are assessed. The problem is illustrated in Figure 15 for cases (a) and (b), which were evaluated with the classification dataset and the parameters of the evolutionary algorithm described above. It can be noted that the modified selection scheme shown in (b) can be used to increase parallelization efficiency. Early selection, mutation, and crossover of the already processed individuals make it possible to start processing the next population before the assessment of the previous one is finished.
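The early-selection idea can be illustrated with the following sketch (an assumed, simplified scheme, not the actual Fedot implementation): offspring are bred from already-assessed individuals while other evaluations are still running, so workers never sit idle waiting for the slowest individual.

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def evaluate(individual):
    return sum(individual)  # toy stand-in for the fitness function


def crossover(a, b):
    return [min(x, y) for x, y in zip(a, b)]  # toy stand-in operator


def steady_state_step(population, n_workers=4, n_offspring=8):
    """Assess a population and breed offspring without waiting for the
    whole generation to finish (early selection)."""
    assessed = []
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending = {pool.submit(evaluate, ind): ind for ind in population}
        produced = 0
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                assessed.append((fut.result(), pending.pop(fut)))
            # Early selection: breed from what is already assessed
            # while other individuals are still being evaluated.
            while produced < n_offspring and len(assessed) >= 2:
                (_, a), (_, b) = random.sample(assessed, 2)
                child = crossover(a, b)
                pending[pool.submit(evaluate, child)] = child
                produced += 1
    return assessed


results = steady_state_step([[1, 2], [3, 4], [5, 6]])
print(len(results))  # 3 parents + 8 offspring = 11
```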

**Figure 15.** (**a**) The comparison of different scenarios of evolutionary optimization: best (ideal), realistic, and worst cases; (**b**) the conceptual dependence of the parallelization efficiency on the variance of the execution time within a population for different types of selection.

The same logic can be applied to the parallel fitting of parts of the composite model graphs. This raises the problem of assessing the importance of structural subgraphs and of predicting the most promising candidate models before the final evaluation of the fitness function is done.

#### *5.3. Co-Design Strategies for the Evolutionary Learning Algorithm*

The co-design of the generative algorithm and the available infrastructure is an important issue (described in detail in Issue 3) in the task of composite model optimization. An interesting case here is optimization under pre-defined time constraints [44]. The experimental results obtained for two different optimization strategies are presented in Figure 16. The classification problem was solved using the credit scoring problem (described above) as a benchmark. The parameters of the evolutionary algorithm are the same as described at the beginning of the section. The fitness function is based on the ROC AUC measure and is maximized during optimization.

The static strategy *S*<sub>1</sub> represents evolutionary optimization with fixed hyperparameters of the algorithm. The computational infrastructure used in the experiment makes it possible to evaluate 20 generations with 20 individuals in the population within a time limit *T*<sub>0</sub>. This strategy allows finding a solution with the fitness function value *F*<sub>0</sub>. However, if a time limit *T*<sub>1</sub> < *T*<sub>0</sub> is imposed, the static strategy only finds a solution with the fitness function value *F*<sub>1</sub>, where *F*<sub>1</sub> < *F*<sub>0</sub>.

Otherwise, the adaptive optimization strategy *S*<sub>2</sub>, which takes the characteristics of the infrastructure into account to self-tune its parameters, can be used. It allows evaluating 20 generations with 10 individuals within the time limit *T*<sub>1</sub> and reaching the fitness function value *F*<sub>2</sub>. As can be seen, *F*<sub>1</sub> < *F*<sub>2</sub> < *F*<sub>0</sub>, so a better solution is found under the given time constraint.
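The self-tuning step of the adaptive strategy can be illustrated with a small sketch. The formula and all parameter names are illustrative assumptions, not the actual Fedot logic: the population size is derived from the time budget, the planned number of generations, and an estimate of the per-individual evaluation time.

```python
def adapt_population_size(time_limit_s: float,
                          n_generations: int,
                          mean_eval_s: float,
                          n_workers: int,
                          max_pop_size: int = 20) -> int:
    """Illustrative tuning rule: pick the largest population such that
    the planned number of generations fits the time limit, assuming
    individuals are evaluated in parallel waves of n_workers."""
    budget_per_gen = time_limit_s / n_generations
    waves = max(1, int(budget_per_gen // mean_eval_s))
    return max(2, min(max_pop_size, waves * n_workers))


# With a generous limit the full population is kept; with a tighter
# limit the strategy falls back to a smaller one (20 generations with
# 10 individuals instead of 20 x 20).
print(adapt_population_size(time_limit_s=1000, n_generations=20,
                            mean_eval_s=25.0, n_workers=10))  # 20
print(adapt_population_size(time_limit_s=500, n_generations=20,
                            mean_eval_s=25.0, n_workers=10))  # 10
```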

**Figure 16.** The comparison of different approaches to the evolutionary optimization of composite models. The min-max intervals are built for 10 independent runs. The green line represents the static optimization algorithm with 20 individuals in the population; the blue line represents the dynamic optimization algorithm with 10 individuals in the population. *T*<sub>0</sub>, *T*<sub>1</sub>, and *T*<sub>2</sub> are different real-time constraints; *F*<sub>0</sub>, *F*<sub>1</sub>, and *F*<sub>2</sub> are the values of the fitness function obtained with the corresponding constraints.
