*Proceeding Paper* **AGI's Hierarchical Component Approach to Unsolvable by Direct Statistical Methods Complex Problems †**

**Vladimir Smolin \* and Sergey Sokolov**

Keldysh Institute of Applied Mathematics, Russian Academy of Sciences, Miusskaya Sq., 4, Moscow 125047, Russia; sokolsm@list.ru


**Abstract:** The remarkable advances in deep neural networks (DNNs) over the past 10 years have made it possible, given enough data and computing power, to solve unexpectedly complex problems. But DNNs do not explicitly use decomposition, the main method for making progress on complicated tasks. Automatic decomposition of complex scenes can be carried out on the basis of mapping by a neural network. The problem is that the state spaces of complex objects and phenomena cannot be mapped. Hierarchical division of a complex scene into simple components can be the key to solving this problem. A hierarchically organized structure of maps of simple objects and phenomena at different abstraction levels can make it possible to solve problems in a complex environment whose properties cannot all be revealed directly by statistical methods. The operation modes of such a hierarchical structure can be correlated with terms used in philosophy and psychology.

**Keywords:** artificial general intelligence (AGI); decomposition; AGI agent

### **1. Introduction**

The neural network revolution in machine learning, based on the use of deep neural networks, has led to tremendous progress in the AI field. This success is usually attributed to the development of parallel computing methods, the collection of big data, and improvements in neural network structures and the algorithms embedded in them.

The central training algorithm for modern neural networks that solve applied problems is the backpropagation error (BPE) method, which implements the idea of gradient descent. The first ideas of training multilayer networks were expressed by Rosenblatt [1], improved by Rumelhart [2], and were formulated close to modern concepts as early as 1986 in [3]. There were other publications containing similar learning algorithms for neural networks [4].

But it took almost 25 years for BPE methods to demonstrate their effectiveness in solving practical problems [5]. Without batch normalization [6], dropout, and a number of other complementary algorithms, BPE does not allow complex applied problems to be solved. And even the entire arsenal of modern algorithms yields no result when applied to a neural network with a random structure. This raises the question:

#### *1.1. Is BPE Optimization the Best Way to Learn?*

Gradient descent implemented by BPE is an optimization algorithm. Since BPE is used to train deep neural networks, the idea of optimizing their parameters along the anti-gradient is an obvious one. Optimization means improvement, but the improvement can be small or significant, fast or slow, and it has a number of parameters that allow different optimization methods to be compared on various problems. Twenty-five years of BPE improvements have made it possible to use this method for solving applied problems, and in the past 10 years, with commercial financial support, progress has become even more noticeable.

**Citation:** Smolin, V.; Sokolov, S. AGI's Hierarchical Component Approach to Unsolvable by Direct Statistical Methods Complex Problems. *Eng. Proc.* **2023**, *33*, 67. https://doi.org/10.3390/engproc2023033067

Academic Editors: Askhat Diveev, Ivan Zelinka, Arutun Avetisyan and Alexander Ilin

Published: 10 October 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### *1.2. Are There Other Ideas besides Gradient Descent?*

The fact that the development of BPE ideas has continued for over 35 years shows that optimization is not limited to gradient descent.

Another matter is that BPE does not guarantee that an optimal transformation is formed, nor that methods such as localization, decomposition, and linearization are applied.

Neural networks with tens and hundreds of layers successfully solve many applied problems but require large hardware and computational costs. The article [7] shows that, just by analyzing how efficiently individual parameters are used, 98% of the parameters can be turned off without losing transformation accuracy.

Moving from the accidental emergence of advanced optimization methods to their purposeful use makes it possible to increase the efficiency of parameter use in neural networks (that is, to reduce the number of parameters required) by thousands of times.

### **2. Mapping**

Neural networks are universal converters that map the input signal vector *X* to the output vector *Y*. For such a mapping, different *Y* must correspond to different states *A* of the internal neural network activity. For this, it is sufficient (but not necessary) that different *X* produce different *A*.

BPE does not address this condition explicitly. In textbooks [8], it is customary to write that dropout simply improves the performance of BPE. There is reason to believe that the improvement is due to the fact that if different neural network parameter subsets are trained on different *X*, then diverse *A* will be formed for different *X*. But still, using dropout just makes it more likely for this condition to be met.

Mapping [9] is an algorithm that deterministically ensures that different *X* produce different *A*.

#### *2.1. Low-Dimensional Maps*

Neural network mapping algorithms go back to the k-means method [10]:

$$
\Delta \vec{M}_i = \eta \left( \vec{X} - \vec{M}_i \right), \quad \eta \ll 1 \tag{1}
$$

where *M<sub>i</sub>* is the vector of input connection weights of the *i*-th element and *X* is the input vector.

The *i*-th element is chosen using the WTA (winner takes all) method; that is, in k-means, only the connections *M<sub>i</sub>* of the most active element are changed. Improvements to the k-means method in neural network mapping algorithms come down to ensuring that all elements of the neural layer participate in mapping the input vectors *X*.
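The update rule (1) with WTA selection can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that the "most active" element is the one whose weight vector is closest to the input in Euclidean distance, and the function name `wta_update` is hypothetical.

```python
import numpy as np

def wta_update(M, x, eta=0.05):
    """One online step of Eq. (1): move only the winning element toward x.

    M   : (n_elements, dim) array of weight vectors M_i, modified in place
    x   : (dim,) input vector X
    eta : learning rate, eta << 1
    """
    # Assumption: the winner is the element whose weights are nearest to x.
    distances = np.linalg.norm(M - x, axis=1)
    winner = int(np.argmin(distances))
    # Delta M_i = eta * (X - M_i), applied to the winner only.
    M[winner] += eta * (x - M[winner])
    return winner

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 2))      # 4 elements, 2-dimensional input
x = np.array([1.0, -1.0])
before = M.copy()
winner = wta_update(M, x)
```

After the call, only row `winner` of `M` differs from `before`, and it lies slightly closer to `x`.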

Low-dimensional neural network mapping (dimensions in the range 2–3) is widely used for data visualization. But the limited use of neural network maps in solving complex problems is due not so much to the difficulty of visualizing them as to the impossibility of mapping input signal spaces *X* of high dimension, on the order of 15–20.

Note that the dimension of *X* spaces is determined not by the number of sensors, but by the properties of the observed object or phenomenon, because *X* is a subspace in the space of theoretically possible sensor activities.

#### *2.2. Uneven Mapping*

As a result of training, k-means and neural network maps divide the input signal state spaces into subdomains, and each is characterized by approximately the same probability (frequency) of the vector *X* hitting it. Such mapping will be called uniform.

But to solve the transformation problem *X* → *Y*, a more accurate mapping of *X* (a representation by smaller areas) is needed not so much where *X* appears more often as where the error Δ*Y* is greater than average. This can be ensured by making the learning rate depend on Δ*Y*, for example:

$$
\Delta \vec{M}_i = \eta \left(\Delta \vec{Y}\right)^2 \left(\vec{X} - \vec{M}_i\right), \quad \eta \ll 1 \tag{2}
$$
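A minimal sketch of the error-weighted update (2) follows. Two things here are assumptions that the text does not state: the square of Δ*Y* is read as the squared Euclidean norm of the output error, and the effective rate is capped at 1 so that a large error cannot push the winner past the input. The name `error_weighted_update` is hypothetical.

```python
import numpy as np

def error_weighted_update(M, x, delta_y, eta=0.05):
    """One step of Eq. (2): the winner moves faster where the output error is large."""
    winner = int(np.argmin(np.linalg.norm(M - x, axis=1)))
    # Assumption: (Delta Y)^2 is the squared norm of the output error vector.
    rate = eta * float(np.dot(delta_y, delta_y))
    # Safeguard not in the source: cap the rate to avoid overshooting x.
    rate = min(rate, 1.0)
    M[winner] += rate * (x - M[winner])
    return winner, rate

M_small = np.array([[0.0, 0.0], [1.0, 1.0]])
M_large = M_small.copy()
x = np.array([0.2, 0.0])
_, rate_small = error_weighted_update(M_small, x, np.array([0.1]))
_, rate_large = error_weighted_update(M_large, x, np.array([2.0]))
```

With the same input, the element trained under the larger error moves much further, so regions of high error end up represented by more, smaller areas.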

### **3. Mapping as Transformation**

Neuromapping transforms the input vector *X* into the mapping layer activity *A*. Under the standard WTA rule, one element of the mapping layer has activity equal to 1 and all the other elements have activity 0. If we simply connect each element of the mapping layer to the output layer and set the output connection weights *L<sub>i</sub>* equal to the average value of *Y* over the area of *X* where *A<sub>i</sub>* = 1, then we obtain a zero-order approximation of the transformation *X* → *Y*.
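The zero-order approximation described above can be illustrated with a short sketch. It assumes nearest-weight WTA selection and a fixed, already trained set of weight vectors; the function names are hypothetical.

```python
import numpy as np

def fit_zero_order(M, X, Y):
    """Set each output weight L_i to the mean Y over the inputs won by element i."""
    # Winner index for every training input (nearest weight vector).
    winners = np.argmin(
        np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2), axis=1
    )
    L = np.zeros((M.shape[0], Y.shape[1]))
    for i in range(M.shape[0]):
        mask = winners == i
        if mask.any():                 # elements that never win keep L_i = 0
            L[i] = Y[mask].mean(axis=0)
    return L

def predict_zero_order(M, L, x):
    """Piecewise constant output: the stored mean of the winning element."""
    winner = int(np.argmin(np.linalg.norm(M - x, axis=1)))
    return L[winner]

M = np.array([[0.25], [0.75]])               # two elements on a 1-D input
X = np.array([[0.1], [0.3], [0.6], [0.9]])
Y = 2.0 * X                                  # target transformation Y = 2X
L = fit_zero_order(M, X, Y)
y_hat = predict_zero_order(M, L, np.array([0.2]))
```

Here the first element wins the inputs 0.1 and 0.3 and stores their mean output 0.4, which is what the approximation returns anywhere in that area.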

If only transformations *X* → *Y* of dimension 1 or 2 were considered, everything would be simple. But for *X* of dimension 5, 10, and especially 15, increasing the accuracy requires exponentially increasing costs. Moving to at least a piecewise linear approximation gives great savings in the number of mapping layer elements.

#### *3.1. Piecewise Linear Transformation*

For piecewise linear transformations, it is necessary to go from WTA to kWTA, in which not one but *k* of the most active elements of the mapping layer are selected. This is not difficult if the value of *k*, the dimensionality of *X<sub>i</sub>*, can be determined.

In addition, to implement a (multidimensional) piecewise linear approximation of the transformation *X* → *Y*, the activities *A<sub>i</sub>* must differ from 1 and depend on the relation between *M<sub>i</sub>* and *X*. Moreover, it is necessary to estimate *G<sub>i</sub>* = max(*A<sub>i</sub>*) over all *X*. Then both *k* and all *A<sub>i</sub>* can be found from the equality:

$$
S_k = \sum_{i=1}^{k} \frac{A_i}{G_i} = 1 \tag{3}
$$

in which *A<sub>i</sub>* = *M<sub>i</sub>X* + *B<sub>i</sub>* − *T*, where *B<sub>i</sub>* and *M<sub>i</sub>* are the parameters of the layer elements learned by the threshold mapping algorithm, and *k* and *T* are chosen as follows: *T* = *M<sub>k</sub>X* + *B<sub>k</sub>* is first taken so that *S<sub>k</sub>* ≥ 1 and *S<sub>k−1</sub>* < 1. This determines the value of *k*; the exact value of *T* that satisfies equality (3) is then obtained by solving a linear equation.
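One possible reading of this threshold selection is sketched below. It is an interpretation, not the authors' algorithm: the raw activities *M<sub>i</sub>X* + *B<sub>i</sub>* are sorted in descending order, elements below the threshold are clipped to zero activity, and *T* is the solution of the linear equation that makes the sum in (3) equal to 1 over the active elements. The function name `threshold_kwta` is hypothetical, and the returned count is the number of elements with strictly positive activity, which may differ by one from the *k* in the text (where the *k*-th element sits exactly at the preliminary threshold).

```python
import numpy as np

def threshold_kwta(raw, gain):
    """Find T such that sum_i max(raw_i - T, 0) / G_i = 1, as in Eq. (3).

    raw  : raw activities M_i X + B_i of the mapping layer elements
    gain : positive normalizers G_i
    """
    raw = np.asarray(raw, dtype=float)
    gain = np.asarray(gain, dtype=float)
    order = np.argsort(-raw)                    # most active elements first
    a, g = raw[order], gain[order]
    # Candidate T if exactly the first m sorted elements are active:
    # sum_{i<=m} (a_i - T) / g_i = 1  ->  T = (sum a_i/g_i - 1) / sum 1/g_i
    candidates = (np.cumsum(a / g) - 1.0) / np.cumsum(1.0 / g)
    # The largest m whose last element still exceeds its candidate threshold.
    n_active = int(np.nonzero(a > candidates)[0].max()) + 1
    T = float(candidates[n_active - 1])
    activity = np.maximum(raw - T, 0.0)         # A_i, clipped at zero
    return n_active, T, activity

raw = np.array([1.0, 0.8, 0.1])
gain = np.ones(3)
n_active, T, activity = threshold_kwta(raw, gain)
```

In this example two elements stay active with a threshold of 0.4, and their normalized activities 0.6 and 0.4 sum to 1.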

#### *3.2. Nonlinear Transformations*

The disadvantage of piecewise linear transformations is their kinks, which significantly reduce the accuracy with which smooth transformations are approximated. This shortcoming can be overcome by a nonlinear approximation obtained by averaging the approximations of *k* neuromaps; moreover, each neuromap can have more than *k* times fewer elements.

#### *3.3. Mapping Limitations*

The transition from the zero-order approximation to the piecewise linear approximation extends the mapping of input signal state spaces *X* from dimensions in the range 2–3 to the range 5–10, and nonlinear approximation extends it to 12–15. Perhaps more subtle approximation methods will make it possible to move a little further into the region of high dimensions. In any case, increasing the dimension of *X* leads to an exponential rise in costs; advanced methods only reduce the base and the coefficient of the exponential function.

If decomposition is possible and there are no significant time restrictions on using several maps (which takes longer than an approximation based on one map), then it is always more efficient to create several low-dimensional maps instead of one high-dimensional map. This requires significantly lower hardware costs and less training time.
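The scale of the saving can be illustrated with simple arithmetic. The sketch below assumes a crude cost model that is not taken from the article: a map resolving each axis into a fixed number of cells needs that number raised to the power of the dimension, and the input space is assumed to split exactly into independent components of equal dimension. All names are hypothetical.

```python
def elements_needed(cells_per_axis, dimension):
    """Elements in a grid-like map: exponential in the dimension (assumed model)."""
    return cells_per_axis ** dimension

def compare_costs(cells_per_axis, dimension, n_maps):
    """Cost of one high-dimensional map versus n_maps equal low-dimensional maps."""
    if dimension % n_maps != 0:
        raise ValueError("dimension must divide evenly among the maps")
    single = elements_needed(cells_per_axis, dimension)
    split = n_maps * elements_needed(cells_per_axis, dimension // n_maps)
    return single, split

# One 12-dimensional map versus four 3-dimensional maps, 4 cells per axis.
single, split = compare_costs(cells_per_axis=4, dimension=12, n_maps=4)
print(single, split)
```

Under these assumptions the single map needs 16,777,216 elements while the four small maps need 256 in total.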

Conclusion: the main way to use neuromapping is to decompose complex scenes and tasks into simpler (low-dimensional) components, and to localize and linearize these components.

### **4. Localization, Decomposition, Linearization**

Decomposition is the most important condition for using neural mapping in applied problems. Localization speeds up the learning process and reduces computational costs. Linearization of transformations makes it possible to move further into high dimensions.

#### *4.1. Localization*

Modern neural networks operate on a "distributed" memory model. BPE does not provide a special algorithm for localizing changes, although all changes do occur locally, on the connection weights. For efficient operation, a "sparse" representation is needed so that the changes made for different input signals do not interfere with each other.

Mapping provides competitive WTA or kWTA activation and records input signals in the connection weights of different hidden layer elements. Decomposition distributes the properties of the various components of a complex scene across different neural network maps, which further localizes the data.

#### *4.2. Decomposition*

Because complex scenes cannot be mapped, they are decomposed into simple components for which mapping is possible. But to perceive a complex signal, it must be compared with the sum of the simple components memorized earlier. That is, the maps remember not only the transformation *X<sub>i</sub>* → *Y<sub>i</sub>* but also *X<sub>i</sub>* → *X̃<sub>i</sub>*, where *Y<sub>i</sub>* and *X̃<sub>i</sub>* are the output and input vectors stored on the *i*-th map, corresponding to the *i*-th component of the input signal *X*. Then we can extract *X<sub>i</sub>* from the complex signal *X*:

$$
\vec{X}_i(t) = \vec{X}(t) - \sum_{j \neq i}^{N} \vec{X}_j(t) \tag{4}
$$

Similar ideas for splitting the input signal were expressed in other works [11,12], but without the use of mapping they were not developed further. Successful decomposition of complex signals is possible when maps of the complex signal's components exist. Otherwise, a complex situation cannot be represented as a sum of simple components.
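Equation (4) amounts to subtracting every other component from the complex signal. A minimal sketch, assuming the other components are already available as vectors of the same shape (for example, recalled from their maps); the name `extract_component` is hypothetical:

```python
import numpy as np

def extract_component(x, components, i):
    """Eq. (4): X_i = X - sum of all components X_j with j != i."""
    others = np.zeros_like(x, dtype=float)
    for j, component in enumerate(components):
        if j != i:
            others += component
    return x - others

# A complex signal built as the sum of three simple components.
components = [
    np.array([1.0, 0.0, 0.0]),
    np.array([0.0, 2.0, 0.5]),
    np.array([0.3, 0.0, 1.0]),
]
x = sum(components)
x1 = extract_component(x, components, 1)
```

When the signal really is the sum of the known components, the extracted vector equals the stored component exactly; any residual indicates that the available maps do not fully describe the scene.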

#### *4.3. Linearization*

Our world is in general described by nonlinear laws. But, for example, most of the laws of physics are reduced to a linear form by taking a logarithm. Similarly, a significant part of the transformations carried out by neural networks can also be linearized.

Linearization makes it possible not only to describe transformations more compactly but also to perform transformations of a higher dimension. But the main property of linearized transformations is that the components of *X<sub>i</sub>* contribute independently to the change in *Y<sub>i</sub>*. This makes it possible to select the essential variables in *X<sub>i</sub>*.
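The remark about logarithms can be made concrete with a power law, which is nonlinear in its inputs but linear in log coordinates. The sketch below uses synthetic data and an arbitrarily chosen law; both are illustrative assumptions, not material from the article.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 10.0, size=200)
x2 = rng.uniform(1.0, 10.0, size=200)

# A power law: nonlinear in x1 and x2.
y = 3.0 * x1**2.0 * x2**-1.0

# After taking logarithms: log y = log 3 + 2 * log x1 - 1 * log x2,
# so each input contributes to log y independently of the other.
A = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
print(coef)
```

Ordinary least squares recovers the intercept log 3 and the exponents 2 and −1. A coefficient near zero would mark the corresponding variable as inessential, which is one way to read the selection of essential variables mentioned above.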

### **5. Hierarchical Structure**

Dividing complex signals into simple components implies the reverse process: restoring the signal from its components. Such restoration is partially used at each level when calculating (4), but it can be implemented more completely in a hierarchical structure. Even more important, new variables can be distinguished in the maps of the lower levels: the map coordinates, which abstractly represent the properties of the lower-level signals. The upper levels can then compose complex signals from these more abstract descriptions and reveal dependencies in them by means of decomposition and mapping.

#### *5.1. High-Level Abstraction*

Repeating the mapping procedure and extracting the map coordinates of signal components at each hierarchy level opens the way to forming high-level abstractions directly from sensor activity, without any human participation.

For the hierarchy of maps to form successfully, effectors are also needed to influence the outside world, along with a block that evaluates the results of actions, but a description of effectors and evaluation is beyond the scope of this short article.

But the central idea for describing a local part of a complex world remains its decomposition into components, with maps indirectly describing the changes in sensor activity and in more abstract variables, namely the coordinates of maps at different levels.

#### *5.2. Abstract Processes Modeling*

Mapping makes it possible to identify relatively simple dependencies between signals at different hierarchy levels and to build models of complex signal components, regardless of the abstraction level.

Maps can describe not only static but also dynamic properties of objects. This makes it possible not only to observe the current state of a scene but also to predict the development of this or another scene, not necessarily related to the current observation.

The need to model various options (depending on the selected actions) for how the components of a complex scene will develop arises because statistics on complex scenes cannot be collected and modeled directly, without decomposition. But it is not necessary to model complex scenes in full detail: it is enough first to use high-level variables to compare different options and, only after choosing the best ones, to consider the chosen variants in detail, closer to the real-world variables.

### **6. Screens**

The possible state vectors of the simple components that make up complex signals can be stored in the structure of a map (this is what distinguishes simple components from complex ones). Complex signals almost never repeat, and it makes no sense to memorize them. Yet it is precisely complex signals that need to be analyzed and transformed, and mapping cannot cope with them. To work with complex signals that are inaccessible to mapping, special structures for displaying complex signals are needed: screens. Screens are intended not for visualizing complex signals (which is impossible for abstract representations at high hierarchy levels) but for processing complex signals, forming them, and transmitting them between levels of the hierarchical structure of maps.

#### *6.1. What Can Scenes Seen for the First Time Be Compared to?*

The never-repeating complex signal vectors *X* can be compared with the sum of the *X̃<sub>i</sub>* stored in the maps of simple objects and phenomena. The comparison, according to (4), is aimed at identifying individual simple components in the structure of the complex signal.

At higher levels of the hierarchy, complex signals are formed from the coordinates of lower-level maps. These coordinates are abstract descriptions of the essential properties of objects. Complex high-level signals are constructed not only from directly observed objects and phenomena; objects whose state is estimated by modeling, using the dynamic properties reflected in their maps, can be used too.

The screen also selects which maps to use to form a complex signal. This is similar to attention, directed not only to observable but also to unobservable objects that affect the choice of actions. An equally important function of the screen is to track progress toward the goals set for the lower levels. If the process is going successfully, the upper levels can be switched to modeling actions not related to the current state of the lower-level scene. Otherwise, it is necessary to try to change the current goal.

#### *6.2. How Can Abstract Goals Be Turned into Real Actions?*

Forming the actions of an AGI agent's effectors and muscles reduces to creating control signals. Effectors must also have internal sensors and, from the control point of view, are part of the world. Maps of the states of simple objects can be formed from the data of both external and internal sensors. For the lowest hierarchy levels, associated with effectors, the transformation *X<sub>i</sub>* → *Y<sub>i</sub>* consists in generating a direct control signal *Y<sub>i</sub>* based on the current state of the effectors *X<sub>i</sub>*. But this is not the only way to determine actions. We need an action goal *X̃<sub>i</sub>*, which determines the desired changes in the state of the effectors. This goal comes from higher levels.

If the goal *X̃<sub>i</sub>* has been reached repeatedly from a similar state *X<sub>i</sub>*, then (after learning) a goal that is very different from the current state can be set, and the map will cope with achieving it. If the goal is being set for the first time, then the upper levels should plan small changes to the current state of the effectors so that the goal can be reached using the properties already available in the map.

As we move to higher hierarchy levels, goals become more abstract. First, instead of controlling effectors, changes to objects in the external world are planned, then the organization of processes, and so on. Goals are formed on the basis of modeling at varying degrees of abstraction, depending on how close the chosen goal is to being achieved and how specific the actions needed for this are.

### **7. Intuition and Thinking**

Considering an objective neural network model (a multi-level hierarchical structure of neuromaps) from a third-person perspective offers a new, non-subjective look at a number of philosophical and psychological concepts. This gives hope of closing the gap between neuronal activity and higher-level concepts such as consciousness, thinking, and intuition. The easiest to explain are intuition and thinking.

#### *7.1. Intuition*

Intuition corresponds to actions and decisions taken without thinking, on the basis of the transformations *X<sub>i</sub>* → *Y<sub>i</sub>* available in the maps. This is not just pulling a hand away from something hot (which can be attributed to reflexes) but acting with the complexity of the current situation taken into account. If there is no time for thinking, then the hierarchical component structure of maps can rely on transformations *X<sub>i</sub>* → *Y<sub>i</sub>* that use all levels to change goals and perform actions to achieve them: an analogue of intuition.

#### *7.2. Thinking*

Complex scenes do not repeat themselves, so predicting their development without simulation is extremely inaccurate. Still, one can act on the basis of previous experience without thinking about it. If the situation has not been encountered before, then it is better to simulate how it would develop under various actions (based on the available component maps). This makes it possible to choose much better actions based on modeling their consequences: an analogue of thinking.

### **8. Consciousness, Understanding and Emotions**

More difficult to explain are concepts that are not directly aimed at choosing actions. It is not clear why consciousness, understanding, and emotions are needed at all, given that BPE and RL (reinforcement learning) exist and seem to be enough to solve any problem [13]. For simple tasks, they are indeed enough.

For complex problems that cannot be solved directly using statistical methods, it is extremely useful to model the results of actions before they are performed. This requires solving the problem of controlling the operation modes of the neural networks, a problem that is not directly related to the formation of actions but has analogies with consciousness, understanding, and emotions.

#### *8.1. Consciousness*

In a hierarchical component model, all levels (except the bottom one) can be used in two modes: action or simulation (which corresponds to intuition and thinking) [14].

If the lower levels of the hierarchy successfully cope with achieving the set goal, then it is known what situation will arise, and there is time to model several options and compare them. This corresponds to proactive behavior, in which we purposefully achieve goals and prepare new goals and actions in advance for the situation we are striving to reach.

If the set goal is not achieved, then we are forced to return to reactive behavior, which is usually less rational than proactive. Then the upper levels must be used in an intuitive way to set new goals. Within the hierarchical component model framework, as an analog of consciousness, one can consider the mechanism for coordinating and distributing resources between the execution of thinking and intuitive activity. This function can be attributed to screens or to a special part of the system.

#### *8.2. Understanding*

Constructing successful actions (on the basis of either intuition or thinking) depends on how well the models (maps) correspond to the actually observed situation. There is an alternative: to use the existing models immediately, or to try to improve them first in various ways. To solve this mode selection problem, it is necessary to compare the complex input signal (which never repeats) with the available models of its components. The hierarchical model considered here makes this possible.

Component models at the lower levels are improved by observing real objects, and at the upper levels by simulating different options based on the existing lower-level models. The separation of functions by level is carried out by consciousness, which also uses an assessment of the degree of mismatch (understanding) between the observed situation and its component model.

An analogue of understanding is the ability to assess how well a component model corresponds to the observed state of real objects and phenomena. This assessment of the degree of correspondence (understanding) is used by the function of consciousness to select the operating modes of the system.
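One way to picture this assessment is as a residual between the observed complex signal and the sum of the components recalled from the maps. The sketch below is purely illustrative: the relative-residual measure, the threshold value, and the mode names are assumptions introduced here, not quantities defined in the article.

```python
import numpy as np

def mismatch(x, components):
    """Relative residual between the observed signal and the sum of its component models."""
    reconstruction = np.sum(components, axis=0)
    return float(np.linalg.norm(x - reconstruction) / (np.linalg.norm(x) + 1e-12))

def select_mode(x, components, threshold=0.1):
    """Hypothetical mode switch: use the models if they fit, otherwise improve them first."""
    # threshold is an assumed tuning parameter, not a value from the article.
    if mismatch(x, components) <= threshold:
        return "use_models"
    return "improve_models"

components = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
familiar = np.array([1.0, 1.0])      # fully explained by the stored components
unfamiliar = np.array([1.0, 3.0])    # leaves a large unexplained residual

mode_familiar = select_mode(familiar, components)
mode_unfamiliar = select_mode(unfamiliar, components)
```

A small residual corresponds to a well-understood situation in which the existing maps can be used at once, while a large residual suggests that the models should be refined before they are relied on.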

#### *8.3. Emotions*

In the process of modeling, it is necessary to reduce all estimates to a single indicator and choose the best (forecast) option. But the choice is not only between action options but also between modes of operation, of which there are quite a lot (intuitive and meaningful behavior, achieving goals, avoiding danger, search behavior, and others).

In the hierarchical component model, consciousness controls only the switching between intuition and thinking. Regulating the degree of activation of the other modes can be correlated with emotions. In animals, similar control is carried out by the hormonal background, which also allows the modes of the nervous system to be switched.

### **9. Conclusions**

The proposed ways of creating a hierarchical component system are aimed at automatically creating a model of the world sufficient for forming AGI-level actions. The main idea is to record the simple properties of the surrounding world through neural network mapping. The hierarchical component system is essentially aimed at representing complex signals in a form that allows mapping (as low-dimensional components).

The hierarchical structure of maps and screens makes it possible to implement the ideas of localization, decomposition, and linearization of the performed transformations. Maps work with components of complex signals. Information about the components can be accumulated by statistical methods. Screens process and form complex, non-repetitive signals, which are compared with the components obtained from the created maps.

The complex structure and several operation modes of a hierarchical component system make it possible to correlate some of its properties with philosophical and psychological ideas about intuition, consciousness, etc., by objectively analyzing the properties of a formalized structure that is open to scientific research. The proposed hierarchical component system was created not to argue with philosophers, but to achieve progress in the creation of AGI agents. The development of a hierarchical component system began several decades before the call of the "godfathers" of deep learning [15] to search for new ideas and approaches, but it can serve as a response to this call.

The main ideas and some algorithms of the hierarchical component system are still under development. To create AGI agents based on it, a large amount of additional research is needed. But BPE's path to commercial success was not easy either: about 25 years passed from its development to the start of solving practical problems. The modern AI and neural network industry makes it possible to travel this path much faster for a hierarchical component system. The prize for developing the proposed approaches can be an increase in the efficiency of parameter use in neural networks (a reduction in the number of parameters required) by thousands of times.

**Author Contributions:** Conceptualization, V.S. and S.S.; methodology, V.S. and S.S.; software, V.S. and S.S.; validation, V.S. and S.S.; formal analysis, V.S. and S.S.; investigation, V.S. and S.S.; resources, V.S. and S.S.; data curation, V.S. and S.S.; writing—original draft preparation, V.S. and S.S.; writing—review and editing, V.S. and S.S.; visualization, V.S. and S.S. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI St. Alban-Anlage 66 4052 Basel Switzerland www.mdpi.com

*Engineering Proceedings* Editorial Office E-mail: engproc@mdpi.com www.mdpi.com/journal/engproc


mdpi.com ISBN 978-3-0365-9289-3