Data set generation was inspired by previous work [16,43]. A difference is that, in those publications, the synthetic data consist of categories and the learning system should guess each item's features. This leads to a difference in the covariance structure (compare Figure 1 above and Figure 9 in Reference [16]), which is due to the fact that, in the present case, for example in the binary tree data structure, correlation patterns tie together all of the nodes in the tree. Each node is associated with a feature (recall, a random variable that is one entry of the random data vector), whilst class labels are assigned according to whether a data item matches some of the previously created data vectors, in the case of the binary tree. In the case of the independent clusters, the category is assigned according to which one of the independent clusters is selected; see below.
Appendix B.1. Binary Tree Data Set
The first example is the binary tree data generating structure. The root node is a random variable, which attains one among the values $\{0, 1\}$ with equal probability $1/2$. According to the outcome of this random variable, the children inherit the value according to some probabilistic decision rule, and in the same fashion the children of the children, and so forth down the dynasty; see Algorithm A1. The user sets the depth $D$ of the tree to be created. A data sample is the collection of the random variables that constitute the tree structure. An advantage of the PGM representation is that it eases graphical visualisation: data are often many-dimensional, that is, points in an $N$-dimensional space.
In this case, the collection of $M$ such vectors could be thought of as an ensemble of living species. The root node determines whether one item (pattern, data example) can move or not. The children of the root node determine, if it moves, does it swim?, or, if it does not move, does it have bark?, and so forth. The levels deeper in the tree structure bear more information about the data items. In the following, it is shown how the choice of a particular level results in the presence of more or fewer classes. If one considers the leaves level, then all of the nodes of a tree (i.e., all of the features in a pattern) must be equal for two items to belong to a given class.
On the other hand, if one considers a shallow level in the tree structure, the nodes which must be equal for two data vectors to belong to the same class are all the nodes up to the last node of the level considered, plus those of the next level. For convenience, the root node lies at level 1. To explain why these nodes are to be considered, the reader should refer to Figure A7. Assume that one wants to gather in a class all the samples that move and swim. Then the sample has to actually move, that is, the root node must have the value $1$, which means that its left child has value $1$ as well, and the right child and its dynasty inherit the $0$ value. Then we should also consider the outcome of the stochastic inheritance of the $1$ value from node 1 to nodes 3 and 4. Based on this trial, we know whether the $1$ value is attained by node 3 or 4. Node 3 encodes the answer to the question since the sample moves, does it swim? by means of the value $1$, which means yes. Hence, to say whether two samples belong to the same moving and swimming animals super-class, we should check the equality of the nodes at least up to nodes 3 and 4. In practice, it is easier to check level-wise; thus two samples belong to the same class if all the nodes up to those of the next level match. Next level means next with respect to the level of detail one wants to inspect. In this example, the level is 2. All the subsequent nodes could in principle attain different values, but this does not matter. If one wants to differentiate living things based on whether such items can move or not, what matters is the value attained by the root node. Whether two items are respectively a whale or a deer does not affect their belonging to the living thing that can move super-class. In contrast, if one has to differentiate living things that move based on whether such an item swims or not, then a further level of detail is needed. Such finer granularity is encoded by the values the nodes of the next levels attain. If the left child of the root node happens to inherit its $1$ value, that means that, other than being a moving living thing, that item does swim. Therefore, the second tree level encodes this subsequent level of detail. The more detail is embedded (the higher the level chosen, i.e., the farther from the root), the more possible classes the data examples may in principle belong to.
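As a minimal sketch of this level-wise membership rule (the function name and the NumPy array layout are conventions of this sketch, not code from the original implementation), two patterns share a class at granularity level $L$ exactly when their first $2^{L+1} - 1$ slots coincide, with the root stored at slot 0:

```python
import numpy as np

def same_class(x1: np.ndarray, x2: np.ndarray, level: int) -> bool:
    """Two patterns belong to the same class at the chosen granularity
    level iff all nodes up to (and including) the next tree level match;
    with the root at level 1, those are the first 2**(level+1) - 1 slots."""
    n_checked = 2 ** (level + 1) - 1
    return bool(np.array_equal(x1[:n_checked], x2[:n_checked]))
```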
The rationale behind such a data generator is first and foremost related to its transparency and the clarity of its statistical structure: there is no real-world consistency in such data, but in this fashion it is easy to perform classification on them. As explained below, the generation of one single pattern amounts to a diffusion of the value $1$ down the tree branches. In this way, one ends up with an $N$-dimensional binary array, in which many of the slots bear the value $0$. The values $1$, on the other hand, lie in correspondence with the slots associated with those nodes which happen to represent a positive answer to the distinction question associated with that node. Consistently with the discussed example: if the living thing encoded in such an $N$-dimensional vector is a moving thing (roughly speaking, an animal), then the root node has the value $1$, which in turn means that the 0th slot in the data vector has such value. If this is a water animal, it swims; then the left child of the root node has inherited the value $1$, so slot 1 in the data vector has the value $1$, and this implies that the right child of the root inherited the value $0$, so slot 2 of the data vector has the value $0$. Assume further that, other than swimming, this animal is not a mammal. Then the left child of the node labelled 3 has inherited the value $0$, and this same value is found in slot 7 of the data vector. It means that the value $1$ is inherited by the right child of node 3; then, in the final data vector, the value $1$ appears in slot 8.
Figure A7.
Binary tree data generating structure. Note that the tree data structure is efficiently and easily represented computationally as a linear array. The left and right children of a given node $i$ are $2i + 1$ and $2i + 2$ respectively, with $i = 0, \ldots, 2^{D-1} - 2$.
At the end of the day, the final data vector is made up of $0$s, except for the said slots, where the value $1$ ended up, encoding the positive outcome of the criteria associated with the respective nodes. The terminal (leaves) level could be imagined as the one-hot stratum, that is: all of the leaves attain the value $0$, except for one single leaf, where the $1$ got to settle, as a consequence of the (stochastic) outcome of all the aforementioned decisions. This lonely $1$ determines the final category in which the data vector fits, if one sets the leaves level to be the distinction granularity. In such a case, for two vectors to belong to the same class, all of the features must be equal. Otherwise, it could in principle be that a whale, echoing the previously discussed example, has a positive value for the root node, but in another data row it could be negative. This would mean that a whale is a non-moving living being that swims. So, in the label generation stage, one shall differentiate according to all the nodes of the level under consideration and all of their ancestries.
Appendix B.1.1. Single Pattern Generation
One pattern is the collection of all the node values of the array-represented tree (the entity formerly dubbed a data vector). As an example, decision rules are associated with the non-leaf nodes, intended to discriminate samples (e.g., does the object move?, which can be answered with yes or no, i.e., $1$ or $0$; this is the primal decision rule, that is, the axis along which one can set distinctions). The initial value of the root node is inherited and possibly flipped according to probabilistic decision rules with respect to a fixed probabilistic threshold, $\epsilon$.
In this spirit, referring again to Figure A7, the (non-leaf) nodes ranging from 0 to 6 encode decision rules, while the (leaf) nodes, indexed $7, \ldots, 14$, represent the final details about a sample, which are not important for the sake of its classification. The following criteria are implemented:
- (1)
The probabilistic threshold $\epsilon$ is fixed a priori. The smaller its value, the less variability in the data set.
- (2)
The root attains the values $\{0, 1\}$ with probability $1/2$.
- (3)
The root's children attain values $1$ or $0$ in a mutually exclusive fashion. The following convention is adopted: if the root node attains the value $1$, then the left child inherits the same value. Else, the left child attains the value $0$ and the right child is assigned the value $1$.
- (4)
From the third level (children of the root's children) on, the progeny of any node that has value $0$ also has to have value $0$. On the other hand, if one node has value $1$, its value is inherited (again mutually exclusively) by its children according to a probabilistic decision rule.
The aforementioned probabilistic decision rule is a Metropolis-like criterion: sample a random variable $u \sim \mathcal{U}(0, 1)$; then, given the probabilistic threshold $\epsilon$:
If $u > \epsilon$, the left child inherits the value $1$, and the right child, along with its progeny, assumes the opposite value $0$;
Else, it is the right child that assumes the value $1$, and the left child, with its progeny, the value $0$.
Appendix B.1.2. Complete Data Set
Repeating the above procedure $M$ times, one ends up with a data matrix $\mathbf{X}$; that is, each row of $\mathbf{X}$, $\mathbf{x}^{(i)}$, $i = 1, \ldots, M$, is one single $N$-dimensional data vector, in the same terminology as above, that is, an $N$-featured data vector (one pattern).
To complete the creation of a synthetic set of data, one needs the label associated with each one of the data items. Here the choice of the probabilistic threshold $\epsilon$ turns out to be crucial. The higher this quantity, the larger the total number of different classes the data examples may fall into. On the other hand, if $\epsilon$ is small enough, there is a low probability of flipping a feature value, and it is then more likely to observe the same configuration repeatedly.
To create the labels, encoded as one-hot activation vectors, one arbitrarily assumes the $M \times M$ identity matrix to be the labels matrix. Then the whole data set is explored in a row-wise fashion. Since the data set has a hierarchical structure, it is possible to select the granularity of the distinction made in order to separate patterns into different classes. It depends on the choice of a level in the binary tree: if the chosen level is high (far away from the root node), then one ends up with a fine-grained distinction. On the other hand, if the chosen level is low, the distinction is made according to super-classes, for example, whether a given object can move. The finer the granularity, the more detailed the distinction between patterns. Obviously, in the fine-grained case, the data set exhibits a greater number of distinct classes. See the discussion above.
By this observation, the label matrix is created according to the level of distinction chosen. The node values to be considered (i.e., the entries of each data vector) are all those that encode the values of the nodes up to the last one of the level following the one selected. Referring again to the tree in Figure A7, if it suffices to identify the move or not distinction, then one could safely check either the root node only or the root node together with its children, that is, nodes $0, 1, 2$, because the inheritance from the root node to its children is conventionally based on the root value solely. But if one wants to consider whether an object can move along with the further if it moves, does it swim? and if it does not move, does it have bark? distinctions, then one should also consider the children nodes of nodes 1 and 2, that is, the answers to the decision rules asked by those nodes. Hence, to determine whether two data items fall in the same category, we check that all of the first $2^{L+1} - 1 = 7$ nodes have the same value. Here $L = 2$; in fact, we consider nodes $0, 1, \ldots, 6$.
By thus doing, the data set is generated. The matrices $\mathbf{X}$ and $\mathbf{Y}$ are saved to a proper data structure which can be easily managed by the program that implements the artificial neural network described in the main text.
Algorithm A1 Binary tree. Single feature generation
1. Compute $N = 2^D - 1$, $D$ being the user-set tree depth. $M$ is a free parameter
2. tree $= (0, \ldots, 0)$, a linear array of $N$ zeros
3. Define a small $\epsilon$ as probabilistic threshold
4. Value of root $\sim$ uniform over $\{0, 1\}$
5. if Root node has value $1$ then
6.  The left child inherits the value $1$
7.  And the right child inherits the value $0$
8. else
9.  The left child inherits the value $0$
10.  And the right child inherits the value $1$
11. end if
12. for All the other internal nodes, indexed $i = 1, \ldots, 2^{D-1} - 2$ do
13.  if Node $i$ has value $1$ then
14.   Sample $u \sim \mathcal{U}(0, 1)$
15.   if $u > \epsilon$ then
16.    Left child of $i$ $= 1$; Right child of $i$ $= 0$
17.   else
18.    Left child of $i$ $= 0$; Right child of $i$ $= 1$
19.   end if
20.  else
21.   Both the children of $i$ inherit its value $0$
22.  end if
23. end for
24. $N$ values generated: one pattern $\mathbf{x} \in \{0, 1\}^N$
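A minimal runnable sketch of this generation kernel follows (names are illustrative, and the comparison $u > \epsilon$ reflects the convention reconstructed above, not necessarily the authors' exact code):

```python
import numpy as np

def generate_pattern(depth: int, eps: float, rng: np.random.Generator) -> np.ndarray:
    """One binary-tree pattern: the value 1 diffuses from the root down one
    branch per node; children of node i sit at slots 2*i + 1 and 2*i + 2."""
    n = 2 ** depth - 1
    tree = np.zeros(n, dtype=int)
    tree[0] = rng.integers(0, 2)  # root: 0 or 1 with probability 1/2
    # root's children: deterministic, mutually exclusive inheritance
    tree[1], tree[2] = (1, 0) if tree[0] == 1 else (0, 1)
    # remaining internal nodes: probabilistic, mutually exclusive inheritance
    for i in range(1, 2 ** (depth - 1) - 1):
        if tree[i] == 1:
            u = rng.uniform()
            tree[2 * i + 1], tree[2 * i + 2] = (1, 0) if u > eps else (0, 1)
        # nodes with value 0 leave their (zero-initialised) children at 0
    return tree
```

For instance, `generate_pattern(4, 0.1, np.random.default_rng(0))` returns one 15-dimensional binary pattern with a single $1$ at the leaves level.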
Algorithm A2 Binary tree. One-hot activation vectors, that is, labels
1. Choose level of distinction $L$
2. $\mathbf{Y} = \mathbb{1}_{M \times M}$, the identity matrix
3. for $i = 1, \ldots, M$ do
4.  for $j = i + 1, \ldots, M$ do
5.   if the first $2^{L+1} - 1$ entries of $\mathbf{x}^{(j)}$ equal those of $\mathbf{x}^{(i)}$ then
6.    $\mathbf{y}^{(j)} \leftarrow \mathbf{y}^{(i)}$
7.   end if
8.  end for
9. end for
10. for each column $i$ of $\mathbf{Y}$ do
11.  if column $i$ contains no nonzero entry then
12.   Eliminate column $i$ of $\mathbf{Y}$
13.  end if
14. end for
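The label construction of Algorithm A2 can be sketched in the same vein (again a minimal illustration; `make_labels` and the in-place row copy are conventions of this sketch, not the original code):

```python
import numpy as np

def make_labels(X: np.ndarray, level: int) -> np.ndarray:
    """One-hot labels at the chosen granularity level: start from the
    M x M identity, give matching patterns (first 2**(level+1) - 1
    features equal) the same one-hot row, then drop unused columns."""
    m = X.shape[0]
    n_checked = 2 ** (level + 1) - 1
    Y = np.eye(m, dtype=int)
    for i in range(m):
        for j in range(i + 1, m):
            if np.array_equal(X[j, :n_checked], X[i, :n_checked]):
                Y[j] = Y[i]
    return Y[:, Y.any(axis=0)]  # keep only columns still labelling something
```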
Appendix B.2. Independent Clusters Data Set
The generation of the second data set is performed as follows: generate some clouds of points distributed according to bivariate Gaussian distributions, with means spread apart and covariances sufficiently small, in such a way that the points of different groups do not overlap with one another. The 2-dimensionality has, of course, nothing to do with the number of features, which, as said before, is the total number of points generated, that is, the number of nodes of the probabilistic graph representation. This 2-dimensionality serves solely to draw the PGM and subsequently to partition the graph.
Once the points are generated, they are turned into a fully connected graph, that is, edges are created between each pair of nodes. In the spirit of the simulated annealing algorithm, it is here imagined that such a fully connected graph is a sort of mineral structure, and the temperature is increased to simulate a melting process that destroys some of the over-abundant edges, according to some metric, for example, the distance between points. For this reason, the 2-dimensional representation comes in handy: distance is simply the norm of the vector from one node to another. The distance for which an edge is removed is temperature-dependent: the higher the temperature, the shorter the maximum edge length allowed. At the end of this simulated melting process, the graph is expected to exhibit some independent components, provided the melting schedule is properly set. Moreover, these independent groups are not fully connected within themselves. The melting schedule is designed in a way to remove some of these intra-edges. This simulates the random variables of each group not depending on all of the others in the same cloud. Note that, unlike what is exposed in Reference [19], in this melting simulation there is not, strictly speaking, an optimization perspective, inasmuch as what matters is the removal of some edges. The physics of the procedure could be improved.
Algorithm A3 Independent clusters. Simulated melting to partition the graph
1. Choose the number of classes $C$
2. Set the means $\boldsymbol{\mu}_c$ and the covariances $\boldsymbol{\Sigma}_c$, $c = 1, \ldots, C$
3. Generate points $\mathbf{p}_i \in \mathbb{R}^2$ s.t. $\mathbf{p}_i \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$, $i = 1, \ldots, N$
4. Include the indexes of the points generated in a list, which is the set of the vertices of the graph
5. Fully connect the vertices to form a fully connected graph and group the set of the vertices and the set of the edges in the graph data structure, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$.
6. Note that since the 2-dimensional coordinates will be useful, $\mathcal{V}$ is a dictionary of keys (node indexes $i$) and values (lists with the point coordinates, $(x_i, y_i)$).
7. for $T$ increasing do
8.  for All the edges $e \in \mathcal{E}$ do
9.   if Length of edge $e > 1/T$ (for example) then
10.    Remove edge $e$
11.   end if
12.  end for
13. end for
14. Plot the remaining edges and check that only independent fully connected components have survived.
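A compact sketch of the whole melting procedure (the cluster means, the schedule, and the $1/T$ cutoff are illustrative choices, the last matching the "for example" criterion in step 9):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Gaussian clouds with well-separated means and small covariances
means = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 5.0]])
points = np.vstack([rng.normal(mu, 0.3, size=(5, 2)) for mu in means])

# fully connected graph on the point indexes
edges = set(itertools.combinations(range(len(points)), 2))

# melting: as T grows, the maximum allowed edge length 1/T shrinks
for T in (0.2, 0.5, 1.0):
    for e in list(edges):
        i, j = e
        if np.linalg.norm(points[i] - points[j]) > 1.0 / T:
            edges.discard(e)

# ideally only intra-cluster edges survive; plot to check, as in step 14
```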
Algorithm A4 Independent clusters. Single pattern generation
1. Here $i$ indexes a single random variable. This kernel is used as many times as the number of samples the user wants to generate. $\mathbf{x}$ is the whole data item, initialised with each slot set to $0$. Note: in the data set actually generated, the values of the nodes are set to their topological orders, with no ancestral sampling implemented.
2. Set $\mathbf{x} = (0, \ldots, 0)$
3. Sample $L \sim \mathcal{U}\{1, \ldots, C\}$
4. for all the vertices $i$ in cluster $L$ do
5.  if Topological order of $i$ is $1$ then
6.   $x_i = 1$
7.  else
8.   $x_i =$ Topological order of $i$
9.  end if
10. end for
11. One pattern $\mathbf{x}$ generated
Appendix B.2.1. Single Pattern Generation
Once the independent clusters are formed, each of the nodes is assigned a topological ordering so as to perform the ancestral sampling [11]. Since the graphs are directed, in the edges data structure created, each edge is in the form of a couple $(i, j)$, that is, an edge from node $i$ to node $j$. Then, if one node appears only in the left slot of such a representation, it has topological order 1, in that no edge ends at that node. Conversely, each node that appears on the right has at least one ancestor. For each edge, then, each right node is saved to a proper data structure, and track is kept of the ancestors of each node. In this way, it is possible both to assign the topological order and to keep a list of all the ancestors. Such a list will be useful at the sampling stage.
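A sketch of this bookkeeping (the function name and return types are assumptions of this sketch; it yields each node's order and its direct ancestors):

```python
from collections import defaultdict

def topological_orders(edges):
    """From a directed edge list of couples (i, j), assign each node a
    topological order (1 = no incoming edge) and collect its direct
    ancestors, i.e., the left nodes of the edges pointing at it."""
    parents = defaultdict(set)
    nodes = set()
    for i, j in edges:
        nodes.update((i, j))
        parents[j].add(i)
    order, level = {}, 1
    frontier = {v for v in nodes if not parents[v]}  # sources: order 1
    while frontier:
        for v in frontier:
            order[v] = level
        level += 1
        frontier = {v for v in nodes if v not in order
                    and all(p in order for p in parents[v])}
    return order, dict(parents)
```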
As a zeroth-order model, however, it is done as follows: a data item is initialised with all the feature values set to $0$. Since each vertex in the graph encodes a feature, and the belongingness of each vertex to a group is a piece of information known from the point generation stage, an integer ranging from 1 to the number of classes is sampled uniformly. The nodes corresponding to this label number are assigned different values, according to their topological order. This is trivial to do, since for each vertex belonging to the selected group one simply puts the topological order of that vertex in the corresponding slot of the data vector.
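A sketch of this zeroth-order kernel (argument names such as `cluster_of`, a vertex-to-cluster map, are conventions of this sketch):

```python
import numpy as np

def generate_cluster_pattern(order: dict, cluster_of: dict, n_classes: int,
                             n_features: int, rng: np.random.Generator) -> np.ndarray:
    """Pick one cluster uniformly at random and write, in the slots of its
    vertices, their topological orders; every other feature stays 0."""
    x = np.zeros(n_features)
    label = rng.integers(1, n_classes + 1)  # uniform over 1..C
    for v, c in cluster_of.items():
        if c == label:
            x[v] = order[v]
    return x
```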
A further improvement could rather be the following approach: once a label is sampled, one could sample from the distribution $p(x_i)$ for the vertices with topological order 1 in that cluster. The values associated with nodes having topological order 2 are still sampled from that distribution, but must be conditioned on the values sampled for their ancestors (nodes of order 1), that is, $p(x_i \mid x_{\mathrm{anc}(i)})$. This is explained by recalling the very purpose of graphical models: to show (even graphically) the causality of the random variables involved. The distribution could be chosen to be a Gaussian with mean zero and variance proportional to the degree of that node. A Gaussian is believed to fit since nearby features are expected to have similar values ([43]; but differently from that work, here one does not generate the whole features vector, hence one does not sample from the multivariate Gaussian having zero mean and covariance dependent on the inverse of the Laplacian matrix of the graph. Here it suffices to sample one value for a single node, and hence the degree of a node could be a good compromise, such a quantity being one of the ingredients of the Laplacian).
To sample from the conditional $p(x_i \mid x_{\mathrm{anc}(i)})$, the following rationale may be implemented: the distribution refers to all the nodes up to $i$, so it can be viewed as a multivariate distribution. A value is then sampled from that multivariate distribution, but keeping constant the values of the random variables already sampled. As an example, assume that node 3 of cluster 1 is to be assigned the value $x_3$ and that $\mathrm{anc}(3) = \{1, 2\}$. Then the distribution to sample from is $p(x_1, x_2, x_3) = \mathcal{N}(\mathbf{x}; \mathbf{0}, \boldsymbol{\Sigma})$ with $\boldsymbol{\Sigma} = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2)$, $\sigma_i^2 \propto d_i$, $d_i$ being the degree of node $i$. The above formula may be broken into products, owing to the fact that the covariance matrix is diagonal, that is, $p(x_1, x_2, x_3) = \mathcal{N}(x_1; 0, \sigma_1^2)\,\mathcal{N}(x_2; 0, \sigma_2^2)\,\mathcal{N}(x_3; 0, \sigma_3^2)$, the first two factors being the values that the Gaussian probability density function attains at the values already sampled for the ancestors $x_1$ and $x_2$, and the third factor the value sampled from the Gaussian having zero mean and variance $\sigma_3^2$.
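Since the covariance matrix is diagonal, conditioning on the already-sampled ancestors leaves just the marginal of the new node, so the sampling step reduces to one univariate draw (a minimal sketch; the proportionality constant `c` is an assumption):

```python
import numpy as np

def sample_node_value(degree: int, rng: np.random.Generator, c: float = 1.0) -> float:
    """Draw x_i from a zero-mean Gaussian whose variance is proportional
    to the degree d_i of node i; with a diagonal covariance this equals
    sampling from the conditional given the ancestors' values."""
    return rng.normal(0.0, np.sqrt(c * degree))
```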
Appendix B.2.2. Complete Data Set
This procedure is repeated as many times as specified by the user. Here a good number is, as in the case of the binary tree, $M$ items. In the complete data set, hence, each pattern has features in which the only values different from $0$ lie in correspondence with the indexes of the data array that match the nodes of the graph belonging to the category given by the label of that pattern. Labels are again one-hot vectors. For example, assume that the first cluster is selected. If this first cluster comprises the vertices ranging from 1 to 5, where node 1 has order 1, nodes 2 and 3 have order 2, node 4 has order 3 and node 5 has order 4, then that data item has values $(1, 2, 2, 3, 4, 0, \ldots, 0)$ and the corresponding label is $(1, 0, \ldots, 0)$.
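Putting the pieces together, a sketch of the complete data set assembly under the zeroth-order model (it inlines the kernel above so that the sampled label can also fill the one-hot row; all names are this sketch's assumptions):

```python
import numpy as np

def build_dataset(m: int, order: dict, cluster_of: dict, n_classes: int,
                  n_features: int, seed: int = 0):
    """M patterns whose nonzero slots are the topological orders of the
    selected cluster's vertices, with the matching one-hot labels."""
    rng = np.random.default_rng(seed)
    X = np.zeros((m, n_features))
    Y = np.zeros((m, n_classes), dtype=int)
    for k in range(m):
        label = rng.integers(1, n_classes + 1)
        for v, c in cluster_of.items():
            if c == label:
                X[k, v] = order[v]
        Y[k, label - 1] = 1
    return X, Y
```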