3.1. Shape Parameters Analysis
From the definitions of the aforementioned kernels, it is evident that each of them depends on some parameters, and it is not clear a priori which values should be assigned to them. The cross-validation phase (CV phase, for short) tries to answer this question: the user defines a set of candidate values for the parameters and, for every admissible choice, solves the classification task and measures the goodness of the obtained results. It is well known in the RBF interpolation literature [25] that the shape parameters have to be chosen after a process similar to the CV phase. In the beginning, the user chooses some values for each parameter and then, varying them, checks how the condition number and the interpolation error change according to those values. The trade-off principle suggests considering values for which the condition number is not huge (stability) and the interpolation error is small (accuracy). In the context of classification, we have to replace the condition number of the interpolation matrix and the interpolation error with analogous quantities: the condition number of the Gram matrix and the accuracy of the classifier. To obtain a good classifier, it is desirable to have a small condition number of the Gram matrix and an accuracy as close to 1 as possible. The aim here was to run such an analysis for the kernels presented in this paper.
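To make the CV phase concrete, the following Python sketch scans a set of candidate values for a single shape parameter and records, for each value, the condition number of the training Gram matrix and the test accuracy. The helper `kernel_matrix` is a hypothetical placeholder for any of the persistence kernels analyzed below; this is a minimal illustration, not the exact protocol used in our experiments.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def cv_scan(diagrams, labels, kernel_matrix, sigmas, seed=0):
    """For each candidate parameter, record the condition number of the
    training Gram matrix (stability) and the test accuracy (goodness)."""
    tr, te, y_tr, y_te = train_test_split(
        diagrams, labels, test_size=0.3, random_state=seed)
    results = []
    for sigma in sigmas:
        K_tr = kernel_matrix(tr, tr, sigma)   # training Gram matrix
        K_te = kernel_matrix(te, tr, sigma)   # test-vs-train kernel values
        cond = np.linalg.cond(K_tr)           # conditioning indicator
        clf = SVC(kernel="precomputed").fit(K_tr, y_tr)
        acc = accuracy_score(y_te, clf.predict(K_te))
        results.append((sigma, cond, acc))
    return results
```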
The PSSK has only one parameter to tune: $\sigma$. Typically, users consider a small set of candidate values for it. We ran the CV phase for different shuffles of a dataset and plotted the results in terms of the condition number of the Gram matrix related to the training samples and of the accuracy. We ran such tests on some of the datasets cited in the following; the results were similar in each case, so we decided to report those for the SHREC14 dataset.
From Figure 4, it is evident that large values of $\sigma$ result in an unstable Gram matrix and lower accuracy. Hence, in what follows, we take into account only small values of $\sigma$.
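For reference, the PSSK admits a direct, quadratic-time implementation: each point of one diagram is compared with each point of the other and with its mirror image across the diagonal. The sketch below follows the standard closed form of the kernel, with its usual $1/(8\pi\sigma)$ normalization; it is an illustration of the definition, not necessarily the implementation used in our experiments.

```python
import numpy as np

def pssk(F, G, sigma):
    """Persistence scale-space kernel between two diagrams F, G,
    given as (n, 2) arrays of (birth, death) points."""
    k = 0.0
    for p in F:
        for q in G:
            qbar = q[::-1]  # mirror q across the diagonal: (b, d) -> (d, b)
            k += np.exp(-np.sum((p - q) ** 2) / (8 * sigma)) \
               - np.exp(-np.sum((p - qbar) ** 2) / (8 * sigma))
    return k / (8 * np.pi * sigma)
```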
The PWGK is the kernel with the largest number of parameters to tune; therefore, it was not obvious which sets of values to take into account. We chose reasonable starting sets for each of them. Due to the large number of parameters, we first ran some experiments varying some of them with the others fixed, and then we reversed the roles.
We report in Figure 5 only a plot obtained with two of the parameters held fixed, because it highlights how high values of the remaining one had to be excluded. We found this behavior for different values of the fixed parameters and on various datasets; the plot shown refers to the MUTAG dataset. Therefore, we restricted the ranges over which the parameters vary accordingly. Unfortunately, there was no other evidence that could guide the choices, except for one parameter, for which large values always gave bad accuracy, as one can see below in the case of MUTAG with the shortest path distance.
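Since the original symbols were lost in extraction, the following sketch uses the parameter names $\rho$, $C$, and $p$ from the usual formulation of the PWGK, in which each diagram point is weighted by $\arctan(C\,\mathrm{pers}(x)^p)$ before a Gaussian kernel of bandwidth $\rho$ is applied between points; treat it as a minimal illustration rather than the authors' exact implementation.

```python
import numpy as np

def pwgk(D1, D2, rho, C, p):
    """Linear persistence weighted Gaussian kernel: inner product of the
    weighted kernel mean embeddings of two diagrams D1, D2 ((n, 2) arrays),
    with arctan weight w(x) = arctan(C * pers(x)**p)."""
    w1 = np.arctan(C * (D1[:, 1] - D1[:, 0]) ** p)   # weights for D1
    w2 = np.arctan(C * (D2[:, 1] - D2[:, 0]) ** p)   # weights for D2
    # pairwise squared distances between the points of the two diagrams
    sq = np.sum((D1[:, None, :] - D2[None, :, :]) ** 2, axis=-1)
    return float(w1 @ np.exp(-sq / (2 * rho ** 2)) @ w2)
```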
In the case of the SWK, there is only one parameter, $\sigma$. In [11], the authors proposed to flatten the Gram matrix of the training samples into a vector and to consider its first decile, its median, and its last decile; these three values were then multiplied by a fixed set of factors to obtain the candidate set.
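A minimal sketch of this bandwidth heuristic follows; since the exact multiplying factors used in [11] are not reproduced above, they are passed in as an argument.

```python
import numpy as np

def swk_bandwidth_candidates(D, factors):
    """Candidate bandwidths following the heuristic recalled above:
    take the first decile, the median, and the last decile of the
    flattened pairwise matrix D computed on the training samples,
    then multiply each of them by the given factors."""
    flat = D[np.triu_indices_from(D, k=1)]     # off-diagonal entries
    base = np.percentile(flat, [10, 50, 90])   # deciles and median
    return np.unique(np.outer(base, factors).ravel())
```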
For our analysis, we decided instead to study the behavior of this kernel by considering the same set of candidate values independently of the specific dataset.
We ran tests on some datasets, and the plot related to the DHFR dataset clearly revealed that large values of $\sigma$ were to be excluded, as suggested by Figure 6. We therefore decided to restrict $\sigma$ to small values only.
The PFK has two parameters: the variance $\sigma$ and $t$. In [12], the authors exhibited the procedure to follow in order to obtain the corresponding sets of values, which shows that the choice of $t$ depends on $\sigma$. Our aim in this paper was to carry out an analysis that was dataset-independent and thus strictly connected only to the definition of the kernel itself. First, we took different values of $\sigma$ and $t$ and plotted the corresponding accuracies, here in the case of MUTAG with the shortest path distance, although the same behavior also holds for other datasets. The condition numbers were high for every choice of the parameters, so reporting them here would have been meaningless and we omit them. From Figure 7, it is evident that it is convenient to set $\sigma$ lower than or equal to 10, while $t$ should be set larger than or equal to 0.1. Thus, in what follows, we took into account values of $\sigma$ not larger than 10 and values of $t$ not smaller than 0.1.
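The sketch below illustrates the structure of the PFK under simplifying assumptions: the diagrams are smoothed into discrete probability vectors supported on the union of their points (the diagonal projections used in [12] are omitted for brevity), the Fisher information distance is the arccosine of the Bhattacharyya coefficient, and the kernel is its exponential with rate $t$.

```python
import numpy as np

def pfk(D1, D2, sigma, t):
    """Simplified persistence Fisher kernel between two diagrams
    D1, D2 ((n, 2) arrays); diagonal projections are omitted."""
    theta = np.vstack([D1, D2])                     # common support
    def density(D):
        sq = np.sum((theta[:, None, :] - D[None, :, :]) ** 2, axis=-1)
        rho = np.exp(-sq / (2 * sigma ** 2)).sum(axis=1)
        return rho / rho.sum()                      # normalize to a pmf
    bc = np.sqrt(density(D1) * density(D2)).sum()   # Bhattacharyya coeff.
    d_fim = np.arccos(np.clip(bc, 0.0, 1.0))        # Fisher info. distance
    return np.exp(-t * d_fim)
```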
In the case of the PI, we considered a reasonable set of values for its parameter $\sigma$. The results, related to BZR with the shortest path distance, are shown in Figure 8. As for the previous kernels, the accuracy was better for small values of $\sigma$; for this reason, we restricted $\sigma$ to small values.
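A minimal persistence image sketch follows, assuming the common choices of a Gaussian spread $\sigma$ and a weight linear in persistence; the grid resolution is a hypothetical default, not a value taken from our experiments.

```python
import numpy as np

def persistence_image(D, sigma, resolution=20):
    """Map the diagram D ((n, 2) array of (birth, death) points) to
    (birth, persistence) coordinates, spread a Gaussian of width sigma
    around each point, weight linearly by persistence, sample on a grid."""
    pts = np.column_stack([D[:, 0], D[:, 1] - D[:, 0]])
    xs = np.linspace(pts[:, 0].min(), pts[:, 0].max(), resolution)
    ys = np.linspace(0.0, pts[:, 1].max(), resolution)
    gx, gy = np.meshgrid(xs, ys)
    img = np.zeros_like(gx)
    for (b, pers) in pts:
        img += pers * np.exp(-((gx - b) ** 2 + (gy - pers) ** 2)
                             / (2 * sigma ** 2))
    return img
```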
3.5. Graphs
In many different contexts, from medicine to chemistry, data can have the structure of graphs. A graph is a pair $G=(V,E)$, where $V$ is the set of vertices and $E$ is the set of edges. Graph classification is the task of attaching a label/class to each graph as a whole. In order to compute the persistent features, we needed to build a filtration. In the context of graphs, as in other cases, there are different possible constructions; see, for example, [38].
We considered the Vietoris–Rips filtration where, starting from the set of vertices, at each step we add the edges whose weights are less than or equal to the current value $\varepsilon$. This turned out to be the most common choice, and the software available online allowed us to build it after providing the corresponding adjacency matrix. In our experiments, we considered only undirected graphs but, as in [38], building a filtration is also possible for directed graphs. Once the kind of filtration is defined, one again needs to choose the corresponding edge weights. We decided to take into account first the shortest path distance and then the Jaccard index, as, for example, in [14].
Given two vertices $u$ and $v$, the shortest path distance is defined as the minimum number of distinct edges that one has to traverse to go from $u$ to $v$ (or vice versa, since the graphs here are undirected). In graph theory, this is a widely used metric.
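In practice, this edge-count distance can be computed for all vertex pairs directly from the adjacency matrix, for instance with SciPy:

```python
from scipy.sparse.csgraph import shortest_path

def shortest_path_weights(adjacency):
    """Matrix of shortest path distances (in number of edges) between
    all pairs of vertices of an undirected, unweighted graph."""
    # `unweighted=True` counts edges, matching the definition above
    return shortest_path(adjacency, method="D", directed=False,
                         unweighted=True)
```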
The Jaccard index is a good measure of edge similarity. Given an edge $(u,v)$, the corresponding Jaccard index is computed as
$$J(u,v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|},$$
where $N(u)$ is the set of neighbors of $u$ in the graph. This metric recovers the local information of the nodes, in the sense that two nodes are considered similar if their neighbor sets are similar.
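For illustration, the Jaccard index of every edge can be computed with networkx's built-in routine, which implements exactly the formula above:

```python
import networkx as nx

def jaccard_edge_weights(G):
    """Assign to each edge (u, v) the Jaccard index of its endpoints,
    |N(u) & N(v)| / |N(u) | N(v)|."""
    return {(u, v): j for u, v, j in nx.jaccard_coefficient(G, G.edges())}
```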
In both cases, we considered the sub-level set filtration and we collected both zero- and one-dimensional persistent features.
We took six such datasets from among the graph benchmarks, all undirected, as follows:
MUTAG: a collection of nitroaromatic compounds, the goal being to predict their mutagenicity on Salmonella typhimurium;
PTC: a collection of chemical compounds represented as graphs, labeled according to their carcinogenicity in rats;
BZR: a collection of chemical compounds that one has to classify as active or inactive;
ENZYMES: a dataset of protein tertiary structures obtained from the BRENDA enzyme database; the aim is to classify each graph into one of six enzyme classes;
DHFR: a collection of chemical compounds that one has to classify as active or inactive;
PROTEINS: in each graph, nodes represent the secondary structure elements; the task is to predict whether or not a protein is an enzyme.
The properties of the above datasets are summarized in Table 3, where the IR index is the so-called Imbalanced Ratio (IR), which measures the imbalance of a dataset: it is defined as the sample size of the majority class divided by the sample size of the minority class (for example, a dataset with 100 samples in one class and 50 in the other has IR = 2).
The computations of the adjacency matrices and of the PDs were carried out using the functions implemented in giotto-tda.
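A minimal sketch of this step with giotto-tda follows, assuming the per-graph edge-weight matrices (shortest path distances or Jaccard-based weights) have already been assembled into an array `distance_matrices` of shape (n_graphs, n_vertices, n_vertices):

```python
from gtda.homology import VietorisRipsPersistence

# Vietoris-Rips sub-level filtration on precomputed edge weights,
# collecting zero- and one-dimensional persistent features
vr = VietorisRipsPersistence(metric="precomputed",
                             homology_dimensions=[0, 1])
diagrams = vr.fit_transform(distance_matrices)   # one diagram per graph
```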
The performances achieved with the two edge weights are reported in Table 4 and Table 5.
From these results, two conclusions can be drawn. The first is that, as expected, the goodness of the classifier is strictly related to the particular filtration used for the computation of the persistent features. The second is that the SWK and the PFK seem to work slightly better than the other kernels: in the case of the shortest path distance (Table 4), the SWK is to be preferred, while the PFK seems to work better in the case of the Jaccard index (Table 5). In the case of PROTEINS, the PWGK provides the best Balanced Accuracy with both edge weights.