1. Introduction
Several phenomena and concepts in real-life applications are represented by angular data or, as they are referred to in the literature, directional data. Examples of data that may be regarded as directional include temporal periods (e.g., time of day, week, month, year, etc.), compass directions, dihedral angles in molecules, orientations, rotations, and so on. The application fields include the study of wind direction as analyzed by meteorologists and magnetic fields in rocks studied by geologists.
The fact that zero degrees and 360 degrees are identical angles, so that, for example, 180 degrees is not a sensible mean of 2 degrees and 358 degrees, provides one illustration that special methods are required for the analysis of directional data.
Directional data have been traditionally modeled with a wrapped probability density function, like a wrapped normal distribution, wrapped Cauchy distribution, or von Mises circular distribution. Measures of location and spread, like mean and variance, have been conveniently adapted to circular data.
The design of pattern recognition systems fed with directional data has either relied completely on these probabilistic models or just ignored the circular nature of the data.
In this work, we formulate for the first time a non-probabilistic model for directional data classification. We adopt the max-margin principle and the hinge loss, yielding a variant of the support vector machine model.
The theoretical properties of the model analyzed in the paper, together with the robust behavior shown experimentally, reveal the potential of the proposed method.
2. State-of-the-Art
Classical methods for classifier design include probabilistic and non-probabilistic approaches. Probabilistic approaches come in two flavors: generative modeling of the joint distribution and discriminative modeling of the conditional probabilities of the classes given the input. Non-probabilistic approaches directly model the boundaries of the input space or, equivalently, model the partition of the input space into decision regions.
Directional data classifiers have been typically approached [1,2,3] with generative models based on the von Mises distribution. The von Mises probability density function for the angle $x$ is given by:

$$p(x \mid \mu, \kappa) = \frac{e^{\kappa \cos(x - \mu)}}{2\pi I_0(\kappa)},$$

where $I_0(\kappa)$ is the modified Bessel function of order zero, $\kappa \geq 0$ is the concentration parameter, and $\mu$ the mean angle.
Analyzing the posterior probability of the classes, $p(y \mid x)$, under the von Mises model for the likelihood, it is trivial to conclude that:

$$p(y = 1 \mid x) = \frac{1}{1 + e^{-(\kappa \cos(x - \mu) + b)}}, \qquad (3)$$

where $\kappa$, $\mu$, and $b$ are functions of the mean and concentration parameters of the class-conditional distributions. Recently [4], a directional logistic regression has been proposed that fits Model Equation (3) directly from data. In there, the multidimensional setting was naturally extended to:

$$p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\left(b + \sum_i \kappa_i \cos(x_i - \mu_i)\right)}}, \qquad (4)$$

where $\mathbf{x} = (x_1, \ldots, x_m)$ and $x_i$ is the $i$th element in vector $\mathbf{x}$.
Noting that:

$$\kappa \cos(x - \mu) = a \cos x + c \sin x, \qquad (5)$$

where $a = \kappa \cos \mu$ and $c = \kappa \sin \mu$ are obtained from $\kappa$ and $\mu$, the directional logistic regression model is favorably written as:

$$p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\left(b + \sum_i (a_i \cos x_i + c_i \sin x_i)\right)}},$$

enabling the learning task to be solved with conventional logistic regression, by first applying a feature transformation where each input feature $x_i$ yields two features, $\cos x_i$ and $\sin x_i$.
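This transformation is straightforward to sketch in code. The example below is illustrative only: it pairs the $(\cos x, \sin x)$ feature map with a minimal gradient-descent logistic regression of our own (all names and hyperparameters are ours, not from the original works):

```python
import numpy as np

def directional_features(x):
    """Map each angle x_i to the pair (cos x_i, sin x_i)."""
    return np.column_stack([np.cos(x), np.sin(x)])

def fit_logistic(F, y, lr=0.5, iters=2000):
    """Minimal gradient-descent logistic regression; y takes values in {0, 1}."""
    Fb = np.hstack([F, np.ones((len(F), 1))])   # append a bias column
    w = np.zeros(Fb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Fb @ w))       # sigmoid of the linear score
        w -= lr * Fb.T @ (p - y) / len(y)       # gradient of the cross-entropy
    return w

def predict(w, x):
    Fb = np.hstack([directional_features(x), np.ones((len(x), 1))])
    return (Fb @ w > 0).astype(int)

# Toy circular data: positives concentrated around 0 rad, negatives around pi.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.4, 50), rng.normal(np.pi, 0.4, 50)]) % (2 * np.pi)
y = np.concatenate([np.ones(50), np.zeros(50)])

w = fit_logistic(directional_features(x), y)
accuracy = np.mean(predict(w, x) == y)
```

Despite the angles wrapping around $2\pi$, the model separates the two classes easily, since the pair of derived features places the data on the unit circle where a linear boundary suffices.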
3. Support Vector Machine
For an intended output $y \in \{-1, +1\}$ and a classifier score $f(x; \mathbf{w})$, the hinge loss of the prediction is defined as $\ell(y, f(x; \mathbf{w})) = \max(0, 1 - y\, f(x; \mathbf{w}))$, where $x$ is the input and $\mathbf{w}$ is the vector of parameters of the model. The Support Vector Machine (SVM) model solves the problem:

$$\min_{\mathbf{w}} \; \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{N} \sum_{n=1}^{N} \max\big(0, 1 - y_n f(x_n; \mathbf{w})\big), \qquad (7)$$

where $\lambda$ is a regularization parameter. In standard $\mathbb{R}^d$ spaces, the model $f$ is set to the affine form, $f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x} + b$, and the previous equation can be equivalently written as:

$$\min_{\mathbf{w}, b} \; \frac{\lambda}{2} \|\mathbf{w}\|^2 + \frac{1}{N} \sum_{n=1}^{N} \max\big(0, 1 - y_n (\mathbf{w}^\top \mathbf{x}_n + b)\big).$$

In the trivial unidimensional space, the model boils down to $f(x; w, b) = wx + b$, and the partition of the input space is defined by a single threshold; see Figure 1a.
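As a concrete illustration of the hinge-loss formulation in the unidimensional case, the following sketch runs plain sub-gradient descent on the affine model; the learning rate and iteration count are arbitrary choices of ours:

```python
import numpy as np

def hinge_svm_1d(x, y, lam=0.01, lr=0.1, iters=2000):
    """Sub-gradient descent on lam/2 * w^2 + mean(max(0, 1 - y*(w*x + b)))."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(iters):
        active = y * (w * x + b) < 1              # points violating the margin
        w -= lr * (lam * w - np.sum(y[active] * x[active]) / n)
        b -= lr * (-np.sum(y[active]) / n)
    return w, b

# A single threshold near x = 0.5 separates the two classes (labels in {-1, +1}).
x = np.array([0.0, 0.1, 0.2, 0.3, 0.7, 0.8, 0.9, 1.0])
y = np.array([-1, -1, -1, -1, 1, 1, 1, 1])
w, b = hinge_svm_1d(x, y)
predictions = np.sign(w * x + b)
```

The learned threshold $-b/w$ lands inside the gap between the two classes, matching the single-threshold partition of Figure 1a.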
In the following, to avoid unnecessarily cluttering the presentation, we will stay in the unidimensional space, returning only in the end to the multi-dimensional problem. We will also assume the period $2\pi$ for the directional data.
4. Symmetric Directional SVM
For directional data, the model $f$ has to be adapted, as it should be periodic, continuous, and naturally take positive and negative values in the circular domain (so it can aim to label positive and negative examples correctly). Arguably, the most natural extension of the linear model in $\mathbb{R}$ is the piecewise linear model in the circular domain; see Figure 1b,c. Note that, now, the partition of the input space requires two thresholds.

Motivated by this observation, we explore models of the form:

$$f(x; \boldsymbol{\theta}) = \theta_1\, g(x - \theta_2) + \theta_3,$$

where we start by investigating the following specific realizations for $g$:

- $g = \Lambda$, where $\Lambda$ is the triangle wave with unitary amplitude, period $2\pi$, and maxima at $2k\pi$, $k \in \mathbb{Z}$. This function is piecewise linear, and so, it is close to the linear version in the standard domain.
- $g = \cos$. This option can be seen as a rough approximation to the intuitive choice $\Lambda$ but, as we will see, is analytically more tractable.
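For reference, a direct implementation of the triangle wave and of the generic model above might read as follows (a sketch; the wrapping convention is ours):

```python
import numpy as np

def triangle_wave(x):
    """Triangle wave with unitary amplitude, period 2*pi, and maxima at 2*k*pi."""
    x = np.mod(np.asarray(x, dtype=float) + np.pi, 2 * np.pi) - np.pi  # wrap to [-pi, pi)
    return 1.0 - 2.0 * np.abs(x) / np.pi

def directional_model(x, theta, g=triangle_wave):
    """f(x; theta) = theta1 * g(x - theta2) + theta3."""
    t1, t2, t3 = theta
    return t1 * g(x - t2) + t3
```

The wave attains 1 at multiples of $2\pi$, $-1$ at odd multiples of $\pi$, and crosses zero at odd multiples of $\pi/2$; it is even and $2\pi$-periodic, the two properties used in the next subsection.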
4.1. Expressiveness of the Model
While in the standard domain the linear model is able to express (learn) an arbitrary threshold in the input domain, in the directional domain we need the ability to express any two thresholds. It is easy to conclude that, when instantiated with $g = \Lambda$ or $g = \cos$, the model is able to express two thresholds in the circular domain, whatever their positions are, as formally stated and proven in Proposition 1.
Proposition 1. Let $g$ be an even periodic function with period $2\pi$. For any distinct $x_1$ and $x_2$ in $[0, 2\pi)$, there exists a $\boldsymbol{\theta}$ in $\mathbb{R}^3$ such that $f(x; \boldsymbol{\theta}) = \theta_1\, g(x - \theta_2) + \theta_3$ is zero at $x_1$ and $x_2$.

Proof. Set $\theta_2 = (x_1 + x_2)/2$ and $c = g((x_1 - x_2)/2)$. Note that $x_1 - \theta_2 = (x_1 - x_2)/2$ and $x_2 - \theta_2 = -(x_1 - x_2)/2$, so, by the evenness of $g$, $g(x_1 - \theta_2) = g(x_2 - \theta_2) = c$. Now, setting $\theta_1 = 1$ and $\theta_3 = -c$ yields a model with the desired zeros. □
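The construction in the proof (center the wave between the two desired thresholds and shift it down by the common value) is easy to check numerically for both realizations of $g$; the triangle-wave helper below is our own re-implementation:

```python
import numpy as np

def triangle_wave(x):
    x = np.mod(x + np.pi, 2 * np.pi) - np.pi
    return 1.0 - 2.0 * np.abs(x) / np.pi

x1, x2 = 0.7, 4.1                      # two arbitrary thresholds in [0, 2*pi)
theta2 = 0.5 * (x1 + x2)               # center the wave between them
residuals = []
for g in (np.cos, triangle_wave):      # both realizations are even and 2*pi-periodic
    theta1, theta3 = 1.0, -g(x1 - theta2)
    f = lambda z, g=g, t3=theta3: theta1 * g(z - theta2) + t3
    residuals += [abs(f(x1)), abs(f(x2))]
# every residual is ~0: the model vanishes at both prescribed thresholds
```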
For the $g = \cos$ option, using Equation (5), $f$ can be equivalently written as $f(x; \mathbf{w}) = w_1 \cos x + w_2 \sin x + w_3$, where $(w_1, w_2, w_3) = (\theta_1 \cos \theta_2,\, \theta_1 \sin \theta_2,\, \theta_3)$. Therefore, we consider the following equivalent model:

$$f(x; \mathbf{w}) = w_1 \cos x + w_2 \sin x + w_3. \qquad (8)$$

Similar to the result obtained with the directional logistic regression, the optimization problem in Equation (7) can be efficiently solved by first transforming each directional feature $x$ into two new features, $\cos x$ and $\sin x$, and then relying on efficient methods for the conventional primal SVM, such as Pegasos [5].
Unfortunately, the analogous equivalence does not hold for the triangle wave $\Lambda$. For $g = \Lambda$, $\theta_1 \Lambda(x - \theta_2) + \theta_3$ cannot be written as:

$$f(x; \mathbf{w}) = w_1 \Lambda(x) + w_2 \Lambda(x - \pi/2) + w_3, \qquad (10)$$

where $\mathbf{w} \in \mathbb{R}^3$. Still, we could be led to assume the decomposition Equation (10) as a good approximation when instantiated with $g = \Lambda$ and use it in practice, with the benefit of using standard SVM toolboxes on pre-processed data. However, the expressiveness of this model is quite limited. For instance, the model in Equation (10) is unable to learn two thresholds in $(0, \pi/2)$: since this model is linear in this interval, the result follows.

As such, for $g = \Lambda$, we solve the learning task defined by Equation (7) using sub-gradient methods.
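A possible sketch of this primal learning task is shown below; the gradient is estimated numerically for brevity, and the initialization, random restarts, and step sizes are illustrative choices of ours, not the exact protocol of the paper:

```python
import numpy as np

def triangle_wave(x):
    x = np.mod(x + np.pi, 2 * np.pi) - np.pi
    return 1.0 - 2.0 * np.abs(x) / np.pi

def fit_triangle_svm(x, y, lam=0.01, lr=0.05, iters=3000, seed=0):
    """Minimize lam/2*t1^2 + mean hinge loss of f(x) = t1*Lambda(x - t2) + t3."""
    theta = np.random.default_rng(seed).normal(size=3)

    def loss(th):
        f = th[0] * triangle_wave(x - th[1]) + th[2]
        return 0.5 * lam * th[0] ** 2 + np.mean(np.maximum(0.0, 1.0 - y * f))

    eps = 1e-5
    for _ in range(iters):
        grad = np.array([(loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
        theta -= lr * grad
    return theta, loss(theta)

# Positive class around 0 rad, negative class around pi (labels in {-1, +1}).
ang = np.array([0.1, 0.3, 5.9, 6.0, 2.9, 3.0, 3.2, 3.4])
lab = np.array([1, 1, 1, 1, -1, -1, -1, -1])
theta, _ = min((fit_triangle_svm(ang, lab, seed=s) for s in range(5)),
               key=lambda r: r[1])
preds = np.sign(theta[0] * triangle_wave(ang - theta[1]) + theta[2])
accuracy = np.mean(preds == lab)
```

Because the objective is non-convex in the phase parameter, we keep the best of a few random restarts; a carefully derived sub-gradient would replace the numerical differences in a real implementation.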
5. Kernelized Symmetric Directional SVM
By the representer theorem, the optimal $f$ in Equation (8) has the form:

$$f(x) = \sum_{n=1}^{N} \alpha_n k(x, x_n) + b, \qquad (11)$$

where $k$ is a positive-definite real-valued kernel and $\alpha_n \in \mathbb{R}$. Benefiting from the decomposition of each directional feature in two, this formulation is directly applicable to the primal, fixed margin, directional SVM, when using $g = \cos$. As such, all the conventional kernels can be applied in this extended space.
When the model is instantiated with $g = \cos$, $x$ is mapped in a two-dimensional feature vector, $\phi(x) = (\cos x, \sin x)$, and the inner product between $\phi(x)$ and $\phi(x')$ becomes $\cos x \cos x' + \sin x \sin x' = \cos(x - x')$. As such, the feature transformation can be avoided by setting as the kernel the cosine of the angular difference, $k(x, x') = \cos(x - x')$.
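This identity between the explicit feature map and the angular-difference kernel is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 2 * np.pi, 8)
xp = rng.uniform(0, 2 * np.pi, 8)

phi = lambda a: np.column_stack([np.cos(a), np.sin(a)])   # explicit feature map
inner = np.sum(phi(x) * phi(xp), axis=1)                  # <phi(x), phi(x')>
kernel = np.cos(x - xp)                                   # cosine of the angular difference
gap = np.max(np.abs(inner - kernel))                      # zero up to rounding
```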
As seen before, in the case of $\Lambda$, a similar conclusion does not hold. However, the result for $\cos$ suggests also investigating the interest of using the function $\Lambda(x - x')$ as a kernel. We start by presenting Theorem 1, where we show that a broad family of functions, which includes both $\cos(x - x')$ and $\Lambda(x - x')$, may be used to construct formally-valid kernels.
Theorem 1. Let $h$ be a periodic function with period $T$ and absolutely integrable over one period. Define $g$ as the autocorrelation of $h$, i.e.:

$$g(x) = \int_{0}^{T} h(t)\, h(t + x)\, dt.$$

Then, $k(x, x') = g(x - x')$ is a kernel function, i.e., there exists a mapping $\phi$ from $\mathbb{R}$ to a feature space such that $k(x, x') = \langle \phi(x), \phi(x') \rangle$.
Remark 1. The triangle wave $\Lambda$ is the autocorrelation, as defined in this paper, of a square wave with amplitude $(2\pi)^{-1/2}$ and period $2\pi$.
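Both the theorem and the remark can be probed numerically: discretizing the autocorrelation of a square wave (with the amplitude we assume from the remark, $(2\pi)^{-1/2}$) reproduces the triangle wave, and the Gram matrix of the induced kernel shows no significantly negative eigenvalues:

```python
import numpy as np

T = 2 * np.pi
n = 20000
t = np.linspace(0, T, n, endpoint=False)
h = np.where(np.cos(t) >= 0, 1.0, -1.0) / np.sqrt(T)   # square wave, amplitude (2*pi)^(-1/2)

def autocorr(tau):
    """Riemann sum of g(tau) = integral_0^T h(t) h(t + tau) dt."""
    shift = int(round(tau / T * n))
    return np.sum(h * np.roll(h, -shift)) * (T / n)

def triangle_wave(x):
    x = np.mod(x + np.pi, 2 * np.pi) - np.pi
    return 1.0 - 2.0 * np.abs(x) / np.pi

taus = np.linspace(0, T, 50, endpoint=False)
err = max(abs(autocorr(tau) - triangle_wave(tau)) for tau in taus)

# Gram matrix of k(x, x') = Lambda(x - x') on random angles: PSD up to rounding.
xs = np.random.default_rng(2).uniform(0, T, 30)
min_eig = np.linalg.eigvalsh(triangle_wave(xs[:, None] - xs[None, :])).min()
```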
Having proven the validity of $\Lambda(x - x')$ as a kernel, we now focus on investigating the expressiveness of the resulting SVM. The note made in Section 4.1 supports that the sum of triangle functions centered in fixed positions is not expressive enough, since it cannot place the decision boundaries in arbitrary positions. However, the kernelized version in Equation (11) can still be appealing, since now the models are centered in the training observations and, as such, adapted in number and phase to the training data. For the purpose of this analysis, the notion of the Vapnik–Chervonenkis (VC) dimension [6], given in Definition 1, will be useful.
Definition 1. A parametric binary classifier $f(\cdot\,; \boldsymbol{\theta})$, with parameters $\boldsymbol{\theta}$, is said to shatter a dataset $\{x_1, \ldots, x_N\}$ if, for any label assignment $(y_1, \ldots, y_N) \in \{-1, +1\}^N$, there exist parameters $\boldsymbol{\theta}$ such that $f$ classifies correctly every data point in the dataset. The VC dimension of $f$ is the size $N$ of the largest dataset that is shattered by such a classifier.
Thus, the VC dimension of a classifier provides a measure of its expressive power. In Theorem 2, we establish a result that determines the VC dimension of the kernelized SVMs we are considering.
Theorem 2. Let $k(x, x') = g(x - x')$ be a kernel function where $g$ is defined as in Theorem 1. Furthermore, suppose that $g$ has zero mean value and its Fourier series has exactly $N$ non-zero coefficients. Then, the VC dimension of the classifier:

$$\hat{y}(x) = \operatorname{sign}\Big( \sum_{n} \alpha_n\, g(x - x_n) + b \Big)$$

equals $N + 1$.
The Fourier series of the triangle wave $\Lambda$ has an infinite number of non-zero coefficients, and therefore, the classifier instanced with the triangle wave kernel $k(x, x') = \Lambda(x - x')$ has infinite VC dimension. On the other hand, the VC dimension of the classifier instanced with the cosine kernel $k(x, x') = \cos(x - x')$ equals three. Consequently, the SVM with the triangle wave kernel is able to express an arbitrary number of thresholds in the circular domain, unlike the SVM with the cosine kernel or with the triangle wave in the primal form, which, as proven before, can only express two thresholds in the circular domain.
Figure 2 illustrates these differences.
However, depending on the relative position of the data points, even the SVM with the triangle wave kernel may fail to assign the correct label to all of them. In order to overcome this limitation, composite kernels, constructed from this baseline, can be explored. Typical cases include the polynomial directional kernel, $k(x, x') = (\Lambda(x - x') + 1)^d$, where $d$ is the polynomial degree, and the directional RBF kernel, $k(x, x') = e^{\kappa \cos(x - x')}$ (while the standard RBF kernel relies on the Gaussian expression, the directional RBF kernel relies on the expression of the von Mises distribution).
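A quick numerical sanity check that such composite constructions remain valid kernels is shown below; note that the exact expressions we use, $e^{\kappa \cos(x - x')}$ for the directional RBF and $(\Lambda(x - x') + 1)^d$ for the polynomial directional kernel, are our assumed readings of the text:

```python
import numpy as np

def triangle_wave(x):
    x = np.mod(x + np.pi, 2 * np.pi) - np.pi
    return 1.0 - 2.0 * np.abs(x) / np.pi

def directional_rbf(delta, kappa=2.0):
    """Von-Mises-style directional RBF kernel (assumed form)."""
    return np.exp(kappa * np.cos(delta))

def poly_directional(delta, d=3):
    """Polynomial kernel built on the triangle-wave baseline (assumed form)."""
    return (triangle_wave(delta) + 1.0) ** d

xs = np.random.default_rng(3).uniform(0, 2 * np.pi, 40)
delta = xs[:, None] - xs[None, :]
min_eigs = [np.linalg.eigvalsh(k(delta)).min() for k in (directional_rbf, poly_directional)]
# both Gram matrices are positive semi-definite up to rounding
```

Positivity follows from the closure of kernels under sums, products, and exponentiation, so the empirical eigenvalue check is only a safeguard against implementation mistakes.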
Figure 3a shows a simple training set that is correctly learned both with the primal and kernelized triangle wave formulations. On the other hand, it should be clear that the setting in Figure 3b cannot be correctly learned by these same models. In this case, setting the SVM with one of the composite kernels above achieves the correct labeling.
It is important to note that standard, off-the-shelf, toolboxes can be used to solve the kernelized directional SVM directly. One just needs to properly define the kernel as discussed before.
6. Asymmetric Directional SVM
In Figure 4, we portray a toy dataset together with the model that optimizes Equation (7) using $g = \Lambda$ in the model $f(x; \boldsymbol{\theta}) = \theta_1 \Lambda(x - \theta_2) + \theta_3$. As observed, the margin is determined by the "worst case" transition between positive and negative examples. It is reasonable to assume that a model placing the second threshold centered in the gap between positive and negative examples would generalize better. Shashua and Levin [7] faced a problem with similar characteristics when addressing ordinal data classification in $\mathbb{R}^d$. Similar to them, we propose to maximize the sum of the margins around the two threshold points. Towards that goal, we only need to generalize the model to allow independent slopes in the two parts of the triangle wave, setting $g = \Lambda_s$, where:

$$\Lambda_s(x) = \begin{cases} 1 - \dfrac{x}{\pi(1 - s)}, & 0 \le x < 2\pi(1 - s),\\[4pt] -1 + \dfrac{x - 2\pi(1 - s)}{\pi s}, & 2\pi(1 - s) \le x < 2\pi, \end{cases}$$

extended periodically with period $2\pi$, for $s \in (0, 1)$.
Here, $s$ controls the asymmetry of the wave: as $s \to 0$, the wave has infinite ascending slope; as $s \to 1$, the wave has infinite descending slope; and for $s = 1/2$, it coincides with the symmetric case, $\Lambda$. The wave is depicted in Figure 5 for some values of $s$.
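A sketch of the asymmetric wave under the parameterization described above (falling over a fraction $1 - s$ of the period and rising over the rest; this exact parameterization is our reading of the text):

```python
import numpy as np

def asym_triangle(x, s=0.5):
    """Asymmetric triangle wave with period 2*pi and maxima at 2*k*pi:
    it falls over a fraction (1 - s) of the period and rises over the rest."""
    x = np.mod(x, 2 * np.pi)                      # wrap to [0, 2*pi)
    split = 2 * np.pi * (1 - s)                   # end of the descending branch
    return np.where(x < split,
                    1.0 - x / (np.pi * (1 - s)),           # descending: +1 down to -1
                    -1.0 + (x - split) / (np.pi * s))      # ascending: -1 up to +1

def triangle_wave(x):
    x = np.mod(x + np.pi, 2 * np.pi) - np.pi
    return 1.0 - 2.0 * np.abs(x) / np.pi

# s = 1/2 recovers the symmetric wave
grid = np.linspace(0, 2 * np.pi, 200, endpoint=False)
sym_gap = np.max(np.abs(asym_triangle(grid, 0.5) - triangle_wave(grid)))
```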
It should be clear that the model instanced with $g = \Lambda_s$ retains the same expressiveness as before, being able to express two thresholds (and not more than two) in any position in the circular domain.
As before with $\Lambda$, it is not possible to solve the optimization problem as a conventional SVM, and we again directly optimize the goal function using sub-gradient methods.
7. Kernelized Asymmetric Directional SVM
Before, motivated by the behavior with $g = \cos$ and the representer theorem, we explored models of the form $f(x) = \sum_n \alpha_n \Lambda(x - x_n) + b$. Using the decomposition depicted in Figure 6, $\Lambda = \Lambda^{+} + \Lambda^{-}$, with $\Lambda^{+}$ and $\Lambda^{-}$ carrying, respectively, the ascending and descending parts of the wave, we can rewrite the model as $f(x) = \sum_n \alpha_n \big( \Lambda^{+}(x - x_n) + \Lambda^{-}(x - x_n) \big) + b$. We can now gain independence in the two slopes of the model by extending it to:

$$f(x) = \sum_n \big( \alpha_n^{+} \Lambda^{+}(x - x_n) + \alpha_n^{-} \Lambda^{-}(x - x_n) \big) + b,$$

where $\alpha_n^{+}$ and $\alpha_n^{-}$ are two independent parameters to be optimized from the training set.
8. The Multi-Dimensional Setting
The extension of the ideas presented before to the multi-dimensional setting is easy. For this purpose, assume our data consist of both directional and non-directional components. This allows each data example to be represented as a vector $\mathbf{x} = (\mathbf{x}^{d}, \mathbf{x}^{\ell})$, where $\mathbf{x}^{d}$ represents the directional components and $\mathbf{x}^{\ell}$ represents the non-directional ones. Suppose we wish to represent the $i$th directional component $x_i^{d}$ in a feature space $\mathcal{F}_i$, through a mapping $\phi_i$, and the non-directional ones in a feature space $\mathcal{F}_\ell$, through a mapping $\phi_\ell$. Then, our model $f$ becomes:

$$f(\mathbf{x}) = \sum_i \langle \mathbf{w}_i, \phi_i(x_i^{d}) \rangle + \langle \mathbf{w}_\ell, \phi_\ell(\mathbf{x}^{\ell}) \rangle + b, \qquad (16)$$
where $\mathbf{w}_i \in \mathcal{F}_i$ and $\mathbf{w}_\ell \in \mathcal{F}_\ell$. Therefore, in the standard setting where the feature spaces are fixed and possibly infinite dimensional, but the respective inner products have a closed form, we may use the kernel trick to solve the optimization problem. Such a kernel is an inner product in the joint feature space $\mathcal{F}_1 \times \cdots \times \mathcal{F}_\ell$ and equals the sum of the individual kernels:

$$k(\mathbf{x}, \mathbf{x}') = \sum_i k_i(x_i^{d}, x_i'^{d}) + k_\ell(\mathbf{x}^{\ell}, \mathbf{x}'^{\ell}),$$

where $k_i(x, x') = \langle \phi_i(x), \phi_i(x') \rangle$ and $k_\ell(\mathbf{x}, \mathbf{x}') = \langle \phi_\ell(\mathbf{x}), \phi_\ell(\mathbf{x}') \rangle$.
If the feature mappings $\phi_i$ are finite-dimensional functions that depend also on parameters to be optimized, like, for instance, in the case of $g = \Lambda_s$, the kernel itself becomes dependent on such parameters. In this setting, we opted to plug Equation (16) directly into Equation (7), solving the problem directly in its primal form using gradient-based optimization.

For simplicity, we set $\phi_\ell$ to the identity in our experiments, inducing the usage of the linear kernel for all the non-directional components.
9. Experiments
In this section, we detail the experimental evaluation of the proposed directional support vector machines against two state-of-the-art directional classifiers: the von Mises naive Bayes [8] and the directional logistic regression [4]. Following [4], the concentration parameter $\kappa$ of the von Mises distribution was approximated by 100 iterations (a much larger number of iterations than required to have good convergence values) of Newton's method proposed by Sra [9].
The SVM regularization constant $\lambda$ was chosen using a stratified 3-fold cross-validation strategy over a predefined range of values, and the concentration parameter $\kappa$ of the directional RBF kernel was selected in the same way. The primal directional SVM with fixed margin was randomly initialized and optimized using Adam [10] for 500 iterations. On the other hand, we initialized the asymmetric primal SVM with the fixed-margin parameters obtained after 400 iterations; then, all the parameters were fine-tuned for an additional 100 iterations. Using pre-trained parameters for the SVM with an asymmetric margin facilitates convergence, given the coupled effect of the $s$ and $\boldsymbol{\theta}$ parameters. The kernelized directional versions with cosine and triangular kernels were optimized using the standard LIBSVM [11], which implements an SMO-type algorithm proposed in [12]. For the asymmetric kernel, we used the aforementioned approach in order to fine-tune the $\alpha$ coefficients obtained by the standard toolkit.
We validated the advantages of the proposed approach using 12 publicly-available datasets: Arrhythmia [13], Behavior [14], Characters [15], Colposcopy, Continents, eBay [16], MAGIC [17], Megaspores [18], OnlineNews [19], Temperature1, Temperature2, and Wall [20]. Relevant properties of these datasets (e.g., number of directional and non-directional features, number of classes, dimensionality) are presented in Appendix B. Experiments in previous works [4,8] have shown that directional classifiers outperform traditional ones in these datasets, proving that directionality is an important attribute to exploit. Further details about the datasets, including their acquisition and preprocessing, were presented in [4]. Additionally, in order to facilitate the convergence of the SVM-based models, all the non-directional features were scaled to the range 0–1.
Multiclass instances were handled using a one-versus-one approach for all the binary models (i.e., logistic regression and support vector machines). All the experiments detailed below were executed with a 3-fold stratified cross-validation technique (i.e., by preserving the percentage of samples for each class), selecting the best model in terms of accuracy, and the results of 30 different runs were averaged. Specifically, for each model and dataset, we evaluated the accuracy and the macro F1-score, which corresponds to the unweighted mean value of the individual F1-scores of each class. Results of these experiments are summarized in Table 1 and Table 2, exhibiting average accuracy and macro F1-score, respectively, for 30 independent runs. The best results for each dataset are marked in bold. For reproducibility purposes, the source code and the training-testing partitions are made available (https://github.com/dpernes/dirsvm). The results achieved by the von Mises naive Bayes (vMNB) and directional Logistic Regression (dLR) align with the results reported in the literature [4].
Hereafter, we will denote by non-Kernelized directional SVMs (nK-dSVM) the subset of proposed SVM variations with VC dimension equivalent to the one induced by the directional logistic regression; namely, the primal fixed-margin directional SVM with triangle (symmetric and asymmetric) and cosine waves. The remaining models (i.e., directional RBF, symmetric, and asymmetric kernels) will be referred to as Kernelized directional SVMs (K-dSVM).
Although some datasets used here are considerably imbalanced, accuracy and macro F-score values were fairly consistent with each other, in the sense that the best model in terms of accuracy was the top-1 model in terms of macro F-score in 10 of 12 datasets and was among the top-2 models in all datasets. While the dLR achieved a competitive general performance, it was surpassed by at least one of the proposed SVM alternatives in most cases. nK-dSVM performed better than dLR on small datasets, given the margin regularization imposed by the SVM loss function. For larger datasets, dLR performed better since the generalization induced by the nK-dSVM margin became less relevant. However, for large datasets, K-dSVM surpassed dLR and their non-kernelized counterparts in most cases. In general, dSVM with asymmetric margins (kernelized and non-kernelized) attained the best results, obtaining the best average performance on half of the datasets.
As shown in
Section 5, kernels involving the triangle wave correspond to inner products in an infinite-dimensional feature space. The same is also true for the directional RBF kernel. Non-kernelized methods, on the other hand, are constructed by explicitly defining the feature transformation, having a necessarily finite VC dimension. Therefore, the former produce models with higher capacity, which may lead to overfitting in small datasets, but better accuracy for large ones. This is confirmed by our experiments: the non-kernelized models achieved the best results in small datasets, while kernelized models built on top of the triangle wave and directional RBF kernel attained the best results in large datasets. The performance gains of kernelized models on the larger datasets were small, however, which may be explained by the unimodal distribution of the angular variables. On datasets with a multi-modal distribution of the directional variables, it is expected to observe higher gains by K-dSVM.
Towards Deep Directional Classifiers
Deep neural networks have achieved remarkable results in multiple machine learning problems and, particularly, in supervised classification. SVMs, on the one hand, typically decouple the data representation problem from the learning problem, by first projecting the data into a prespecified feature space and then learning a hyperplane that separates the two classes. Deep networks, on the other hand, jointly learn the data representation and the decision function, exhibiting superior performance mostly when trained on large datasets. In the context of directional data, we argue that significant performance improvements might be attained by combining the angular awareness of directional feature transformations or kernels with the representation learning provided by deep neural networks.
In order to evaluate the potential of deep classifiers for directional data, we present two further experiments in this section. Specifically, we trained two Multilayer Perceptrons (MLPs), which were essentially identical, except for one important difference: one of them (denoted by rMLP) was trained on top of raw angle values (normalized to lie in a single period); the other one (denoted by dMLP) was trained on top of the feature transformation $x \mapsto (\cos x, \sin x)$, which defines the cosine kernel, applied to all angular components. The latter is a first attempt towards deep directional classifiers, while the former is completely unaware of the directionality of the data. Each hidden layer in the MLPs had the following structure: fully-connected transformation (dense layer) with 256 output neurons + batch normalization [21] + ReLU + dropout [22]. The output layer is a standard fully-connected transformation followed by a sigmoid, in the case of binary classification, or a softmax, when there are more than two classes. The models were trained to minimize the usual cross-entropy loss with $L_2$ regularization. The total number of layers was chosen between 4 and 5 using 3-fold cross-validation, together with the remaining hyperparameters (dropout rate, $L_2$ regularization weight, and learning rate). Training was performed for 200 epochs or until the loss plateaued. The training protocol, including the evaluated datasets, the number of runs for each dataset, and the evaluated metrics, was exactly the same as in the previous set of experiments.
Results are in Table 3a,b, where we show again the values of our most accurate SVM (denoted by best-SVM) in each dataset for easier comparison. Like before, we observed high consistency between the accuracy and macro F1-score values. As expected, rMLP had the worst overall performance, and this effect was mostly apparent in small datasets where the number of directional features was in the same order of magnitude as the number of non-directional ones, like Colposcopy and eBay (see Appendix B). In larger datasets, and in those where the number of non-directional features is much larger than the number of directional ones (e.g., Behavior, OnlineNews), rMLP achieved more competitive results. The exception was the MAGIC dataset, where the single directional feature seems to have a high discriminative power, and so, rMLP achieved the lowest performance among the three models. Contrary to what we just observed for rMLP, the gains of dMLP were highly encouraging. This model, built on top of a directional feature transformation, generally outperformed best-SVM in larger datasets and achieved competitive results even in smaller ones. This observation reinforces the role of directionality in these datasets and, more importantly, motivates further research to merge directional feature transformations and/or kernels with deep neural networks, which we plan to develop as future work.
10. Conclusions
Several concepts in real-life applications are represented by directional variables, from periodic time representation on calendars to compass directions. Traditional classifiers, which are unaware of the angular nature of these variables, might not properly model the data. Thereby, the study of directional classifiers is relevant for the machine learning community. Previous attempts to address classification tasks with directional variables focused on generative models [8] and discriminative linear models (logistic regression) [4].
In this work, we proposed several instantiations of directional-aware support vector machines. First, we modified the SVM decision function by considering parametric periodic mappings of the directional variables using cosine and triangle waves. Then, we proposed an extension of the model with triangular waves in order to allow asymmetric margins on the circle. The kernelized versions of these models were proposed as well. Furthermore, we analyzed and demonstrated the expressiveness of each proposed alternative.
In the experimental assessment, the relevance of the proposed models was evaluated, being able to achieve competitive results in most datasets. As expected, when compared to other shallow directional classifiers, kernelized models built on top of the triangle wave attained the best results in larger datasets, due to their large expressive power, which we have proven theoretically. One extra experiment combining a directional feature transformation and a deep neural network showed very promising results and clearly motivates further research.
Since the additional parameters involved in our asymmetric SVMs (in both kernelized and non-kernelized versions) have a periodic impact on the decision boundary or are constrained to a specific domain, using gradient-based optimization techniques may result in sub-optimal models. While this problem was circumvented by using fine-tuning from simpler models, there is room for the design and exploration of optimization techniques specific to these models. Furthermore, deep multiple kernel learning [23] is an unexplored research line in directional data settings that may lead to a unified framework combining directional kernel machines and deep neural networks.