3.1. SA-DMKL Architecture
The choice of kernel functions is crucial to the performance of a kernel learning algorithm. Multiple kernel learning tries to select appropriate kernel functions according to some criterion. However, the number of parameters grows rapidly with the number of kernel functions, and deep multiple kernel learning makes the situation worse. Moreover, the fixed architecture of DMKL cannot adapt to the complexity of the training data. This paper proposes a self-adaptive deep multiple kernel learning (SA-DMKL) architecture to tackle these problems.
Our architecture is not very sensitive to the initial parameter settings of the candidate kernel functions. In each layer, the parameters of each candidate base kernel function are optimized with the grid search method. The number of layers is not fixed in the SA-DMKL architecture: if the parameter settings are inappropriate, the model learning algorithm can adjust the architecture at the next layer. As a result, SA-DMKL consists of multiple layers of multiple base kernels, as shown in
Figure 1. In the first layer, the training data is used to train a support vector machine (SVM) for each candidate base kernel separately, and the parameters of each base kernel are adjusted by the grid search method. Each base kernel function is evaluated with a generalization bound based on Rademacher chaos complexity, and the base kernels with larger generalization bounds are dropped. The outputs of the remaining kernel functions construct a new feature space, whose dimension equals the number of remaining kernel functions. Each training sample thus corresponds to a new sample in the new feature space. These new samples are the input of the next layer, where each candidate base kernel is again trained with an SVM. In the final layer, a single kernel-based SVM classifies the training data.
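For concreteness, the following Python sketch shows how one such layer could be built, assuming scikit-learn as the SVM library. The grid search, the bound-based dropout, and the construction of the new feature space follow the description above; the `bound` callable is a placeholder for the Rademacher-chaos generalization bound of Section 3.2, and all function and parameter names are illustrative, not the authors' implementation.

```python
# A minimal sketch of one SA-DMKL layer, assuming scikit-learn SVMs.
# `bound` is a placeholder for the generalization bound of Section 3.2.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def build_layer(X_train, y_train, X_test, candidate_kernels, param_grids,
                bound, epsilon):
    """Train one SVM per candidate base kernel, drop kernels whose
    generalization bound exceeds `epsilon`, and stack the surviving SVMs'
    outputs into a new feature space (one dimension per surviving kernel)."""
    kept_models, train_cols, test_cols = [], [], []
    for kernel, grid in zip(candidate_kernels, param_grids):
        # Grid search adjusts each base kernel's parameters.
        search = GridSearchCV(SVC(kernel=kernel), grid, cv=3)
        search.fit(X_train, y_train)
        model = search.best_estimator_
        if bound(model, X_train, y_train) > epsilon:
            continue  # a large bound indicates poor generalization ability
        kept_models.append(model)
        train_cols.append(model.decision_function(X_train))
        test_cols.append(model.decision_function(X_test))
    if not kept_models:  # every kernel was dropped: signal the caller
        return [], None, None
    # Each surviving kernel contributes one coordinate of the new samples.
    return kept_models, np.column_stack(train_cols), np.column_stack(test_cols)
```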
3.2. Model Learning Algorithm
Given a set of training data $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is the feature vector and $y_i \in \{-1, +1\}$ is the class label, our goal is to learn a deep multiple kernel network and a classifier $f$ from the labeled training data. Here, $f$ is an SVM-based classifier.
Let $K = \{k_t \mid t = 1, \ldots, m\}$ be a set of base candidate kernel functions, where $k_t(x, x') = \langle \phi_t(x), \phi_t(x') \rangle$ and $\phi_t$ is a feature map function. The Rademacher chaos complexity of $K$ is estimated according to the following rules: if $k$ is a Gaussian-type kernel, then its complexity is bounded as in inequation (3), where $e$ is the base of the natural logarithm and $m$ is the number of elements of the base kernel function set $K$. The generalization bound can be summarized from inequations (3) to (5). The local Lipschitz constant is estimated according to Equations (6) and (7), which involve the loss function and the regularization parameter of a two-layer minimization problem.
In our model learning algorithm, the generalization bound is used to select the base kernel functions. If the generalization bound of a base kernel is larger than the threshold, the algorithm drops that kernel, since a large bound indicates poor generalization ability.
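As a one-line illustration of this selection rule (with `bounds` a hypothetical list of the per-kernel bound estimates $B_t$ and `epsilon` the threshold from Algorithm 1):

```python
# Indices of the base kernels that survive the generalization-bound test.
kept = [t for t, b in enumerate(bounds) if b <= epsilon]
```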
The learning performance of the SA-DMKL method is evaluated in terms of test accuracy according to Equation (8), which is the proportion of correctly classified samples to the total number of samples:

$$\text{Accuracy} = \frac{TP + TN}{N}, \qquad (8)$$

where $TP$ is the number of true positives, $TN$ is the number of true negatives, and $N$ is the total number of instances in the test set.
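Equation (8) translates directly into code; a minimal sketch, assuming binary labels in $\{-1, +1\}$ and hypothetical arrays `y_true` and `y_pred`:

```python
import numpy as np

def test_accuracy(y_true, y_pred):
    """Equation (8): accuracy = (TP + TN) / N for binary labels in {-1, +1}."""
    tp = np.sum((y_true == 1) & (y_pred == 1))    # true positives
    tn = np.sum((y_true == -1) & (y_pred == -1))  # true negatives
    return (tp + tn) / len(y_true)                # N: total test instances
```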
In the model learning algorithm, the test accuracy is evaluated at each layer to decide whether the growth of the model ceases. If the best test accuracy does not change for a fixed number of iterations, the learning algorithm stops.
The overall procedure of our model learning algorithm is described in Algorithm 1.
Algorithm 1 SA-DMKL algorithm.
Input: $m$: number of candidate kernels; $\theta$: initial parameters of each kernel function; $D$: dataset; $l$: maximum number of layers in which the best accuracy does not change; $\epsilon$: threshold value of the generalization bound.
Output: final model $M$.
1: Initialize the best accuracy $acc_{best} = 0$;
2: Initialize the maximum number of iterations $max\_iter$;
3: Initialize the current iteration $i = 1$ and the flag $j = 0$;
4: repeat
5:  Randomly select 60 percent of the samples from the entire dataset $D$ as training samples $D_{train}$;
6:  Use the grid search method to adjust the initial parameters $\theta$;
7:  Use $D_{train}$ to train the $m$ candidate kernel functions, creating $m$ SVMs;
8:  Use the $m$ SVMs to predict the rest of the dataset $D - D_{train}$ and compute the test accuracy $acc_t$, the generalization bound $B_t$, and the kernel output $G_t$, where $t = 1, \ldots, m$;
9:  Initialize the loop parameter $t = 1$ and the new dataset $D_{new} = \emptyset$;
10:  repeat
11:   If $B_t \le \epsilon$ then concatenate $D_{new}$ and $G_t$ to generate a new $D_{new}$;
12:   If $acc_t > acc_{best}$ then assign $acc_t$ to $acc_{best}$ and assign 0 to $j$;
13:   $t = t + 1$;
14:  until $t > m$;
15:  If $acc_{best}$ did not change then add one to the flag $j$;
16:  Add one to $i$ and assign $D_{new}$ to $D$;
17: until ($j \ge l$ or $i > max\_iter$)
According to Algorithm 1, each iteration from step 4 to step 17 builds one layer of the SA-DMKL architecture. In Algorithm 1, $i$ stands for the layer number, and $j$ records the number of layers during which the best accuracy $acc_{best}$ remains the same. Steps 5 to 7 train the $m$ SVMs. In step 8, $G_t$ is calculated by the $t$-th kernel function with input data $D$ and is used in the next layer. In each iteration, the accuracy is evaluated for each SVM. If the best accuracy does not change for $l$ consecutive layers, the iteration stops. Furthermore, in steps 10 to 14, the kernel of an SVM is discarded if its generalization bound is higher than the threshold value.
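The sketch below assembles these pieces into the overall loop of Algorithm 1, reusing the illustrative `build_layer` function from Section 3.1. The 60/40 split, the stagnation counter $j$, and the stopping conditions follow the algorithm text; `bound`, `epsilon`, and the other names remain placeholders rather than the authors' code.

```python
# A minimal sketch of the overall loop of Algorithm 1, reusing the
# illustrative build_layer() from Section 3.1.
import numpy as np
from sklearn.model_selection import train_test_split

def sa_dmkl(X, y, candidate_kernels, param_grids, bound, epsilon,
            l=3, max_iter=10):
    best_acc, j = 0.0, 0            # steps 1-3: best accuracy and flag
    final_models = None
    for i in range(max_iter):       # steps 4-17: each pass builds one layer
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6)
        models, F_tr, F_te = build_layer(X_tr, y_tr, X_te, candidate_kernels,
                                         param_grids, bound, epsilon)
        if not models:              # all kernels were dropped: stop growing
            break
        accs = [m.score(X_te, y_te) for m in models]  # step 8: test accuracy
        if max(accs) > best_acc:    # step 12: a new best accuracy resets j
            best_acc, j, final_models = max(accs), 0, models
        else:                       # step 15: best accuracy unchanged
            j += 1
        if j >= l:                  # step 17: l stagnant layers in a row
            break
        # Step 16: the surviving kernels' outputs become the next dataset.
        X = np.vstack([F_tr, F_te])
        y = np.concatenate([y_tr, y_te])
    return final_models, best_acc
```

Note that, as in step 16, each new layer re-splits the transformed dataset, so the architecture keeps growing only while the bound-filtered kernels improve the best accuracy.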