Documentation - C API
PEGASOS SVM solver
Author:
Andrea Vedaldi

pegasos.h provides a basic implementation of the PEGASOS [9] linear SVM solver.

Overview

PEGASOS solves the linear SVM learning problem

\[ \min_{w} \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell(w; (x_i,y_i)) \]

where $ x_i $ are data vectors in $ \mathbb{R}^d $, $ y_i \in \{-1,1\} $ are binary labels, $ m $ is the number of training pairs, $ \lambda > 0 $ is the regularization parameter, and

\[ \ell(w;(x_i,y_i)) = \max\{0, 1 - y_i \langle w,x_i\rangle\} \]

is the hinge loss. The result of the optimization is a model $ w \in \mathbb{R}^d $ that yields the decision function

\[ F(x) = \operatorname{sign} \langle w, x\rangle. \]

It is well known that the hinge loss is a convex upper bound of the 01-loss of the decision function:

\[ \ell(w;(x,y)) \geq \frac{1}{2}(1 - y F(x)). \]

PEGASOS is accessed by calling vl_pegasos_train_binary_svm_d or vl_pegasos_train_binary_svm_f, operating respectively on double or float data.
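For concreteness, the hinge loss and the decision function above amount to the following computations (a minimal C sketch; hinge_loss and decision are illustrative helper names, not part of the pegasos.h API, and data vectors are assumed to be stored as plain double arrays):

#include <stddef.h>

/* Hinge loss of a model w on one labeled sample (x, y):
   max{0, 1 - y <w, x>}. A sketch; w and x have `dimension` entries. */
double hinge_loss (double const * w, double const * x, int y, size_t dimension)
{
  double score = 0.0 ;
  size_t i ;
  for (i = 0 ; i < dimension ; ++i) score += w[i] * x[i] ;
  return (1.0 - y * score > 0.0) ? (1.0 - y * score) : 0.0 ;
}

/* Decision function F(x) = sign <w, x>, returning +1 or -1. */
int decision (double const * w, double const * x, size_t dimension)
{
  double score = 0.0 ;
  size_t i ;
  for (i = 0 ; i < dimension ; ++i) score += w[i] * x[i] ;
  return (score >= 0.0) ? +1 : -1 ;
}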

Algorithm

PEGASOS [9] is a stochastic subgradient optimizer. At the t-th iteration the algorithm:

  • Samples uniformly at random a subset $ A_t $ of $ k $ training pairs $(x,y)$ from the $ m $ pairs provided for training (this subset is called a mini batch).
  • Computes a subgradient $ \nabla_t $ of the function $ E_t(w) = \frac{\lambda}{2}\|w\|^2 + \frac{1}{k} \sum_{(x,y) \in A_t} \ell(w;(x,y)) $ (this is the SVM objective function restricted to the mini batch).
  • Computes an intermediate weight vector $ w_{t+1/2} $ by taking a step $ w_{t+1/2} = w_t - \alpha_t \nabla_t $ along the subgradient, with learning rate $ \alpha_t = 1/(\lambda t) $. Note that the learning rate is inversely proportional to the iteration number.
  • Back projects the weight vector $ w_{t+1/2} $ onto the ball of radius $ 1/\sqrt{\lambda} $ to obtain the next model estimate $ w_{t+1} $:

    \[ w_{t+1} = \min\left\{1, \frac{1/\sqrt{\lambda}}{\|w_{t+1/2}\|}\right\} w_{t+1/2}. \]

    This ball is guaranteed to contain the optimal weight vector $ w^* $ [9].

The VLFeat implementation fixes the size of the mini batches to $ k = 1 $.
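
The update is simple enough to be spelled out. The following is a minimal C sketch of a single iteration for $ k = 1 $; it mirrors the steps listed above but is not the code of pegasos.c, and the function name, argument layout, and contiguous data storage are assumptions made for the example:

#include <math.h>
#include <stdlib.h>

/* One PEGASOS iteration with a mini batch of size k = 1. `w` has `dimension`
   entries, `data` stores the samples contiguously (sample j starts at
   data + j * dimension), `labels` holds +1/-1 entries, and `t >= 1` is the
   iteration number. */
void pegasos_iteration (double * w,
                        double const * data, signed char const * labels,
                        size_t dimension, size_t numSamples,
                        double lambda, size_t t)
{
  size_t j = (size_t) rand () % numSamples ;   /* sample one pair (x, y) */
  double const * x = data + j * dimension ;
  double y = (double) labels[j] ;
  double rate = 1.0 / (lambda * (double) t) ;  /* learning rate 1 / (lambda t) */
  double score = 0.0, norm2 = 0.0, scale ;
  size_t i ;

  for (i = 0 ; i < dimension ; ++i) score += w[i] * x[i] ;

  /* subgradient step: shrink w by (1 - rate * lambda) and, if the hinge
     loss is active, add rate * y * x */
  for (i = 0 ; i < dimension ; ++i) {
    w[i] *= (1.0 - rate * lambda) ;
    if (y * score < 1.0) w[i] += rate * y * x[i] ;
  }

  /* back projection onto the ball of radius 1 / sqrt(lambda) */
  for (i = 0 ; i < dimension ; ++i) norm2 += w[i] * w[i] ;
  if (norm2 > 0.0) {
    scale = 1.0 / (sqrt (lambda) * sqrt (norm2)) ;
    if (scale < 1.0) {
      for (i = 0 ; i < dimension ; ++i) w[i] *= scale ;
    }
  }
}

Calling this function for $ t = 1, 2, \dots $ produces the sequence of estimates $ w_t $.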

Bias

The PEGASOS SVM formulation does not incorporate a bias term. To learn an SVM with bias, each data vector $ x $ can be extended by a constant component $ B $ (called biasMultiplier in the code). In this case the model $ w $ has dimension $ d + 1 $ and the SVM discriminant function is given by $ F(x) = \operatorname{sign} (\langle w_{1:d}, x\rangle + w_{d+1} B) $. If the bias multiplier $ B $ is large enough, the weight $ w_{d+1} $ remains small and contributes little to the SVM regularization term $ \| w \|^2 $, better approximating the case of an SVM with bias. Unfortunately, setting the bias multiplier $ B $ to a large value also makes the optimization harder.
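
As an illustration, extending the data with the bias multiplier is a simple copy operation (a sketch; append_bias_component is a hypothetical helper and samples are assumed to be stored contiguously as plain double arrays):

#include <stdlib.h>
#include <string.h>

/* Return a newly allocated (dimension + 1) x numSamples array in which every
   sample is copied and extended by the constant component B (the
   biasMultiplier). The caller owns (and must free) the returned buffer. */
double * append_bias_component (double const * data,
                                size_t dimension, size_t numSamples, double B)
{
  double * extended = malloc ((dimension + 1) * numSamples * sizeof (double)) ;
  size_t j ;
  if (! extended) return NULL ;
  for (j = 0 ; j < numSamples ; ++j) {
    memcpy (extended + j * (dimension + 1),
            data + j * dimension,
            dimension * sizeof (double)) ;
    extended [j * (dimension + 1) + dimension] = B ;
  }
  return extended ;
}

After training on the extended data, the bias of the learned discriminant function is recovered as $ w_{d+1} B $.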

Restarting

The VLFeat PEGASOS implementation can be restarted after any given number of iterations. This is useful to compute intermediate statistics or to load new data from disk for large datasets. The state of the algorithm, which is required for restarting, is limited to the current estimate $ w_t $ of the SVM weight vector and the iteration number $ t $.
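
For example, training can be split into rounds that carry over only $ w $ and $ t $. The sketch below reuses the illustrative pegasos_iteration function given earlier rather than the library call; train_in_rounds, numRounds, and iterationsPerRound are hypothetical names, and the exact way the starting iteration is passed to vl_pegasos_train_binary_svm_d or vl_pegasos_train_binary_svm_f should be taken from pegasos.h.

#include <stddef.h>

/* Single-iteration sketch from the Algorithm section. */
void pegasos_iteration (double * w, double const * data,
                        signed char const * labels, size_t dimension,
                        size_t numSamples, double lambda, size_t t) ;

/* Train in several rounds; the only state carried across rounds is the
   weight vector w and the iteration counter t. */
void train_in_rounds (double * w, size_t dimension,
                      double const * data, signed char const * labels,
                      size_t numSamples, double lambda,
                      size_t numRounds, size_t iterationsPerRound)
{
  size_t t = 1, r, iter ;
  for (r = 0 ; r < numRounds ; ++r) {
    for (iter = 0 ; iter < iterationsPerRound ; ++iter, ++t) {
      pegasos_iteration (w, data, labels, dimension, numSamples, lambda, t) ;
    }
    /* inspect w here (e.g. objective value or validation error); for data
       that does not fit in memory, each round could instead operate on a
       freshly loaded chunk */
  }
}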

Permutation

VLFeat PEGASOS can use a user-defined permutation to decide the order in which data points are visited (instead of using random sampling). By specifying a permutation the algorithm is guaranteed to visit each data point exactly once in each loop over the data. The permutation need not be bijective: this can be used to visit certain data samples more or less often than others, implicitly reweighting their relative importance in the SVM objective function, for instance to balance the data.
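
As an example of a non-bijective visiting order, the following sketch lists every positive sample twice and every negative sample once, implicitly doubling the weight of the positive class (make_balancing_order is a hypothetical helper; the exact type and argument through which pegasos.h accepts the permutation should be taken from the header):

#include <stdlib.h>

/* Build a visiting order that lists every positive sample twice and every
   negative sample once. Any map onto {0, ..., numSamples - 1}, bijective or
   not, defines a valid order. The caller owns the returned array of length
   *orderSize. */
unsigned int * make_balancing_order (signed char const * labels,
                                     size_t numSamples, size_t * orderSize)
{
  size_t numPositive = 0, i, k = 0 ;
  unsigned int * order ;
  for (i = 0 ; i < numSamples ; ++i) if (labels [i] > 0) ++ numPositive ;
  *orderSize = numSamples + numPositive ;
  order = malloc (*orderSize * sizeof (unsigned int)) ;
  if (! order) return NULL ;
  for (i = 0 ; i < numSamples ; ++i) {
    order [k ++] = (unsigned int) i ;
    if (labels [i] > 0) order [k ++] = (unsigned int) i ; /* positives twice */
  }
  return order ;
}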

Non-linear kernels

PEGASOS can be extended to non-linear kernels, but the algorithm is not particularly efficient in this setting [1]. When possible, it may be preferable to work with explicit feature maps.

Let $ k(x,y) $ be a positive definite kernel. A feature map is a function $ \Psi(x) $ such that $ k(x,y) = \langle \Psi(x), \Psi(y) \rangle $. Using this representation, the non-linear SVM learning objective function can be written as

\[ \min_{w} \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{i=1}^m \ell(w; (\Psi(x_i),y_i)). \]

Thus the only difference with the linear case is that the feature $ \Psi(x) $ is used in place of the data $ x $.

$ \Psi(x) $ can be computed off-line, for instance by using the incomplete Cholesky decomposition $ V^\top V $ of the Gram matrix $ K = [k(x_i,x_j)] $ (in this case $ \Psi(x_i) $ is the i-th column of $ V $). Alternatively, for additive kernels (e.g. intersection, Chi2) the explicit feature map computed by homkermap.h can be used.
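
In either case the training pipeline only changes in that the mapped vectors are fed to the solver in place of the original data. A minimal sketch, where compute_feature_map is a hypothetical stand-in for whichever map is chosen (a column of the incomplete Cholesky factor, or an additive-kernel map such as those in homkermap.h):

#include <stddef.h>

/* Hypothetical feature map: fills psi (mappedDimension entries) with Psi(x)
   for a dimension-dimensional input x. */
void compute_feature_map (double const * x, size_t dimension,
                          double * psi, size_t mappedDimension) ;

/* Map all samples; the mapped data is then passed to the linear solver in
   place of the original data, and nothing else in the pipeline changes. */
void map_all_samples (double const * data, size_t dimension, size_t numSamples,
                      double * mappedData, size_t mappedDimension)
{
  size_t j ;
  for (j = 0 ; j < numSamples ; ++j) {
    compute_feature_map (data + j * dimension, dimension,
                         mappedData + j * mappedDimension, mappedDimension) ;
  }
}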