Support Vector Machines and Kernels

Yao Yao on December 5, 2014

I never quite got this straight before, so let me sort out my thinking here.

Summarized from Support Vector Machines and Kernels for Computational Biology, CS229 Lecture Notes 3, and ESL.


1. Intro

To start with, we will be considering a linear classifier for a binary classification problem with labels $ y \in \{-1,1\} $ and features $ x $. We will use parameters $ w $, $ b $ instead of $ \theta $, and write our classifier as

$$ h_{w,b}(x) = g(w^T x + b) $$

Here, $ g(z) = 1 $ if $ z \geq 0 $, and $ g(z) = -1 $ otherwise.

We call

$$ f(x) = w^T x + b $$

the discriminant function.

$ f(x) = 0 $ defines our hyperplane; I will skip the usual intuition here.

When thinking about the margin, there is one thing to be careful about: how the margin is measured. Following CS229 Lecture Notes 3, we actually have

$$ \gamma^{(i)} = y^{(i)} \left( \left( \frac{w}{\left \vert w \right \vert} \right)^T x^{(i)} + \frac{b}{\left \vert w \right \vert} \right) $$

This is a bit awkward to remember. An equivalent way to state it: if $ x^{\ast} $ is a support vector, then $ f(x^{\ast}) = w^T x^{\ast} + b = \pm 1 $ (this is really the functional margin), and the (geometric) margin is then $ \frac{1}{\left \vert w \right \vert} $.

A quick note on notation:

  • $ w^T x = \left \langle w,x \right \rangle $ is called the inner product (or dot product)
  • $ \left \vert w \right \vert $ is the norm of the vector
  • $ \frac{w}{\left \vert w \right \vert } $ is the unit-length vector (its norm is 1)
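
A minimal NumPy sketch tying these pieces together — the classifier $ g(w^T x + b) $ and the geometric margin $ \frac{1}{\left \vert w \right \vert} $. The values of `w` and `b` below are made up purely for illustration:

```python
import numpy as np

# Hypothetical parameters of a linear classifier (illustration only).
w = np.array([2.0, -1.0])
b = 0.5

def f(x):
    """Discriminant function f(x) = w^T x + b."""
    return np.dot(w, x) + b

def g(z):
    """Classifier output: +1 if z >= 0, else -1."""
    return 1 if z >= 0 else -1

x = np.array([1.0, 3.0])
print("f(x) =", f(x), "-> predicted label:", g(f(x)))

# If x* is a support vector, f(x*) = ±1 and the geometric margin is 1 / |w|.
print("geometric margin =", 1.0 / np.linalg.norm(w))
```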

2. The Non-Linear Case

There is a straightforward way of turning a linear classifier non-linear, or making it applicable to non-vectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function $ \phi $. The discriminant function then is

$$ f(x) = w^T \phi(x) + b $$

Note that $ f(x) $ is linear in the feature space defined by the mapping $ \phi $; but when viewed in the original input space, it is a nonlinear function of $ x $ if $ \phi(x) $ is a nonlinear function.

This mapping $ \phi $ can get quite complicated (e.g. $ X = [x_1, x_2, x_3] $ with $ \phi(X) = [x_1^2, \cdots, x_3^2] $), which makes computing $ f(x) $ inconvenient.

Kernels, on the other hand, claim to solve exactly this:

Kernel methods avoid this complexity by avoiding the step of explicitly mapping the data to a high dimensional feature-space.

Next, let's see how kernels manage to do this.
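
As a concrete sanity check (my own toy example, not taken from the references): for the homogeneous degree-2 polynomial kernel, $ \left \langle x, x' \right \rangle^2 $ equals $ \left \langle \phi(x), \phi(x') \right \rangle $ where $ \phi $ lists all pairwise products $ x_i x_j $ — so the kernel produces the same number without ever building the 9-dimensional $ \phi(x) $:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 monomial map: all products x_i * x_j (9 dims for 3 features)."""
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel <x, z>^2, computed without phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])

print(np.dot(phi(x), phi(z)))   # explicit mapping: builds O(n^2) features
print(poly2_kernel(x, z))       # kernel trick: O(n) work, same value
```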

3. Enter Lagrange Duality

Let's ignore $ \phi $ for now.

Maximizing the margin means maximizing $ \frac{1}{\left \vert w \right \vert } $, which is the same as minimizing $ \left \vert w \right \vert $. So the problem can be written as:

$$ \begin{aligned} \min_{w,b} \quad & \frac{1}{2} \left \vert w \right \vert^2 \\ \text{s.t.} \quad & y^{(i)}(w^T x^{(i)} + b) \geq 1, \quad i = 1, \ldots, m \end{aligned} $$

Rewriting the constraints:

$$ \begin{aligned} \min_{w,b} \quad & \frac{1}{2} \left \vert w \right \vert^2 \\ \text{s.t.} \quad & g_i(w) = -y^{(i)}(w^T x^{(i)} + b) + 1 \leq 0, \quad i = 1, \ldots, m \end{aligned} $$

Yes! This is exactly the form of $ (\ref{eqopt}) $! Here $ g_i(w) = - y^{(i)}(w^T x^{(i)} + b) + 1 $ and there is no $ h_i(w) $, so in the standard Lagrangian $ L(x, \alpha, \beta) $, $ x $ is replaced by $ (w, b) $ and $ \beta $ is not needed, giving:

$$ L(w, b, \alpha) = \frac{1}{2} \left \vert w \right \vert^2 - \sum_{i=1}^{m} \alpha_i \left[ y^{(i)} \left( w^T x^{(i)} + b \right) - 1 \right] $$
Let's find the dual form of the problem. To do so, we need to first minimize $ L(w, b, \alpha) $ with respect to $ w $ and $ b $ (for fixed $ \alpha $), to get $ \theta_{D} $, which we'll do by setting the derivatives of $ L $ with respect to $ w $ and $ b $ to zero. We have:

$$ w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}, \qquad \sum_{i=1}^{m} \alpha_i y^{(i)} = 0 $$
I won't derive the rest of the dual problem, the KKT conditions, and so on. Substituting the expression for $ w $ above into the discriminant function gives:

$$ f(x) = \left( \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} \right)^T x + b = \sum_{i=1}^{m} \alpha_i y^{(i)} \left \langle x^{(i)}, x \right \rangle + b $$
Hence, if we've found the $ \alpha_i $'s, in order to make a prediction, we have to calculate a quantity that depends only on the inner product between $ x $ and the points in the training set. Moreover, we saw earlier that the $ \alpha_i $'s will all be 0 except for the support vectors. Thus, many of the terms in the sum above will be 0, and we really need to find only the inner products between $ x $ and the support vectors (of which there is often only a small number) in order to calculate the expression above and make our prediction.
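
A sketch of what this dual-form prediction looks like in code; the support vectors, $ \alpha_i $ values, and $ b $ below are made-up toy numbers rather than the result of actually solving the dual problem:

```python
import numpy as np

def predict(x, support_vectors, support_labels, alphas, b):
    """Dual-form prediction: f(x) = sum_i alpha_i * y_i * <x_i, x> + b,
    summed only over the support vectors (the points with alpha_i > 0)."""
    f = sum(a * y * np.dot(sv, x)
            for a, y, sv in zip(alphas, support_labels, support_vectors))
    return np.sign(f + b)

# Hypothetical toy values, purely for illustration.
support_vectors = [np.array([1.0, 1.0]), np.array([-1.0, -1.0])]
support_labels  = [+1, -1]
alphas          = [0.25, 0.25]
b               = 0.0

print(predict(np.array([2.0, 0.5]), support_vectors, support_labels, alphas, b))
```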

4. Kernels

Now let's bring $ \phi $ back into the picture. Analogously, when $ f(x) = w^T \phi(x) + b $, the same derivation gives:

$$ f(x) = \sum_{i=1}^{m} \alpha_i y^{(i)} \left \langle \phi(x^{(i)}), \phi(x) \right \rangle + b $$

This lets us define the kernel function as:

$$ K(x, x') = \left \langle \phi(x), \phi(x') \right \rangle $$
Besides the computational convenience mentioned at the end of the previous section, the kernel has another benefit: we no longer need to care what $ \phi $ actually is; we only need to design $ \left \langle \phi(x^{(i)}),\phi(x) \right \rangle $ itself. It is a bit like hiding the low-level implementation details.

Finally, I suspect the name "kernel" is meant in the sense of "the kernel of the discriminant function".
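
This "hide the details of $ \phi $" idea is exactly how libraries expose kernels. For instance, scikit-learn's `SVC` accepts a callable kernel that returns the Gram matrix between two sets of samples; a rough sketch (the toy data and the choice of kernel are arbitrary):

```python
import numpy as np
from sklearn import svm

def my_kernel(X, Y):
    """A user-defined kernel K(x, x') = (1 + <x, x'>)^2, passed to the SVM
    without ever writing down the corresponding feature map phi."""
    return (1.0 + X @ Y.T) ** 2

# Toy 2-D data: class +1 clustered near the origin, class -1 spread farther out.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(0.0, 2.0, size=(20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = svm.SVC(kernel=my_kernel).fit(X, y)
print(clf.predict(np.array([[0.1, 0.0], [3.0, 3.0]])))
```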

5. Kernel Examples

Popular choices for $ K $ in the SVM literature are:

  • linear kernel: $ K(x, x') = \left \langle x,x' \right \rangle $
    • equivalent to not using $ \phi $ at all, or using $ \phi(x) = x $
  • $ d $th-degree polynomial kernel:
    • homogeneous: $ K(x, x') = \left \langle x,x' \right \rangle^d $
    • inhomogeneous: $ K(x, x') = (1 + \left \langle x,x' \right \rangle)^d $
  • Gaussian kernel: $ K(x, x') = \exp(-\frac{\left \vert x-x' \right \vert ^2}{2 \sigma^2}) $
  • Radial basis kernel: $ K(x, x') = \exp(-\gamma \left \vert x-x' \right \vert ^2) $
  • Neural network kernel: $ K(x, x') = \tanh(k_1 \left \langle x,x' \right \rangle + k_2) $
    • $ \tanh $ is the hyperbolic tangent:
    • $ \sinh(x) = \frac{e^x - e^{-x}}{2} $
    • $ \cosh(x) = \frac{e^x + e^{-x}}{2} $
    • $ \tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}} $
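
Straightforward (unoptimized) implementations of these kernels for two vectors, just to make the formulas concrete; the hyperparameter defaults below are arbitrary, not recommendations:

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3, homogeneous=False):
    base = np.dot(x, y) if homogeneous else 1.0 + np.dot(x, y)
    return base ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def neural_network_kernel(x, y, k1=1.0, k2=0.0):
    return np.tanh(k1 * np.dot(x, y) + k2)

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
for k in (linear_kernel, polynomial_kernel, gaussian_kernel,
          rbf_kernel, neural_network_kernel):
    print(k.__name__, k(x, y))
```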

5.2 Kernels for Sequences

This is discussed on p. 12 of Support Vector Machines and Kernels for Computational Biology; here is a brief summary.

5.2.1 Kernels Describing $ \ell $-mer Content

What we need to do is map a sequence to a vector in feature space. One way to design the feature coding:

  • Consider all dimers, in ACGT order: $ x_1 $ is the count of AA, $ x_2 $ the count of AC, ..., $ x_{16} $ the count of TT
  • If we want to distinguish introns from exons, we can instead use: $ x_1 $ = count of intronic AA, ..., $ x_{16} $ = count of intronic TT, $ x_{17} $ = count of exonic AA, ..., $ x_{32} $ = count of exonic TT
  • For example, for an intronic sequence ACT, only the intronic AC and intronic CT entries are 1 and all other entries are 0. Such a vector is called the spectrum of the sequence

Let $ \Phi_{\ell}^{spectrum}(x) $ denote the function that maps a sequence to its $ \ell $-mer spectrum; this gives the spectrum kernel:

$$ K_{\ell}^{spectrum}(x, x') = \left \langle \Phi_{\ell}^{spectrum}(x), \Phi_{\ell}^{spectrum}(x') \right \rangle $$

Since the spectrum kernel allows no mismatches, when $ \ell $ is sufficiently long the chance of observing common occurrences becomes small and the kernel will no longer perform well. This problem is alleviated if we use the mixed spectrum kernel:

$$ K^{mixed\,spectrum}(x, x') = \sum_{d=1}^{\ell} \beta_d \left \langle \Phi_{d}^{spectrum}(x), \Phi_{d}^{spectrum}(x') \right \rangle $$

where $ \beta_d $ is a weighting for the different substring lengths.
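
A small Python sketch of the spectrum feature map and the two kernels above. The example sequences and $ \beta_d $ weights are arbitrary, and the intron/exon split is omitted for simplicity:

```python
from collections import Counter
from itertools import product

def spectrum(seq, ell):
    """Map a DNA sequence to its ell-mer spectrum: a vector of substring counts,
    indexed by all 4^ell possible ell-mers in ACGT order."""
    counts = Counter(seq[i:i + ell] for i in range(len(seq) - ell + 1))
    return [counts[''.join(kmer)] for kmer in product('ACGT', repeat=ell)]

def spectrum_kernel(x, y, ell):
    """K(x, x') = <Phi_ell(x), Phi_ell(x')>: shared ell-mer occurrences."""
    return sum(a * b for a, b in zip(spectrum(x, ell), spectrum(y, ell)))

def mixed_spectrum_kernel(x, y, ell, betas):
    """Weighted sum of spectrum kernels over substring lengths d = 1..ell."""
    return sum(betas[d - 1] * spectrum_kernel(x, y, d) for d in range(1, ell + 1))

print(spectrum_kernel("ACGTAC", "ACTTAC", 2))
print(mixed_spectrum_kernel("ACGTAC", "ACTTAC", 3, betas=[0.5, 0.3, 0.2]))
```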

5.2.2 Kernels Using Positional Information

Analogous to Position Weight Matrices (PWMs), the idea is to analyze sequences of fixed length $ L $ and consider substrings starting at each position $ l = 1, \cdots, L $ separately, as implemented by the so-called weighted degree (WD) kernel:

$$ K^{WD}(x, x') = \sum_{d=1}^{\ell} \beta_d \sum_{l=1}^{L-d+1} \mathbb{1} \left( x_{[l:l+d]} = x'_{[l:l+d]} \right) $$

where $ x_{[l:l+d]} $ is the substring of length $ d $ at position $ l $, and $ \mathbb{1}(\cdot) $ is 1 if its argument is true and 0 otherwise. A suggested setting for $ \beta_d $ is $ \beta_d = \frac{2(\ell-d+1)}{\ell^2 + \ell} $.
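
A rough sketch of the WD kernel with the suggested $ \beta_d $ weighting, under the assumption that mismatching substrings simply contribute 0; the example sequences are made up:

```python
def wd_kernel(x, y, ell):
    """Weighted degree kernel: compare substrings of length d = 1..ell starting at
    each position l, weighted by beta_d = 2 * (ell - d + 1) / (ell^2 + ell)."""
    assert len(x) == len(y), "the WD kernel assumes sequences of equal length L"
    L = len(x)
    total = 0.0
    for d in range(1, ell + 1):
        beta_d = 2.0 * (ell - d + 1) / (ell ** 2 + ell)
        total += beta_d * sum(x[l:l + d] == y[l:l + d] for l in range(L - d + 1))
    return total

print(wd_kernel("ACGTACGT", "ACGAACGT", ell=3))
```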


