# Support Vector Machines and Kernels

Yao Yao on December 5, 2014

## 1. Intro

To start with, we will consider a linear classifier for a binary classification problem with labels $y \in \{-1, 1\}$ and features $x$. We will use parameters $w$, $b$ instead of $\theta$, and write our classifier as

$$h_{w,b}(x) = g(w^T x + b)$$

Here, $g(z) = 1$ if $z \geq 0$, and $g(z) = -1$ otherwise.

The discriminant function is $f(x) = w^T x + b$.

$f(x) = 0$ defines our hyperplane; I'll skip the geometric intuition here.

• $w^T x = \left \langle w,x \right \rangle$ is called the inner product or dot product
• $\left \Vert w \right \Vert$ is called the norm of the vector
• $\frac{w}{\left \Vert w \right \Vert}$ is called the unit-length vector (a vector of norm 1)
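The classifier above can be sketched in a few lines; this is a minimal illustration of $h_{w,b}(x) = g(w^T x + b)$ with hypothetical parameter values, not code from the text:

```python
import numpy as np

def g(z):
    """Threshold function: +1 if z >= 0, -1 otherwise."""
    return np.where(z >= 0, 1, -1)

def predict(w, b, x):
    """Linear classifier h_{w,b}(x) = g(w^T x + b)."""
    return g(np.dot(w, x) + b)

# Hypothetical parameters and a test point
w = np.array([2.0, -1.0])
b = -1.0
x = np.array([1.0, 0.5])
print(predict(w, b, x))  # w^T x + b = 2.0 - 0.5 - 1.0 = 0.5 >= 0, so +1
```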

## 2. The Non-Linear Case

There is a straightforward way of turning a linear classifier non-linear, or making it applicable to non-vectorial data. It consists of mapping our data to some vector space, which we will refer to as the feature space, using a function $\phi$. The discriminant function then is

$$f(x) = w^T \phi(x) + b$$

Note that $f(x)$ is linear in the feature space defined by the mapping $\phi$; but viewed in the original input space, it is a nonlinear function of $x$ whenever $\phi$ is nonlinear.

Kernel methods avoid this complexity by skipping the step of explicitly mapping the data to a high-dimensional feature space.
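To make the kernel trick concrete, here is a small sketch (my own illustration, not from the text): for 2-d inputs, the homogeneous degree-2 polynomial kernel $\langle x, z \rangle^2$ computed in the input space equals the inner product of the explicitly mapped points under $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = <x, z>^2."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
explicit = np.dot(phi(x), phi(z))   # maps to the 3-d feature space first
implicit = poly2_kernel(x, z)       # never leaves the 2-d input space
print(np.isclose(explicit, implicit))  # True
```

The kernel evaluates the same inner product without ever forming $\phi(x)$; for high-degree polynomials or infinite-dimensional maps (like the Gaussian kernel), this is the only practical option.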

## 3. Enter Lagrange Duality

Yes, $(\ref{eqopt})$ shows up! Here $g_i(w) = - y^{(i)}(w^T x^{(i)} + b) + 1$, and there are no $h_i(w)$, so in the standard Lagrangian $L(x, \alpha, \beta)$, $x$ is replaced by $(w, b)$ and $\beta$ is not needed, giving:

$$L(w, b, \alpha) = \frac{1}{2} \left \Vert w \right \Vert^2 - \sum_{i=1}^{m} \alpha_i \left [ y^{(i)} (w^T x^{(i)} + b) - 1 \right ]$$

Let's find the dual form of the problem. To do so, we first minimize $L(w, b, \alpha)$ with respect to $w$ and $b$ (for fixed $\alpha$) to get $\theta_{D}$, which we do by setting the derivatives of $L$ with respect to $w$ and $b$ to zero. We have:

$$\nabla_w L(w, b, \alpha) = w - \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)} = 0 \quad \Rightarrow \quad w = \sum_{i=1}^{m} \alpha_i y^{(i)} x^{(i)}$$

$$\frac{\partial}{\partial b} L(w, b, \alpha) = \sum_{i=1}^{m} \alpha_i y^{(i)} = 0$$

Hence, if we've found the $\alpha_i$'s, in order to make a prediction we only have to calculate a quantity that depends on the inner product between $x$ and the points in the training set. Moreover, we saw earlier that the $\alpha_i$'s will all be 0 except for the support vectors. Thus, many of the terms in the sum above will be 0, and we really need to find only the inner products between $x$ and the support vectors (of which there are often only a few) in order to calculate $(\ref{eq7})$ and make our prediction.
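A sketch of this prediction rule, assuming the $\alpha_i$'s and $b$ have already been found by some solver (the values below are made up for illustration):

```python
import numpy as np

def svm_predict(alphas, ys, xs, b, x, kernel):
    """f(x) = b + sum_i alpha_i * y_i * K(x_i, x); training points with
    alpha_i = 0 (non-support vectors) contribute nothing and are skipped."""
    f = b
    for a_i, y_i, x_i in zip(alphas, ys, xs):
        if a_i > 0:
            f += a_i * y_i * kernel(x_i, x)
    return 1 if f >= 0 else -1

linear = lambda u, v: float(np.dot(u, v))

# Hypothetical solved values: only the first and third points are support vectors
xs = [np.array([1.0, 1.0]), np.array([2.0, 2.0]), np.array([-1.0, -1.0])]
ys = [1, 1, -1]
alphas = [0.5, 0.0, 0.5]
b = 0.0
print(svm_predict(alphas, ys, xs, b, np.array([1.0, 0.0]), linear))
```

Because only the support vectors enter the sum, prediction cost scales with the number of support vectors, not the size of the training set.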

## 5. Kernel Examples

Popular choices for $K$ in the SVM literature are:

• linear kernel: $K(x, x') = \left \langle x,x' \right \rangle$
• equivalent to using no $\phi$ at all, i.e. $\phi(x) = x$
• $d$th-degree polynomial kernel:
• homogeneous: $K(x, x') = \left \langle x,x' \right \rangle^d$
• inhomogeneous: $K(x, x') = (1 + \left \langle x,x' \right \rangle)^d$
• Gaussian kernel: $K(x, x') = \exp(-\frac{\left \Vert x-x' \right \Vert^2}{2 \sigma^2})$
• Radial basis kernel: $K(x, x') = \exp(-\gamma \left \Vert x-x' \right \Vert^2)$
• Neural network kernel: $K(x, x') = \tanh(k_1 \left \langle x,x' \right \rangle + k_2)$
• $\tanh$ is the hyperbolic tangent:
• $\sinh(x) = \frac{e^x - e^{-x}}{2}$
• $\cosh(x) = \frac{e^x + e^{-x}}{2}$
• $\tanh(x) = \frac{\sinh(x)}{\cosh(x)} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$

### 5.2 Kernels for Sequences

This is covered on p. 12 of *Support Vector Machines and Kernels for Computational Biology*; here is a brief summary.

#### 5.2.1 Kernels Describing $\ell$-mer Content

• Consider all dimers in ACGT order: $x_1$ is the count of AA, $x_2$ the count of AC, ..., $x_{16}$ the count of TT
• To distinguish introns from exons, the vector can be designed as: $x_1$ counts intronic AA, ..., $x_{16}$ counts intronic TT, $x_{17}$ counts exonic AA, ..., $x_{32}$ counts exonic TT
• For example, for an intronic sequence ACT, only the intronic AC and intronic CT entries are 1 and everything else is 0. Such a vector is called the spectrum of the sequence
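Computing the dimer spectrum is just counting overlapping length-2 substrings; a minimal sketch (ignoring the intron/exon split):

```python
from itertools import product

ALPHABET = "ACGT"
DIMERS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # AA, AC, ..., TT (16 entries)

def dimer_spectrum(seq):
    """Count each overlapping dimer; the 16-d count vector is the sequence's spectrum."""
    counts = dict.fromkeys(DIMERS, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    return [counts[d] for d in DIMERS]

spec = dimer_spectrum("ACT")  # dimers present: AC and CT, each once
```

The spectrum kernel is then simply the inner product of two such count vectors.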

Since the spectrum kernel allows no mismatches, when $\ell$ is sufficiently long the chance of observing common occurrences becomes small and the kernel will no longer perform well. This problem is alleviated if we use the mixed spectrum kernel:

$$K(x, x') = \sum_{d=1}^{\ell} \beta_d \, K_d^{\mathrm{spectrum}}(x, x')$$

where $\beta_d$ is a weighting for the different substring lengths.
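A sketch of the mixed spectrum kernel as a weighted sum of plain spectrum kernels over substring lengths $d = 1, \dots, \ell$ (the uniform weighting below is just for illustration):

```python
from collections import Counter

def spectrum(seq, d):
    """Counts of all overlapping length-d substrings of seq."""
    return Counter(seq[i:i + d] for i in range(len(seq) - d + 1))

def spectrum_kernel(x, xp, d):
    """Inner product of the two d-mer count vectors."""
    sx, sxp = spectrum(x, d), spectrum(xp, d)
    return sum(sx[u] * sxp[u] for u in sx)

def mixed_spectrum_kernel(x, xp, ell, beta):
    """Weighted sum of spectrum kernels over substring lengths d = 1..ell."""
    return sum(beta(d) * spectrum_kernel(x, xp, d) for d in range(1, ell + 1))

k = mixed_spectrum_kernel("ACGT", "ACGA", 3, beta=lambda d: 1.0)
# d=1 contributes 4, d=2 contributes 2 (AC, CG), d=3 contributes 1 (ACG): total 7
```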

#### 5.2.2 Kernels Using Positional Information

Analogous to Position Weight Matrices (PWMs), the idea is to analyze sequences of fixed length $L$ and consider substrings starting at each position $l = 1, \cdots, L$ separately, as implemented by the so-called weighted degree (WD) kernel:

$$K(x, x') = \sum_{d=1}^{\ell} \beta_d \sum_{l=1}^{L-d+1} \mathbf{1} \left ( x_{[l:l+d]} = x'_{[l:l+d]} \right )$$

where $x_{[l:l+d]}$ is the substring of length $d$ at position $l$. A suggested setting for $\beta_d$ is $\beta_d = \frac{2(\ell-d+1)}{\ell^2 + \ell}$.
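The WD kernel with the suggested $\beta_d$ can be sketched directly (using 0-based indexing for positions, unlike the 1-based notation above):

```python
def wd_kernel(x, xp, ell):
    """Weighted degree kernel: for each substring length d = 1..ell, count
    positions where x and x' share the same length-d substring, weighted by
    beta_d = 2(ell - d + 1) / (ell^2 + ell)."""
    L = len(x)
    assert len(xp) == L, "WD kernel assumes sequences of equal fixed length"
    k = 0.0
    for d in range(1, ell + 1):
        beta_d = 2 * (ell - d + 1) / (ell ** 2 + ell)
        for l in range(L - d + 1):
            if x[l:l + d] == xp[l:l + d]:
                k += beta_d
    return k

# "ACGT" vs "ACGA": 3 matching positions at d=1, 2 matching substrings at d=2
k = wd_kernel("ACGT", "ACGA", 2)  # 3 * (2/3) + 2 * (1/3) = 8/3
```

Unlike the spectrum kernel, matches only count when the substrings occur at the same position, which is what makes this kernel position-aware.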