# Bayesian Interpretation for Ridge Regression and the Lasso + Exercise 7

Yao Yao on October 2, 2014

## 1. What Is Bayes

### 1.1. Dictionary

• Bayes: [ˈbeɪz]
• a priori: [ˌɑpriˈɔri], from Latin a priori (“former”), literally “from the former”.
• (logic) Based on hypothesis rather than experiment
• The Chinese translation “先验的” presumably carries the sense of “before experiment”.
• A priori knowledge or justification is independent of experience (for example “All bachelors are unmarried”). It has a flavor of “self-evident, needing no proof”.
• Presumed without analysis
• One assumes, a priori, that a parent would be better at dealing with problems (that is, one takes it for granted).
• a posteriori: [ˌɑpɒsteriˈɔ:rɪ] or [ˌeɪpɒsteriˈɔ:raɪ], from Latin a posteriori (“latter”), literally “from the latter”.
• Relating to or derived by reasoning from observed facts; Empirical
• A posteriori knowledge or justification is dependent on experience or empirical evidence (for example “Some bachelors I have met are very happy”).
• What Locke calls “knowledge” they have called “a priori knowledge”; what he calls “opinion” or “belief” they have called “a posteriori” or “empirical knowledge”.
• prior: [ˈpraɪə(r)]
• posterior: [pɒˈstɪəriə(r)]

### 1.3 A Restatement of Bayes’ Theorem

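• $B$: the ratio of black to white balls in the bag takes some particular value
• $A$: without knowing that ratio, we drew xxx balls, of which yyy were white and zzz were black
• $p(B \vert A)$ is the posterior (probability) distribution
• the probability of $B$ posterior to (after) the observation of $A$
• Note that we do not call $p(B \vert A)$ a posterior probability here. Strictly speaking, $p(B \vert A)$ is a probability function, which by definition is a distribution rather than a single probability value. Reading it as a probability value is still acceptable. The same applies to the prior.
• $p(A \vert B)$ is the likelihood
• conversely, given that $B$ holds, how likely is $A$ to happen?
• Judging from 1.4, it may not be accurate to call this the likelihood outright; to be investigated
• $p(B)$ is the prior (probability) distribution
• prior to (before) any observation, what is the chance of $B$?
• $p(A)$ is the probability of the evidence
• $A$ has already happened; it is a fact, and it is the evidence from which we infer $B$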
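
To make the roles of these four quantities concrete, here is a minimal sketch of the bag-of-balls setting using Bayes’ theorem, $p(B \vert A) = p(A \vert B) \, p(B) / p(A)$. It rests on assumptions that are not in the notes above: only three candidate white-ball fractions are considered, the prior over them is uniform, and balls are drawn with replacement (so $p(A \vert B)$ is binomial). The function name `posterior_over_white_fraction` is hypothetical.

```python
from math import comb

def posterior_over_white_fraction(fractions, prior, n_white, n_black):
    """Return p(B | A) for each candidate white-ball fraction B.

    Assumes draws with replacement, so the likelihood is binomial.
    """
    n = n_white + n_black
    # Likelihood p(A | B): probability of the observed draw under each fraction
    likelihood = [comb(n, n_white) * w ** n_white * (1 - w) ** n_black
                  for w in fractions]
    # Unnormalized posterior: p(A | B) * p(B)
    unnormalized = [lik * p for lik, p in zip(likelihood, prior)]
    # Evidence p(A): sum over all hypotheses B
    evidence = sum(unnormalized)
    return [u / evidence for u in unnormalized]

# Hypothetical setup: three candidate fractions, uniform prior,
# and an observation A of 3 white balls and 1 black ball.
fractions = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = posterior_over_white_fraction(fractions, prior, n_white=3, n_black=1)
```

After seeing mostly white balls, the posterior shifts weight toward the larger white fractions, while still summing to one because it is divided by the evidence $p(A)$.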

$\propto$ is read “is proportional to” or “varies as”.

$y \propto x$ simply means that $y = kx$ for some constant $k$. (Symbol explanation taken from List of mathematical symbols.)

### 1.4 Application to Regression

• $Y$: the observed data
• $\theta$: the parameters
• $P(Y \vert \theta)$: the joint distribution of the sample, which is proportional to the likelihood function
• $P(\theta)$: the prior distribution of the parameters
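
In this notation, Bayes’ theorem presumably takes the following form. This is a sketch of the standard statement; $P(\theta \vert Y)$ (the posterior of the parameters) and $P(Y)$ (the evidence) are symbols not listed above.

$$
P(\theta \vert Y) = \frac{P(Y \vert \theta) \, P(\theta)}{P(Y)} \propto P(Y \vert \theta) \, P(\theta)
$$

Because $P(Y)$ does not depend on $\theta$, it acts only as a normalizing constant, which is why the posterior can be written with $\propto$.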

## 2. Bayesian Interpretation for Ridge Regression and the Lasso

• $p(Y \vert X,\beta)$ is the joint distribution over outputs $Y$ given inputs $X$ and the parameters $\beta$.
• The likelihood of any fixed parameter vector $\beta$ is $L(\beta \vert X) = p(Y \vert X,\beta)$

## 3. Exercise 7

We will now derive the Bayesian connection to the lasso and ridge regression discussed in Section 6.2.2.

### (a) Question

Suppose that $y_i = \beta_0 + \sum_{j=1}^{p}{x_{ij} \beta_j} + \epsilon_i$ where $\epsilon_1, \cdots, \epsilon_n$ are independent and identically distributed from a $N(0, \sigma^2)$ distribution. Write out the likelihood for the data.

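Write $\epsilon_i = y_i - \left(\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\right)$ for the $i$-th error term. Because the $\epsilon_i$ are independent $N(0, \sigma^2)$ variables, the likelihood for the data is the product of their densities:

$$
\begin{aligned}
p(Y \vert X, \beta) &= \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right) \\
&= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2\right)
\end{aligned}
$$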
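
As a rough numerical illustration, the logarithm of this Gaussian likelihood, $-n \log(\sigma\sqrt{2\pi}) - \frac{1}{2\sigma^2} \sum_{i} \epsilon_i^2$ with $\epsilon_i = y_i - \beta_0 - \sum_{j} x_{ij} \beta_j$, can be evaluated in plain Python. The function name `gaussian_log_likelihood` and the toy data are made up for this sketch.

```python
import math

def gaussian_log_likelihood(X, y, beta0, beta, sigma):
    """Log-likelihood of y given X under y_i = beta0 + x_i . beta + eps_i,
    with eps_i i.i.d. N(0, sigma^2)."""
    n = len(y)
    rss = 0.0
    for x_i, y_i in zip(X, y):
        # Residual eps_i = y_i - (beta0 + sum_j x_ij * beta_j)
        eps = y_i - (beta0 + sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta)))
        rss += eps ** 2
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - rss / (2 * sigma ** 2)

# Hypothetical toy data generated exactly by y = 1 + 2x
X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.0, 5.0]

ll_true = gaussian_log_likelihood(X, y, beta0=1.0, beta=[2.0], sigma=1.0)
ll_wrong = gaussian_log_likelihood(X, y, beta0=1.0, beta=[1.0], sigma=1.0)
```

The coefficients that reproduce the data have zero residuals and therefore the higher log-likelihood; the wrong slope is penalized by its residual sum of squares divided by $2\sigma^2$.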

### (b) Question

Assume the following prior for $\beta$: $\beta_1, \cdots, \beta_p$ are independent and identically distributed according to a double-exponential distribution with mean 0 and common scale parameter $b$: i.e. $p(\beta) = \frac{1}{2b} \exp\left(-\frac{\lvert \beta \rvert}{b}\right)$. Write out the posterior for $\beta$ in this setting.

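By Bayes’ theorem, the posterior is proportional to the likelihood times the prior. With the double-exponential (Laplace) prior with mean 0 and common scale parameter $b$, i.e. $p(\beta_j) = \frac{1}{2b}\exp(- \lvert \beta_j \rvert / b)$ independently for each $j$, the posterior is:

$$
p(\beta \vert X, Y) \propto p(Y \vert X, \beta) \, p(\beta \vert X) = p(Y \vert X, \beta) \, p(\beta)
$$

Substituting the likelihood from (a), with $\epsilon_i = y_i - \left(\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\right)$, and the prior density gives us:

$$
\begin{aligned}
p(Y \vert X, \beta) \, p(\beta) &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2\right) \left(\frac{1}{2b}\right)^p \exp\left(-\frac{1}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert\right) \\
&= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{2b}\right)^p \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 - \frac{1}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert\right)
\end{aligned}
$$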

### (c) Question

Argue that the lasso estimate is the mode for $\beta$ under this posterior distribution.

Showing that the Lasso estimate for $\beta$ is the mode under this posterior distribution is the same thing as showing that the most likely value for $\beta$ is given by the lasso solution with a certain $\lambda$.

We can do this by taking our likelihood and posterior and showing that it can be reduced to the canonical Lasso Equation 6.7 from the book.

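Let’s start by taking the logarithm of the unnormalized posterior $p(Y \vert X, \beta) \, p(\beta)$, where $\epsilon_i = y_i - \left(\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\right)$:

$$
\log\left[p(Y \vert X, \beta) \, p(\beta)\right] = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{2b}\right)^p\right] - \left(\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert\right)
$$

We want to maximize the posterior, which means:

$$
\arg\max_{\beta} \; p(\beta \vert X, Y) = \arg\max_{\beta} \left\{ \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{2b}\right)^p\right] - \left(\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert\right) \right\}
$$

The first term does not depend on $\beta$, so maximizing this difference is equivalent to minimizing the second term with respect to $\beta$. This results in:

$$
\arg\min_{\beta} \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert \right\} = \arg\min_{\beta} \; \frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} \epsilon_i^2 + \frac{2\sigma^2}{b} \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

By letting $\lambda = \frac{2\sigma^2}{b}$, and noting that the positive factor $\frac{1}{2\sigma^2}$ does not change the minimizer, we can see that we end up with:

$$
\arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\} = \arg\min_{\beta} \left\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

which we know is the Lasso from Equation 6.7 in the book. Thus when the prior on $\beta$ is a Laplace distribution with mean zero and common scale parameter $b$, the posterior mode for $\beta$ is given by the Lasso solution with $\lambda = \frac{2\sigma^2}{b}$.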
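
The correspondence $\lambda = \frac{2\sigma^2}{b}$ can be checked numerically in the simplest case. The sketch below assumes a single predictor and no intercept (a simplification not made in the exercise), uses made-up data, and compares the closed-form one-predictor lasso solution (soft-thresholding) with a brute-force search for the minimizer of the negative log posterior $\frac{1}{2\sigma^2}\mathrm{RSS} + \frac{1}{b}\lvert \beta \rvert$. The helper names are hypothetical.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; values inside [-t, t] become zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def neg_log_posterior(beta, x, y, sigma, b):
    """RSS / (2 sigma^2) + |beta| / b: the negative log posterior up to a constant."""
    rss = sum((y_i - beta * x_i) ** 2 for x_i, y_i in zip(x, y))
    return rss / (2 * sigma ** 2) + abs(beta) / b

# Hypothetical toy data: one predictor, no intercept
x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
sigma, b = 1.0, 0.5
lam = 2 * sigma ** 2 / b  # lambda implied by the Laplace prior

# Closed-form minimizer of RSS + lam * |beta| for a single predictor
sxy = sum(x_i * y_i for x_i, y_i in zip(x, y))
sxx = sum(x_i ** 2 for x_i in x)
beta_lasso = soft_threshold(sxy, lam / 2) / sxx

# Posterior mode found by grid search over beta in [-2, 2] with step 0.001
grid = [i / 1000 for i in range(-2000, 2001)]
beta_map = min(grid, key=lambda beta: neg_log_posterior(beta, x, y, sigma, b))
```

Up to the grid resolution, `beta_map` and `beta_lasso` coincide, which is what the derivation predicts for this choice of $\lambda$.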

### (d) Question

Now assume the following prior for $\beta$: $\beta_1, \cdots, \beta_p$ are independent and identically distributed according to a normal distribution with mean zero and variance $c$. Write out the posterior for $\beta$ in this setting.

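By Bayes’ theorem, the posterior under a normal prior with mean 0 and variance $c$ is:

$$
p(\beta \vert X, Y) \propto p(Y \vert X, \beta) \, p(\beta \vert X) = p(Y \vert X, \beta) \, p(\beta)
$$

Our prior density then becomes:

$$
p(\beta) = \prod_{j=1}^{p} p(\beta_j) = \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi c}} \exp\left(-\frac{\beta_j^2}{2c}\right) = \left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right)
$$

Substituting the likelihood from (a), with $\epsilon_i = y_i - \left(\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\right)$, and this density gives us:

$$
\begin{aligned}
p(Y \vert X, \beta) \, p(\beta) &= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2\right) \left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right) \\
&= \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{\sqrt{2\pi c}}\right)^p \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 - \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right)
\end{aligned}
$$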

### (e) Question

Argue that the ridge regression estimate is both the mode and the mean for $\beta$ under this posterior distribution.

As in part (c), showing that the Ridge Regression estimate for $\beta$ is the mode under this posterior distribution is the same thing as showing that the most likely value for $\beta$ is given by the Ridge Regression solution with a certain $\lambda$. That it is also the mean then follows from the shape of the posterior.

We can do this by taking our likelihood and posterior and showing that it can be reduced to the canonical Ridge Regression Equation 6.5 from the book.

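Once again, we can take the logarithm of the unnormalized posterior $p(Y \vert X, \beta) \, p(\beta)$ to simplify it, where $\epsilon_i = y_i - \left(\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j\right)$:

$$
\log\left[p(Y \vert X, \beta) \, p(\beta)\right] = \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{\sqrt{2\pi c}}\right)^p\right] - \left(\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right)
$$

We want to maximize the posterior, which means:

$$
\arg\max_{\beta} \; p(\beta \vert X, Y) = \arg\max_{\beta} \left\{ \log\left[\left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \left(\frac{1}{\sqrt{2\pi c}}\right)^p\right] - \left(\frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2\right) \right\}
$$

The first term does not depend on $\beta$, so maximizing this difference is equivalent to minimizing the second term with respect to $\beta$. This results in:

$$
\arg\min_{\beta} \left\{ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \epsilon_i^2 + \frac{1}{2c} \sum_{j=1}^{p} \beta_j^2 \right\} = \arg\min_{\beta} \; \frac{1}{2\sigma^2} \left\{ \sum_{i=1}^{n} \epsilon_i^2 + \frac{\sigma^2}{c} \sum_{j=1}^{p} \beta_j^2 \right\}
$$

By letting $\lambda = \frac{\sigma^2}{c}$, and noting that the positive factor $\frac{1}{2\sigma^2}$ does not change the minimizer, we end up with:

$$
\arg\min_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} = \arg\min_{\beta} \left\{ \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
$$

which we know is the Ridge Regression from Equation 6.5 in the book. Thus when the prior on $\beta$ is a normal distribution with mean zero and variance $c$, the posterior mode for $\beta$ is given by the Ridge Regression solution with $\lambda = \frac{\sigma^2}{c}$. The exponent of the posterior is quadratic in $\beta$, so the posterior is Gaussian; a Gaussian is symmetric about its mode, so the Ridge Regression solution is also the posterior mean.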
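
The claim that the ridge estimate is both the posterior mode and the posterior mean can be checked numerically in a simplified setting. The sketch below assumes a single predictor and no intercept (not the general case of the exercise) and uses made-up data. It compares the closed-form one-predictor ridge solution $\sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$ with the mode and mean of the posterior evaluated on a grid.

```python
import math

# Hypothetical toy data: one predictor, no intercept
x = [1.0, 2.0, 3.0]
y = [1.1, 1.9, 3.2]
sigma, c = 1.0, 1.0
lam = sigma ** 2 / c  # lambda implied by the normal prior

# Closed-form minimizer of RSS + lam * beta^2 for a single predictor
sxy = sum(x_i * y_i for x_i, y_i in zip(x, y))
sxx = sum(x_i ** 2 for x_i in x)
beta_ridge = sxy / (sxx + lam)

def neg_log_posterior(beta):
    """RSS / (2 sigma^2) + beta^2 / (2c): the negative log posterior up to a constant."""
    rss = sum((y_i - beta * x_i) ** 2 for x_i, y_i in zip(x, y))
    return rss / (2 * sigma ** 2) + beta ** 2 / (2 * c)

# Evaluate the posterior on a grid over [-4, 6] with step 0.001
grid = [i / 1000 for i in range(-4000, 6001)]
nlp = [neg_log_posterior(beta) for beta in grid]
nlp_min = min(nlp)

# Posterior mode: grid point with the smallest negative log posterior
beta_mode = grid[nlp.index(nlp_min)]

# Posterior mean: weighted average with weights proportional to the posterior
# (subtracting nlp_min keeps the exponentials well scaled)
weights = [math.exp(-(v - nlp_min)) for v in nlp]
beta_mean = sum(beta * w for beta, w in zip(grid, weights)) / sum(weights)
```

The mode matches `beta_ridge` up to the grid resolution, and the numerically integrated mean matches it far more closely, consistent with the posterior being Gaussian and hence symmetric about its mode.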