ISL: Classification - Conceptual Exercises

Yao Yao on September 25, 2014

Exercises from Chapter 4 of An Introduction to Statistical Learning.

Exercise 2

(P139) It was stated in the text that classifying an observation to the class for which (4.12) is largest is equivalent to classifying an observation to the class for which (4.13) is largest. Prove that this is the case. In other words, under the assumption that the observations in the $k$th class are drawn from a $N(\mu_k, \sigma^2)$ distribution, the Bayes classifier assigns an observation to the class for which the discriminant function is maximized.

Assuming that $f_k(x)$ is normal, the posterior probability (4.12) that an observation $x$ belongs to class $k$ is given by

$$p_k(x) = \frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)},$$

while the discriminant function (4.13) is given by

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k).$$

Claim: Maximizing $p_k(x)$ is equivalent to maximizing $\delta_k(x)$.

Proof:

Let $x$ remain fixed and observe that we are maximizing over the parameter $k$. Suppose that $\delta_k(x) \geq \delta_i(x)$. We will show that $p_k(x) \geq p_i(x)$.

From our assumption we have

$$x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k) \geq x \cdot \frac{\mu_i}{\sigma^2} - \frac{\mu_i^2}{2\sigma^2} + \log(\pi_i).$$

Exponentiation is a monotonically increasing function, so the following inequality holds

$$\pi_k \exp\left(\frac{x\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2}\right) \geq \pi_i \exp\left(\frac{x\mu_i}{\sigma^2} - \frac{\mu_i^2}{2\sigma^2}\right).$$

Multiply this inequality by the positive constant

$$c = \frac{\frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{x^2}{2\sigma^2}\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$$

and we have that

$$\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)} \geq \frac{\pi_i \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_i)^2\right)}{\sum_{l=1}^{K} \pi_l \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)},$$

or equivalently, $p_k(x) \geq p_i(x)$.

Each step above is reversible, so maximizing $\delta_k$ is equivalent to maximizing $p_k$.
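
The equivalence can also be checked numerically. Below is a minimal sketch in plain Python with hypothetical parameter values (the means, priors, and $\sigma$ are made up for illustration); for each test point, the class with the largest posterior $p_k(x)$ is also the class with the largest discriminant $\delta_k(x)$:

```python
import math

# Hypothetical one-dimensional LDA setup: three classes with a shared sigma.
mus = [-1.0, 0.5, 2.0]   # class means mu_k
pis = [0.3, 0.5, 0.2]    # class priors pi_k (sum to 1)
sigma = 1.2              # shared standard deviation

def posterior(x):
    """p_k(x) from (4.12): normal density weighted by the prior, normalized."""
    w = [pi * math.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
         for pi, mu in zip(pis, mus)]
    s = sum(w)
    return [v / s for v in w]

def discriminant(x):
    """delta_k(x) from (4.13)."""
    return [x * mu / sigma ** 2 - mu ** 2 / (2 * sigma ** 2) + math.log(pi)
            for pi, mu in zip(pis, mus)]

for x in (-2.0, 0.0, 0.7, 3.0):
    p, d = posterior(x), discriminant(x)
    # Both criteria pick the same class for every x tried.
    assert p.index(max(p)) == d.index(max(d))
```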

Exercise 3

This problem relates to the QDA model, in which the observations within each class are drawn from a normal distribution with a class-specific mean vector and a class-specific covariance matrix. We consider the simple case where $p = 1$; i.e. there is only one feature.

Suppose that we have $K$ classes, and that if an observation belongs to the $k$th class then $X$ comes from a one-dimensional normal distribution, $X \sim N(\mu_k, \sigma^2_k)$. Recall that the density function for the one-dimensional normal distribution is given in (4.11). Prove that in this case, the Bayes classifier is not linear. Argue that it is in fact quadratic.

Hint: For this problem, you should follow the arguments laid out in Section 4.4.2, but without making the assumption that $\sigma^2_1 = \cdots = \sigma^2_K$.
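
A sketch of the solution, following the discriminant-function argument of Section 4.4.2 but keeping a class-specific variance $\sigma^2_k$: taking the log of $\pi_k f_k(x)$ and dropping the terms that do not depend on $k$ gives

$$\delta_k(x) = -\frac{(x - \mu_k)^2}{2\sigma^2_k} - \log \sigma_k + \log \pi_k = -\frac{1}{2\sigma^2_k}x^2 + \frac{\mu_k}{\sigma^2_k}x - \frac{\mu_k^2}{2\sigma^2_k} - \log \sigma_k + \log \pi_k.$$

Because the coefficient $-\frac{1}{2\sigma^2_k}$ of $x^2$ now depends on $k$, the quadratic terms no longer cancel when comparing two classes, so the Bayes decision boundary $\delta_k(x) = \delta_i(x)$ is a quadratic equation in $x$: the classifier is quadratic, not linear.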

Exercise 4

When the number of features $p$ is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality, and it ties into the fact that non-parametric approaches often perform poorly when $p$ is large. We will now investigate this curse.

When $p = 1$, we assume that $X$ is uniformly (evenly) distributed on $[0, 1]$. For a given test point $x$, we use $[x - 0.05, x + 0.05]$ as its neighborhood, so on average about 10% of the observations will be used to make the prediction.

When $p = 2$, we assume that $X_1$ and $X_2$ are each uniformly (evenly) distributed on $[0, 1]$, and we use the corresponding $0.1 \times 0.1$ square around the test point as its neighborhood. Since each coordinate must independently fall within a 10% range, only about $10\% \times 10\% = 1\%$ of the observations will be used to make the prediction. In general, the usable fraction shrinks to $0.1^p$ as $p$ grows.
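
This shrinkage is easy to confirm by simulation. A minimal sketch in plain Python (the function name and sample size are my own choices): draw uniform points in $[0, 1]^p$ and measure the fraction landing within $\pm 0.05$ of a central test point in every coordinate.

```python
import random

def neighborhood_fraction(p, n=100_000, radius=0.05, seed=0):
    """Fraction of n uniform points in [0, 1]^p that fall within
    +/- radius of the test point (0.5, ..., 0.5) in every coordinate."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        # A point is a neighbor only if ALL p coordinates are close.
        if all(abs(rng.random() - 0.5) <= radius for _ in range(p)):
            inside += 1
    return inside / n

for p in (1, 2, 5):
    # Empirical fractions come out close to 0.1 ** p: roughly 10%, 1%, 0.001%.
    print(p, neighborhood_fraction(p))
```

With only a handful of dimensions, essentially no training points remain "near" the test point, which is exactly why local methods like KNN break down.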

Exercise 8

Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. $K = 1$) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

For KNN with $K = 1$, the training error rate is necessarily 0%: for any training observation (fed in as a test observation), its nearest neighbor is itself. Since the average of the training and test error rates is 18%, KNN's test error rate must be $2 \times 18\% - 0\% = 36\% > 30\%$, so we should prefer logistic regression for classifying new observations.
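
The zero training error of 1-NN is easy to verify empirically. A minimal sketch in plain Python (the data set and function names are hypothetical, chosen for illustration):

```python
import random

def nearest_neighbor_predict(train, x):
    """1-NN: predict the label of x as the label of its closest training point."""
    _, label = min(train, key=lambda pair: abs(pair[0] - x))
    return label

rng = random.Random(42)
# Hypothetical 1-D two-class data set: class 0 near 0.3, class 1 near 0.7.
train = [(rng.gauss(0.3, 0.15), 0) for _ in range(50)] + \
        [(rng.gauss(0.7, 0.15), 1) for _ in range(50)]

# Training error: predict each training point using the full training set.
# Each point is its own nearest neighbor (distance 0), so every prediction
# matches its true label.
errors = sum(nearest_neighbor_predict(train, x) != y for x, y in train)
print(errors / len(train))  # 0.0
```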