This is a summary of chapter 10 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

The statistical methods from the previous chapters focused on supervised learning. Again, supervised learning is where we have access to a set of predictors *X*, and a response *Y*. The goal is to predict *Y* by using the predictors.

In unsupervised learning, we have a set of features *X*, but no response. The goal is not to predict anything. Instead, the goal is to learn information about the features, such as discovering subgroups or relationships.

In this chapter, we will cover two common methods of unsupervised learning: principal components analysis (PCA) and clustering. PCA is useful for data visualization and data pre-processing before using supervised learning methods. Clustering methods are useful for discovering unknown subgroups or relationships within a dataset.

Unsupervised learning is more challenging than supervised learning because it is more subjective. There is no simple goal for the analysis, and it is hard to assess results because there is no “true” answer. Unsupervised learning is often performed as part of exploratory data analysis.

Principal components analysis was previously covered as part of principal components regression from chapter six. Therefore, this section will be nearly identical to the explanation from chapter six, but without the regression aspects.

When we have a dataset with a large number of predictors, dimension reduction methods can be used to summarize the dataset with a smaller number of representative predictors (dimensions) that collectively explain most of the variability in the data. Each dimension is a linear combination of all of the predictors. For example, if we had a dataset of 5 predictors, the first dimension (*Z*₁) would be as follows:

*Z*₁ = *ϕ*₁₁*X*₁ + *ϕ*₂₁*X*₂ + *ϕ*₃₁*X*₃ + *ϕ*₄₁*X*₄ + *ϕ*₅₁*X*₅

Note that we will always have fewer dimensions than the number of predictors.

The *ϕ* values for the dimensions are known as loading values. The values are subject to the constraint that the sum of the squares of the *ϕ* values in a dimension must equal one. For our above example, this means that:

*ϕ*₁₁² + *ϕ*₂₁² + *ϕ*₃₁² + *ϕ*₄₁² + *ϕ*₅₁² = 1

The *ϕ* values in a dimension make up a “loading vector” that defines a “direction” to explain the predictors. But how exactly do we come up with the *ϕ* values? They are determined through principal components analysis.
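Since the Python equivalents of these examples are planned anyway, here is a minimal NumPy sketch of the idea. The loading values below are invented for illustration, not taken from a fitted PCA:

```python
import numpy as np

# Hypothetical loading vector for the first dimension of a 5-predictor dataset
phi1 = np.array([0.5, 0.5, 0.5, 0.3, 0.4])
# The constraint: the squared loadings sum to one (up to floating point)

# Z1 for each observation is the linear combination of its predictor values
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))   # 10 observations, 5 predictors
Z1 = X @ phi1                  # one score per observation
```

Each observation gets a single *Z*₁ score, so the loading vector collapses five predictor columns into one dimension.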

Principal components analysis is the most popular dimension reduction method. The method involves determining different principal component directions of the data.

The first principal component direction (*Z*₁) defines the direction along which the data varies the most. In other words, it is a linear combination of all of the predictors, such that it explains most of the variance in the predictors.

The second principal component direction (*Z*₂) defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with the first principal component, *Z*₁.

The third principal component direction (*Z*₃) defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with both of the previous principal components, *Z*₁ and *Z*₂.

And so on and so forth for additional principal component directions.

Dimension reduction is best explained with an example. Assume that we have a dataset of different baseball players, which consists of their statistics in 1986, their years in the league, and their salaries in the following year (1987).

We can perform dimension reduction by transforming our 7 different predictors into a smaller number of principal components.

It is important to note that prior to performing principal components analysis, each predictor should be standardized to ensure that all of the predictors are on the same scale. The absence of standardization will cause the predictors with high variance to play a larger role in the final principal components obtained.
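In Python terms, standardization is just centering and scaling each column; a sketch using NumPy, with invented scales for the two predictors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Two predictors on very different scales, e.g. a rate statistic vs. a count
X = np.column_stack([rng.normal(0.27, 0.03, 100),
                     rng.normal(500.0, 150.0, 100)])

# Standardize: subtract each column's mean and divide by its standard deviation
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has mean ~0 and standard deviation ~1, so no predictor
# dominates the principal components merely because of its units
```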

Performing principal components analysis would result in the following *ϕ* values for the first three principal components:

A plot of the loading values of the first two principal components would look as follows:

How do we interpret these principal components?

If we take a look at the first principal component, we can see that there is approximately an equal weight placed on each of the six baseball statistic predictors, and much less weight placed on the years that a player has been in the league. This means that the first principal component roughly corresponds to a player’s level of production.

On the other hand, in the second principal component, we can see that there is a large weight placed on the number of years that a player has been in the league, and much less weight placed on the baseball statistics. This means that the second principal component roughly corresponds to how long a player has been in the league.

In the third principal component, we can see that there is more weight placed on three specific baseball statistics: home runs, RBIs, and walks. This means that the third principal component roughly corresponds to a player’s batting ability.

Performing principal components analysis also tells us the percent of variation in the data that is explained by each of the components. The first principal component from the baseball data explains 67% of the variation in the predictors. The second principal component explains 15%. The third principal component explains 9%. Therefore, together, these three principal components explain 91% of the variation in the data.

This helps explain the key idea of principal components analysis, which is that a small number of principal components are sufficient to explain most of the variability in the data. Through principal components analysis, we’ve reduced the dimensions of our dataset from seven to three.

The number of principal components to use can be chosen by using the number of components that explain a large amount of the variation in the data. For example, in the baseball data, the first three principal components explain 91% of variation in the data, so using just the first three is a valid option. Aside from that, there is no objective way to decide, unless PCA is being used in the context of a supervised learning method, such as principal components regression. In that case, we can perform cross-validation to choose the number of components that results in the lowest test error.
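The proportion of variance explained can be computed directly from a singular value decomposition of the standardized data. This NumPy sketch (on random data, so the actual percentages are meaningless) shows the bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first

# Singular values of the standardized data give the component variances
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pve = s**2 / np.sum(s**2)        # proportion of variance explained
cumulative = np.cumsum(pve)

# Keep enough components to explain, say, 90% of the variance
k = int(np.searchsorted(cumulative, 0.90)) + 1
```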

Principal components analysis can ultimately be used in regression, classification, and clustering methods.

Clustering refers to a broad set of techniques for finding subgroups in a dataset. The two most common clustering methods are:

- K-means Clustering
- Hierarchical Clustering

In K-means clustering, we are looking to partition our dataset into a pre-specified number of clusters.

In hierarchical clustering, we don’t know in advance how many clusters we want. Instead, we end up with a tree-like visual representation of the observations, called a dendrogram, which allows us to see at once the clusterings obtained for each possible number of clusters from 1 to *n*.

K-means clustering involves specifying a desired number of clusters *K* and assigning each observation to exactly one of the clusters. The clusters are determined such that the total within-cluster variation, summed over all *K* clusters, is minimized. Within-cluster variation is typically defined as the sum of the pairwise squared Euclidean distances between the observations in a cluster, divided by the number of observations in the cluster. The general process of performing K-means clustering is as follows:

- Randomly assign each observation to one of the clusters. These will serve as the initial cluster assignments.
- For each of the clusters, compute the cluster centroid. The cluster centroid is the mean of the observations assigned to the cluster.
- Reassign each observation to the cluster whose centroid is the closest.
- Continue repeating steps 2 & 3 until the result no longer changes.
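The steps above can be sketched as a short NumPy implementation. This `kmeans` is a toy function written for illustration, not a library routine:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Toy K-means following the steps above: random initial assignments,
    then alternate centroid updates and reassignment until nothing changes."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))        # step 1: random assignment
    for _ in range(n_iter):
        # step 2: each cluster's centroid is the mean of its observations
        centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k)
            else X[rng.integers(len(X))]            # reseed an empty cluster
            for k in range(K)
        ])
        # step 3: reassign each observation to its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # converged
            break
        labels = new_labels
    return labels, centroids
```

Because the result depends on the random initial assignment, running it several times and keeping the result with the lowest total within-cluster variation is standard practice.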

A visual of the process looks something like the following graphic:

However, there is an issue with K-means clustering in that the final results depend on the initial random cluster assignment. Therefore, the algorithm should be run multiple times and the result that minimizes the total within-cluster variation should be chosen.

Hierarchical clustering does not require choosing the number of clusters in advance. Additionally, it results in a tree-based representation of the observations, known as a dendrogram.

The most common type of hierarchical clustering is bottom-up (agglomerative) clustering. The bottom-up phrase refers to the fact that a dendrogram is built starting from the leaves and combining clusters up to the trunk.

The following graphic is an example of hierarchical clustering in action:

At the bottom of a dendrogram, each leaf represents one of the observations in the data. As we move up the tree, some leaves begin to fuse into branches. These observations that fuse are observations that are quite similar to each other. As we move higher up the tree, branches fuse with leaves or with other branches. Additionally, the earlier that fusions occur, the more similar the groups of observations are to each other.

When interpreting a dendrogram, be careful not to draw any conclusions about the similarity of observations based on their proximity along the horizontal axis. In other words, if two observations are next to each other on a dendrogram, it doesn’t mean that they are similar.

Instead, we draw conclusions about the similarity of two observations based on their location on the vertical axis where branches containing both of those observations are first fused.

In hierarchical clustering, the final clusters are identified by making a horizontal cut across the dendrogram. The distinct sets of observations below the cut are the clusters. The attractive aspect of hierarchical clustering is that a single dendrogram can be used to obtain multiple different clustering options. However, the choice of where to make the cut is often not clear. In practice, people usually look at the dendrogram and select a sensible number of clusters, based on the heights of the fusions and the number of clusters desired.

The term “hierarchical” refers to the fact that clusters obtained by cutting a dendrogram at some height are nested within clusters that are obtained by cutting the dendrogram at a greater height. Therefore, the assumption of a hierarchical structure might be unrealistic for some datasets, in which case K-means clustering would be better. An example of this would be if we had a dataset consisting of males and females from America, Japan, and France. The best division into two groups would be by gender. The best division into three groups would be by nationality. However, we cannot split one of the two gender clusters and end up with three nationality clusters. In other words, the true clusters are not nested.

The hierarchical clustering dendrogram is obtained through the following process:

- First, we define some dissimilarity measure for each pair of observations in the dataset and initially treat each observation as its own cluster.
- The two clusters most similar to each other are fused.
- The algorithm proceeds iteratively until all observations belong to one cluster and the dendrogram is complete. However, the concept of dissimilarity is extended to be able to define dissimilarity between two clusters if one or both have multiple observations. This is known as linkage, of which there are four types.
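This iterative fusing is what SciPy's `linkage` function performs; a sketch on simulated data, using Euclidean dissimilarity and complete linkage:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated groups of observations
X = np.vstack([rng.normal(0.0, 1.0, (15, 2)),
               rng.normal(6.0, 1.0, (15, 2))])

d = pdist(X, metric="euclidean")        # pairwise dissimilarities
Z = linkage(d, method="complete")       # fuse the most similar clusters first

# Cut the tree at a height that yields two clusters
labels = fcluster(Z, t=2, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree itself.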

See the hierarchical clustering graphic above for an example of this process.

Complete linkage involves computing all pairwise dissimilarities between the observations in cluster A and cluster B, and recording the largest of the dissimilarities. Complete linkage is commonly used in practice.

Single linkage is the opposite of complete linkage, where we record the smallest dissimilarity instead of the largest. Single linkage can result in extended, trailing clusters where single observations are fused one-at-a-time.

Average linkage involves computing all pairwise dissimilarities between the observations in cluster A and cluster B, and recording the average of the dissimilarities. Average linkage is commonly used in practice.

Centroid linkage involves determining the dissimilarity between the centroid for cluster A and the centroid for cluster B. Centroid linkage is often used in genomics. However, it suffers from the drawback of potential inversion, where two clusters are fused at a height below either of the individual clusters in the dendrogram.

As mentioned previously, the first step of the hierarchical clustering algorithm involves defining some initial pairwise dissimilarity measure for each pair of individual observations in the dataset. There are two options for this:

- Euclidean distance
- Correlation-based distance

The Euclidean distance is simply the straight-line distance between two observations.

Correlation-based distance considers two observations to be similar if their features are highly correlated, even if the observations are far apart in terms of Euclidean distance. In other words, correlation-based distance focuses on the shapes of observation profiles instead of their magnitude.

The choice of the pairwise dissimilarity measure has a strong effect on the final dendrogram obtained. Therefore, the type of data being clustered and the problem being solved should determine the measure to use. For example, assume that we have data on shoppers and the number of times each shopper has bought a specific item. The goal is to cluster shoppers together to ultimately show different advertisements to different clusters.

- If we use Euclidean distance, shoppers who have bought very few items overall will be clustered together.
- On the other hand, correlation-based distance would group together shoppers with similar preferences, regardless of shopping volume.

Correlation-based distance would be the preferred choice for this problem.
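A small NumPy illustration of the difference, with made-up purchase counts for three hypothetical shoppers:

```python
import numpy as np

a = np.array([1.0, 2, 3, 4, 5])       # low-volume shopper
b = np.array([10.0, 20, 30, 40, 50])  # same preferences, 10x the volume
c = np.array([5.0, 4, 3, 2, 1])       # opposite preferences

def corr_dist(u, v):
    # Correlation-based distance: 1 minus the correlation of the two profiles
    return 1.0 - np.corrcoef(u, v)[0, 1]

# Euclidean distance says a is far from b but close to c;
# correlation-based distance says a and b have identical shapes
euclid_ab = np.linalg.norm(a - b)
euclid_ac = np.linalg.norm(a - c)
```

Here `corr_dist(a, b)` is essentially 0 while `corr_dist(a, c)` is the maximum of 2, the reverse of what Euclidean distance suggests.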

Regardless of which dissimilarity measure is used, it is usually good practice to scale all variables to have a standard deviation of one. This ensures that each variable is given equal importance in the hierarchical clustering that is performed. However, there are cases where this is not desired.

In clustering, decisions have to be made that ultimately have big consequences on the final result obtained.

- Should features be standardized to have a mean of zero and standard deviation of one?
- In hierarchical clustering, what dissimilarity measure should be used? What type of linkage should be used? Where should the dendrogram be cut to obtain the clusters?
- In K-means clustering, how many clusters should we look for in the data?

In practice, we try several different choices and look for the most interpretable or useful solution. There simply is no “right” answer. Any solution that leads to something interesting should be considered.

Any time that clustering is performed on a dataset, we will obtain clusters. But what we really want to know is whether or not the obtained clusters represent true subgroups in our data. If we obtained a new and independent dataset, would we obtain the same clusters? This is a tough question to answer, and there isn’t a consensus on the best approach to answering it.

K-means and hierarchical clustering will assign every observation to some cluster, which may cause some clusters to be distorted by outliers that do not truly belong to any cluster.

Additionally, clustering methods are not robust to data changes. For example, if we performed clustering on a full dataset, and then performed clustering on a subset of the data, we likely would not obtain similar clusters.

In practice, we should not report the results of clustering as absolute truth about a dataset. Instead, we should treat the results as a starting point for hypotheses and further study, preferably on independent datasets.

*Originally published at* https://www.bijenpatel.com *on August 10, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(ISLR)
library(MASS)
library(ggplot2)
library(gridExtra) # For side-by-side ggplots
library(e1071)
library(caret)
# We will use the USArrests dataset
head(USArrests)
# Analyze the mean and variance of each variable
apply(USArrests, 2, mean)
apply(USArrests, 2, var)
# The prcomp function is used to perform PCA
# We specify scale=TRUE to standardize the variables to have mean 0 and standard deviation 1
USArrests_pca = prcomp(USArrests, scale=TRUE)
# The rotation component of the PCA object indicates the principal component loading vectors
USArrests_pca$rotation
# Easily plot the first two principal components and the vectors
# Setting scale=0 ensures that arrows are scaled to represent loadings
biplot(USArrests_pca, scale=0)
# The variance explained by each principal component is obtained by squaring the standard deviation component
USArrests_pca_var = USArrests_pca$sdev^2
# The proportion of variance explained (PVE) of each component
USArrests_pca_pve = USArrests_pca_var/sum(USArrests_pca_var)
# Plotting the PVE of each component
USArrests_pca_pve = data.frame(Component=c(1, 2, 3, 4), PVE=USArrests_pca_pve)
ggplot(USArrests_pca_pve, aes(x=Component, y=PVE)) + geom_point() + geom_line()
# Next, we perform PCA on the NCI60 data, which contains gene expression data from 64 cancer lines
nci_pca = prcomp(NCI60$data, scale=TRUE)
NCI60_scaled = scale(NCI60$data) # Alternatively, we could have scaled the data
# Create a data frame of the principal component scores and the cancer type labels
nci_pca_x = data.frame(nci_pca$x)
nci_pca_x$labs = NCI60$labs
# Plot the first two principal components in ggplot
ggplot(nci_pca_x, aes(x=PC1, y=PC2, col=labs)) + geom_point()
# Similar cancer types have similar principal component scores
# This indicates that similar cancer types have similar gene expression levels
# Plot the first and third principal components in ggplot
ggplot(nci_pca_x, aes(x=PC1, y=PC3, col=labs)) + geom_point()
# Plot the PVE of each principal component
nci_pca_pve = 100*nci_pca$sdev^2/sum(nci_pca$sdev^2)
nci_pca_pve_df = data.frame(Component=c(1:64), PVE=nci_pca_pve)
ggplot(nci_pca_pve_df, aes(x=Component, y=PVE)) + geom_point() + geom_line()
# Dropoff in PVE after the 7th component
```

```
# First, we will perform K-Means Clustering on a simulated dataset that truly has two clusters
# Create the simulated data
set.seed(2)
x = matrix(rnorm(50*2), ncol=2)
x[1:25, 1] = x[1:25, 1]+3
x[1:25, 2] = x[1:25, 2]-4
# Plot the simulated data
plot(x)
# Perform K-Means Clustering by specifying 2 clusters
# The nstart argument runs K-Means with multiple initial cluster assignments
# Running with multiple initial cluster assignments is desirable to minimize within-cluster variance
kmeans_example = kmeans(x, 2, nstart=20)
# The tot.withinss component contains the within-cluster variance
kmeans_example$tot.withinss
# Plot the data and the assigned clusters through the kmeans function
plot(x, col=(kmeans_example$cluster+1), pch=20, cex=2)
# Next, we perform K-Means clustering on the NCI60 data
set.seed(2)
NCI60_kmeans = kmeans(NCI60_scaled, 4, nstart=20)
# View the cluster assignments
NCI60_kmeans_clusters = NCI60_kmeans$cluster
table(NCI60_kmeans_clusters, NCI60$labs)
```

```
# We will continue using the simulated dataset to perform hierarchical clustering
# We perform hierarchical clustering with complete, average, and single linkage
# For the dissimilarity measure, we will use Euclidean distance through the dist function
hclust_complete = hclust(dist(x), method="complete")
hclust_average = hclust(dist(x), method="average")
hclust_single = hclust(dist(x), method="single")
# Plot the dendrograms
plot(hclust_complete, main="Complete Linkage")
plot(hclust_average, main="Average Linkage")
plot(hclust_single, main="Single Linkage")
# Determine the cluster labels for different numbers of clusters using the cutree function
cutree(hclust_complete, 2)
cutree(hclust_average, 2)
cutree(hclust_single, 2)
# Scale the variables before performing hierarchical clustering
x_scaled = scale(x)
hclust_complete_scale = hclust(dist(x_scaled), method="complete")
plot(hclust_complete_scale)
# Use correlation-based distance for dissimilarity instead of Euclidean distance
# Since correlation-based distance can only be used when there are at least 3 features, we simulate new data
y = matrix(rnorm(30*3), ncol=3)
y_corr = as.dist(1-cor(t(y)))
hclust_corr = hclust(y_corr, method="complete")
plot(hclust_corr)
# Next, we perform hierarchical clustering on the NCI60 data
# First, we standardize the variables
NCI60_scaled = scale(NCI60$data)
# Perform hierarchical clustering with complete linkage
NCI60_hclust = hclust(dist(NCI60_scaled), method="complete")
plot(NCI60_hclust)
# Cut the tree to yield 4 clusters
NCI60_clusters = cutree(NCI60_hclust, 4)
# View the cluster assignments
table(NCI60_clusters, NCI60$labs)
```

This is a summary of chapter 9 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

*Support vector machines* (SVMs) are often considered one of the best “out of the box” classifiers, though this is not to say that another classifier such as logistic regression couldn’t outperform an SVM.

The SVM is a generalization of a simple classifier known as the *maximal margin classifier*. The maximal margin classifier is simple and intuitive, but cannot be applied to most datasets because it requires classes to be perfectly separable by a boundary. Another classifier known as the *support vector classifier* is an extension of the maximal margin classifier, which can be applied in a broader range of cases. The support vector machine is a further extension of the support vector classifier, which can accommodate non-linear class boundaries.

SVMs are intended for the binary classification setting, in which there are only two classes.

In a *p*-dimensional space, a hyperplane is a flat subspace of dimension *p* − 1. For example, in a two-dimensional setting, a hyperplane is a flat one-dimensional subspace, which is also simply known as a line. A hyperplane in a *p*-dimensional setting is defined by the following equation:

*β*₀ + *β*₁*X*₁ + *β*₂*X*₂ + … + *β*ₚ*X*ₚ = 0

Any point *X* in *p*-dimensional space that satisfies the equation is a point that lies on the hyperplane. If some point *X* results in a value greater than or less than 0 for the equation, then the point lies on one of the sides of the hyperplane.

In other words, a hyperplane essentially divides a *p*-dimensional space into two parts.

Suppose that we had a training dataset with *p* predictors and *n* observations, where each observation belongs to one of two classes. Suppose that we also had a separate test dataset. Our goal is to develop a classifier based on the training data and use it to classify the test data. How can we classify the data based on the concept of the separating hyperplane?

Assume that it is possible to create a hyperplane that separates the training data perfectly according to their class labels. We could use this hyperplane as a natural classifier. An observation from the test dataset would be assigned to a class, depending on which side of the hyperplane it is located. The determination is made by plugging the test observation into the hyperplane equation. If the value is greater than 0, it is assigned to the class corresponding to that side. If the value is less than 0, then it is assigned to the other class.
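As a sketch (the coefficients here are invented for illustration), classifying by hyperplane side is just checking the sign of the hyperplane equation:

```python
import numpy as np

# A hypothetical separating hyperplane in two dimensions: 1 + 2*X1 - 3*X2 = 0
beta0 = 1.0
beta = np.array([2.0, -3.0])

def classify(x):
    # Plug the observation into the hyperplane equation and check the sign
    return 1 if beta0 + x @ beta > 0 else -1
```

For example, `classify(np.array([2.0, 0.0]))` evaluates 1 + 4 = 5 > 0, so that point falls on the positive side.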

However, when we can perfectly separate the classes, many possibilities exist for the hyperplane. The following chart shows an example where two hyperplanes separate the classes perfectly.

This is where the maximal margin classifier helps determine the hyperplane to use.

The maximal margin classifier is a separating hyperplane that is farthest from the training observations.

The method involves determining the perpendicular distance from each training observation to some hyperplane. The smallest such distance is known as the margin. The maximal margin classifier settles on the hyperplane for which the margin is largest. In other words, the chosen hyperplane is the one that has the farthest minimum distance to the training observations.

The maximal margin classifier is often successful, but can lead to overfitting when we have a lot of predictors in our dataset.

The points that end up supporting the maximal margin hyperplane are known as *support vectors*. If these points are moved even slightly, the maximal margin hyperplane would move as well. The fact that the maximal margin hyperplane depends only on a small subset of observations is an important property that will also be discussed in the sections on support vector classifiers and support vector machines.

The maximal margin hyperplane is the solution to an optimization problem with three components:

*M* is the margin of the hyperplane. The second component is a constraint that ensures that the perpendicular distance from any observation to the hyperplane is given by the following:

The third component guarantees that each observation will be on the correct side of the hyperplane, with some cushion *M*.

The maximal margin classifier is a natural way to perform classification, but only if a separating hyperplane exists. However, that is usually not the case in real-world datasets.

The concept of the separating hyperplane can be extended to develop a hyperplane that *almost* separates the classes. This is done by using a *soft margin*. The generalization of the maximal margin classifier to the non-separable case is known as the support vector classifier.

In most cases, we usually don’t have a perfectly separating hyperplane for our datasets. However, even if we did, there are cases where it wouldn’t be desirable. This is due to sensitivity issues from individual observations. For example, the addition of a single observation could result in a dramatic change in the maximal margin hyperplane.

Therefore, it is usually a good idea to consider a hyperplane that does *not* perfectly separate the classes. This provides two advantages:

- Greater robustness to individual observations
- Better classification of most of the training observations

In other words, it is usually worthwhile to misclassify a few training observations in order to do a better job of classifying the other observations. This is what the support vector classifier does. It allows observations to be on the wrong side of the margin, and even the wrong side of the hyperplane.

The support vector classifier will classify a test observation depending on what side of the hyperplane that it lies. The hyperplane is the solution to an optimization problem that is similar to the one for the maximal margin classifier.

*M* is the margin of the hyperplane. The second component is a constraint that ensures that the perpendicular distance from any observation to the hyperplane is given by the following:

*ϵ* is a *slack variable* that allows observations to be on the wrong side of the margin or hyperplane. It tells us where the *i*th observation is located, relative to the hyperplane and margin.

- If *ϵ* = 0, the observation is on the correct side of the margin
- If *ϵ* > 0, the observation is on the wrong side of the margin
- If *ϵ* > 1, the observation is on the wrong side of the hyperplane

*C* is a nonnegative tuning parameter that bounds the sum of the *ϵ* values. It determines the number and severity of violations to the margin and hyperplane that will be tolerated.

In other words, *C* is a budget for the amount that the margin can be violated by the *n* observations. If *C* = 0, then there is no budget, and the result would simply be the same as the maximal margin classifier (if a perfectly separating hyperplane exists). If *C* > 0, no more than *C* observations can be on the wrong side of the hyperplane because *ϵ* > 1 in those cases, and the constraint from the fourth component doesn’t allow for it. As the budget increases, more violations to the margin are tolerated, and so the margin becomes wider.

It should come as no surprise that *C* is usually chosen through cross-validation. *C* controls the bias-variance tradeoff. When *C* is small, the classifier is highly fit to the data, resulting in high variance. When *C* is large, the classifier may be too general and oversimplified for the data, resulting in high bias.
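With scikit-learn (the labs in the book use R's `e1071`; this Python version is an equivalent sketch), the cross-validation looks like the block below. Note that scikit-learn's `C` penalizes violations, so it behaves as the *inverse* of the budget described above: a large scikit-learn `C` tolerates few violations.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping classes in two dimensions
X = np.vstack([rng.normal(-1.0, 1.0, (40, 2)),
               rng.normal(1.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

# 5-fold cross-validation over a grid of C values
grid = GridSearchCV(SVC(kernel="linear"),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
best_C = grid.best_params_["C"]
```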

In support vector classifiers, the support vectors for the hyperplane are a bit different than the ones from the maximal margin hyperplane. They are the observations that lie directly on the margin and the wrong side of the margin. The larger the value of *C*, the more support vectors there will be.

The fact that the support vector classifier is based only on a small subset of the training data means that it is robust to the behavior of observations far from the hyperplane. This is different from other classification methods such as linear discriminant analysis, where the mean of all observations within a class helps determine the boundary. However, support vector classifiers are similar to logistic regression because logistic regression is not very sensitive to observations far from the decision boundary.

First, we will discuss how a linear classifier can be converted into a non-linear classifier. Then, we’ll talk about support vector machines, which do this in an automatic way.

The support vector classifier is a natural approach for classification in the binary class setting, if the boundary between the classes is linear. However, there are many cases in practice where we need a non-linear boundary.

In chapter 7, we were able to extend linear regression to address non-linear relationships by enlarging the feature space by using higher-order polynomial functions, such as quadratic and cubic terms. Similarly, non-linear boundaries can be created through the use of higher-order polynomial functions. For example, we could fit a support vector classifier using each predictor and its squared term:

This would change the optimization problem to become the following:

However, the problem with enlarging the feature space is that there are many ways to do so. We could use cubic or even higher-order polynomial functions. We could add interaction terms. Many possibilities exist, which could lead to inefficiency in computation. Support vector machines allow for enlarging the feature space in a way that leads to efficient computations.

The *support vector machine* is an extension of the support vector classifier that enlarges the feature space by using *kernels*. Before we talk about kernels, let’s discuss the solution to the support vector classifier optimization problem.

The details of how the support vector classifier is computed is highly technical. However, it turns out that the solution only involves the inner products of the observations, instead of the observations themselves. The inner product of two vectors is illustrated as follows:

The linear support vector classifier can be represented as:

There are *n* parameters (*α*ᵢ), one per training observation. The parameters are estimated using the inner products between all pairs of training observations.

To evaluate the support vector classifier function *f*(*x*), we compute the inner product between a new observation *x* and each training observation *x*ᵢ. However, the *α*ᵢ parameters are nonzero only for the support vectors. In other words, if an observation is not a support vector, then its *α*ᵢ is zero. If we represent *S* as the collection of the support vectors, then the solution function can be rewritten as the following:
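As a concrete illustration, here is a minimal Python sketch of evaluating the rewritten solution function. All of the quantities below (the support vectors, the *α*ᵢ weights, and the intercept *β*₀) are made-up values for illustration, not the output of a fitted model.

```python
import numpy as np

# Hypothetical fitted quantities: the support vectors in S, their alpha
# weights, and the intercept beta_0 (all values are made up for illustration)
support_vectors = np.array([[1.0, 2.0], [3.0, 1.0], [2.0, 2.5]])
alphas = np.array([0.5, -0.7, 0.2])   # alpha_i is nonzero only for support vectors
beta0 = 0.1

def svc_decision(x, sv, alpha, b0):
    """f(x) = beta_0 + sum over support vectors of alpha_i * <x, x_i>."""
    return b0 + np.sum(alpha * (sv @ x))

x_new = np.array([2.0, 1.0])
score = svc_decision(x_new, support_vectors, alphas, beta0)
print(np.sign(score))   # classify the new observation by the sign of f(x)
```

Only the support vectors enter the sum, which is exactly why observations far from the boundary cannot affect the fit.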

A kernel is a function that quantifies the similarity of two observations, and is a generalization of the inner product. The kernel function used in the support vector classifier is simply

This is known as a linear kernel because the support vector classifier results in a linear boundary. The linear kernel quantifies the similarity of a pair of observations using their Pearson (standard) correlation.

Instead of using the linear kernel, we could use a polynomial kernel:

Using a nonlinear kernel results in a non-linear decision boundary. When the support vector classifier is combined with a nonlinear kernel, it results in the support vector machine. The support vector machine takes on the form:

The polynomial kernel is just one example of a nonlinear kernel. Another common choice is the radial kernel, which takes the following form:

The advantage of using kernels is that the computations can be performed without ever explicitly working in the enlarged feature space. For example, with the polynomial kernel, we simply compute the inner product of two observations and raise the (shifted) result to the power *d*, which implicitly corresponds to an inner product in a much higher-dimensional feature space. This is known as the kernel trick.
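The kernel trick can be verified numerically. The sketch below (in Python, with an explicit degree-2 feature map written out by hand) shows that the polynomial kernel with *d* = 2 equals an ordinary inner product in an enlarged six-dimensional feature space, without that space ever being constructed when the kernel is used:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    # Polynomial kernel: K(x, z) = (1 + <x, z>)^d
    return (1.0 + x @ z) ** d

def expand(x):
    # Explicit degree-2 feature map whose inner product reproduces the kernel
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# The kernel evaluates the inner product in the enlarged space without ever
# constructing the expanded features -- the "kernel trick"
print(poly_kernel(x, z), expand(x) @ expand(z))
```

Both expressions agree exactly, but the kernel only ever touches the original two-dimensional observations.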

The concept of separating hyperplanes upon which SVMs are based does not lend itself naturally to more than two classes. However, there are a few approaches for extending SVMs beyond the binary class setting. The most common approaches are *one-versus-one* and *one-versus-all*.

A *one-versus-one* or *all-pairs* approach develops multiple SVMs, each of which compares a pair of classes. Test observations are classified in each of the SVMs. In the end, we count the number of times that the test observation is assigned to each class. The class that the observation was assigned to most is the class assigned to the test observation.
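The vote counting in one-versus-one can be sketched as follows. The pairwise "classifiers" here are stand-in rules (assigning *x* to whichever class label it is numerically closer to), not fitted SVMs; the point is only to show the tallying:

```python
from collections import Counter
from itertools import combinations

def make_pairwise_classifiers(classes):
    # One binary classifier per pair of classes; each returns the winning
    # class for a test observation x (stand-in rule, not a fitted SVM)
    def classify(pair):
        a, b = pair
        return lambda x: a if abs(x - a) < abs(x - b) else b
    return {pair: classify(pair) for pair in combinations(classes, 2)}

def one_versus_one_predict(x, classifiers):
    # Run every pairwise classifier and tally the class each one votes for
    votes = Counter(clf(x) for clf in classifiers.values())
    return votes.most_common(1)[0][0]

clfs = make_pairwise_classifiers([0, 1, 2])   # K = 3 classes -> 3 pairwise SVMs
print(one_versus_one_predict(1.2, clfs))      # class with the most votes wins
```

With *K* classes, this approach requires fitting *K*(*K* − 1)/2 pairwise classifiers.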

The *one-versus-all* approach develops multiple SVMs, each of which compares one class to all of the other classes combined. Assume that fitting an SVM comparing some class *k* to all of the others resulted in the following parameters:

Let x represent a test observation. The test observation is assigned to the class for which the following is the largest:

As previously mentioned, only the support vectors end up playing a role in the support vector classifier that is obtained. This is because the loss function is exactly zero for observations that are on the correct side of the margin.

The loss function for logistic regression is not exactly zero anywhere. However, it is very small for observations that are far from the decision boundary.

Due to the similarities between their loss functions, support vector classifiers and logistic regression often give similar results. However, when the classes are well separated, support vector machines tend to perform better. In cases where there is more overlap, logistic regression tends to perform better. In any case, both should always be tested, and the method that performs best should be chosen.
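The loss comparison above can be made concrete. The sketch below plots nothing, but evaluates the SVM hinge loss and the logistic loss at a few values of *y·f*(*x*): the hinge loss is exactly zero once an observation is on the correct side of the margin, while the logistic loss is small but never exactly zero.

```python
import numpy as np

def hinge_loss(yf):
    # SVM loss: exactly zero once y*f(x) >= 1 (correct side of the margin)
    return np.maximum(0.0, 1.0 - yf)

def logistic_loss(yf):
    # Logistic regression loss: log(1 + exp(-y*f(x))), never exactly zero
    return np.log1p(np.exp(-yf))

yf = np.array([-1.0, 0.0, 1.0, 3.0])
print(hinge_loss(yf))     # zero for yf >= 1
print(logistic_loss(yf))  # small but positive even for yf = 3
```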

*Originally published at* https://www.bijenpatel.com *on August 9, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(ISLR)
library(MASS)
library(e1071)
# We will generate a random dataset of observations belonging to 2 classes
set.seed(1)
x=matrix(rnorm(20*2), ncol=2)
y=c(rep(-1, 10), rep(1, 10))
x[y==1,]=x[y==1,] + 1
plot(x, col=(3-y))
# To use SVM, the response must be encoded as a factor variable
data = data.frame(x=x, y=as.factor(y))
# Fit a Support Vector Classifier with a cost of 10
# The scale argument is used to scale predictors
# In this example, we will not scale them
svmfit = svm(y~., data=data, kernel="linear", cost=10, scale=FALSE)
# Plot the fit
plot(svmfit, data)
# Determine which observations are the support vectors
svmfit$index
# Fit an SVM with a smaller cost of 0.1
svmfit = svm(y~., data=data, kernel="linear", cost=0.1, scale=FALSE)
# The e1071 library contains a tune function
# The function performs cross-validation with different cost values
set.seed(1)
tune.out = tune(svm, y~., data=data, kernel="linear", ranges=list(cost=c(0.001, 0.01, 0.1, 1, 5, 10)))
# Check the summary to see the error rates of the different models
# The model with a cost of 0.1 has the lowest error
summary(tune.out)
# Choose the best model
bestmod = tune.out$best.model
summary(bestmod)
# The predict function can be used to predict classes on a set of test observations
xtest = matrix(rnorm(20*2), ncol=2)
ytest = sample(c(-1, 1), 20, rep=TRUE)
xtest[ytest==1,]=xtest[ytest==1,] + 1
testdata=data.frame(x=xtest, y=as.factor(ytest))
ypred = predict(bestmod, testdata)
table(predict=ypred, truth=testdata$y)
```

```
# Now, we will fit a Support Vector Machine model
# We can do this by simply using a non-linear kernel in the svm function
# Generate a dataset with a non-linear class boundary
set.seed(1)
x=matrix(rnorm(200*2), ncol=2)
x[1:100,]=x[1:100,]+2
x[101:150,]=x[101:150,]-2
y=c(rep(1, 150), rep(2, 50))
data = data.frame(x=x, y=as.factor(y))
plot(x, col=y)
# Split the data into training and test sets
train = sample(200, 100)
# Fit an SVM with a radial kernel
svmfit=svm(y~., data=data[train,], kernel="radial", gamma=1, cost=1)
plot(svmfit, data[train,])
# Perform cross-validation using the tune function to test different choices for cost
set.seed(1)
tune.out = tune(svm, y~., data=data[train,], kernel="radial",
ranges=list(cost=c(0.1, 1, 10, 100, 1000),
gamma=c(0.5, 1, 2, 3, 4)))
summary(tune.out)
# Cost of 1 and Gamma of 2 has the lowest error
# Test the model on the test dataset
table(true=data[-train,"y"], pred=predict(tune.out$best.model, newdata=data[-train,]))
```

This is a summary of chapter 8 of the *Introduction to Statistical Learning* textbook.

Tree-based methods for regression and classification involve segmenting the predictor space into a number of simple regions. To make a prediction for an observation, we simply use the mean or mode of the training observations in the region that it belongs to. Since the set of splitting rules used to segment the predictor space can be summarized in a tree, these approaches are known as decision tree methods.

These methods are simple and useful for interpretation, but not competitive in terms of prediction accuracy when compared with the methods from the chapter on linear model selection and regularization. However, advanced methods such as bagging, random forests, and boosting can result in dramatic improvements at the expense of a loss in interpretation.

Assume that we had a baseball salary dataset that consisted of the number of years that a player was in the league, the number of hits in the previous year, and each player’s log-transformed salary.

We might get a decision tree that looks like the following:

The interpretation of the tree might look as follows:

- Years is the most important factor in determining salary — players with less experience earn less money
- If a player is experienced (more than 5 years in the league), then the number of hits plays a role in the salary

The tree is probably an oversimplification of the relationship, but is easy to interpret and has a nice graphical representation.

In general, the process of building a regression tree is a two-step procedure:

- Divide the predictor space (the set of possible values for all *X*ⱼ) into *J* distinct and non-overlapping regions (*R*₁, *R*₂, …, *R*ⱼ)
- For every observation that falls into some region, we make the same prediction, which is the mean of the response values in that region

The regions are constructed by dividing the predictor space into “box” shapes for simplicity and ease of interpretation. The goal is to find boxes that minimize the RSS given by the following:

Unfortunately, it is computationally infeasible to consider every possible partition of the feature space. Therefore, we take a greedy top-down approach known as recursive binary splitting. First, we consider all of the predictors and all of the possible cutpoints (*s*) for each predictor, and choose the predictor and cutpoint that result in the lowest RSS. Next, the process is repeated to further split the data, but this time we split one of the two previously identified regions, which results in 3 total regions. The process continues until some stopping criterion is reached.
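One step of recursive binary splitting can be sketched in a few lines of Python: scan every predictor *j* and every cutpoint *s*, and keep the pair that minimizes the two-region RSS (toy data, not from the book):

```python
import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: scan every predictor j and
    cutpoint s, and return the (j, s) pair minimizing the two-region RSS."""
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = np.array([1.0, 1.2, 0.9, 5.0, 5.2])
print(best_split(X, y))   # the split separates the low and high response groups
```

A full tree is grown by applying this search recursively within each resulting region.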

However, the process may end up overfitting the data and result in a tree that is too complex. A smaller tree would result in lower variance and easier interpretation at the cost of a little more bias. Therefore, we should first grow a large complex tree, and then prune it back to obtain a subtree.

This is done through a method known as cost complexity pruning, or weakest link pruning. The method allows us to consider a small set of subtrees instead of every possible subtree. This is done through the addition of a tuning parameter, *α*. For each value of *α*, there is a subtree *T* such that the following is as small as possible:

- Where |*T*| is the number of terminal nodes
- Where *R*ₘ is the rectangle corresponding to the *m*-*th* terminal node

The *α* tuning parameter controls a tradeoff between tree complexity and its fit to the data. As *α* increases, there is a price to pay for having a tree with many terminal nodes, so the equation will typically be minimized for a smaller subtree. This is similar to the lasso regression method, which controlled the complexity of the linear model. As *α* is increased from zero, the branches get pruned in a nested and predictable fashion, so getting the subtree as a function of *α* is simple. The final value of *α* is selected using cross-validation.

Classification trees are similar to regression trees, but predict qualitative responses. In regression, the prediction for some observation is given by the mean response of the observations at some terminal node. In classification, the prediction is given by the most commonly occurring class at the terminal node.

In classification, instead of using the RSS, we look at either the Gini index or the cross-entropy. Both are very similar measures, where small values indicate node purity, meaning that a node contains predominantly observations from a single class. Either of these measures can be used to build a tree and evaluate splits.

When it comes to pruning a tree, we could either use the Gini index, cross-entropy, or the classification error rate. The classification error rate is the fraction of observations in a region that do not belong to the most common class. The classification error rate may be preferable for the purposes of prediction accuracy and tree pruning.

In a dataset, we may have a qualitative variable that takes on more than two values, such as an ethnicity variable.

In a decision tree, these variables are typically split through the use of letters or numbers that are assigned to the values of the qualitative variable.

The advantage of decision trees over other methods is their simplicity. Decision trees are very easy to explain to others, and closely mirror human decision-making. They can be displayed graphically and easily interpreted by non-experts.

The disadvantage is that the ease of interpretation comes at the cost of prediction accuracy. Usually, other regression and classification methods are more accurate.

However, advanced methods such as bagging, random forests, and boosting can greatly improve predictive performance at the cost of interpretability.

Bagging is a general purpose method for reducing the variance of a statistical learning method. It is very useful for decision trees because they suffer from high variance.

A natural way to reduce variance and increase prediction accuracy is to take many training sets from the population, build separate trees using each dataset, and average the predictions. However, it isn’t practical to do this because we usually never have access to multiple different training datasets. Instead, we can use the bootstrap method to generate B different bootstrapped training datasets, create B trees, and average the predictions. In classification, instead of averaging the predictions, we take a majority vote, where the overall prediction is the most occurring class.

In bagging, the trees are grown large and are not pruned. Additionally, there is a straightforward way to estimate the test error of a bagged model without needing to perform cross-validation. Remember that bootstrapped datasets are created through resampling, which allows for observations to be repeated. As it turns out, on average, each bagged tree will make use of about two-thirds of the total observations because of the resampling. The remaining one-third of the observations not used to fit a tree are referred to as the out-of-bag (OOB) observations. We predict the response for the *i*-*th* observation using each tree in which the *i*-*th* observation was OOB and average the predictions or take a majority vote. This is done for all of the observations in the original dataset, and the overall test error rate is then determined.
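The OOB idea can be sketched end-to-end in Python. This is a deliberately tiny 1-D version: the base learner is a one-split stump rather than a full tree, and the data are simulated, but the bootstrap/OOB bookkeeping is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_stump(X, y):
    """Tiny regression tree (a single split) used as the base learner."""
    best = None
    for s in np.unique(X):
        l, r = y[X < s], y[X >= s]
        if len(l) == 0 or len(r) == 0:
            continue
        rss = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if best is None or rss < best[0]:
            best = (rss, s, l.mean(), r.mean())
    _, s, lm, rm = best
    return lambda x: np.where(x < s, lm, rm)

def bagged_oob(X, y, B=50):
    n = len(X)
    oob_sum, oob_cnt = np.zeros(n), np.zeros(n)
    for _ in range(B):
        idx = rng.integers(0, n, n)             # bootstrap sample (with replacement)
        tree = fit_stump(X[idx], y[idx])
        oob = np.setdiff1d(np.arange(n), idx)   # ~1/3 of observations left out
        oob_sum[oob] += tree(X[oob])
        oob_cnt[oob] += 1
    preds = oob_sum / np.maximum(oob_cnt, 1)    # average each point's OOB predictions
    return np.mean((preds - y) ** 2)            # OOB estimate of the test MSE

X = np.concatenate([rng.normal(0, 1, 50), rng.normal(5, 1, 50)])
y = np.where(X < 2.5, 1.0, 6.0) + rng.normal(0, 0.3, 100)
print(bagged_oob(X, y))   # test-error estimate with no cross-validation
```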

Bagging improves prediction accuracy at the expense of interpretability. However, we can obtain an overall summary of the importance of each predictor using RSS or the Gini index. This is done by recording the total amount that the RSS or Gini index decreases due to splits in a given predictor and averaging over all of the trees. The larger the total value, the more significant the predictor.

Random forests provide an improvement over bagging by decorrelating the trees. The random forest method is similar to bagging, where a large number of trees are built on bootstrapped datasets. However, when building the trees, each time a split is considered, a random sample of *m* predictors is chosen as split candidates from the full set of *p* predictors, and the split can only use one of the *m* predictors. Additionally, a fresh set of m predictors is taken at each split. Usually, we use *m* = sqrt(*p*).

Why might only considering a subset of the predictors at each split be a good thing? Suppose that we have a dataset with one very strong predictor and some other moderately strong predictors. In the bagged trees, almost every tree will use the very strong predictor as the top split. All of the bagged trees will look very similar and be highly correlated. Averaging many highly correlated trees does not result in a large reduction to variance. Since random forests force each split to consider only a subset of the predictors, many splits will not even consider the strongest predictor. Basically, this decorrelates the trees and reduces variance.
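The decorrelation argument can be checked numerically. The sketch below samples a fresh set of *m* = sqrt(*p*) candidate predictors at each "split" and measures how often a given strong predictor (index 0 here, a made-up stand-in) is not even considered:

```python
import numpy as np

rng = np.random.default_rng(1)

def candidate_predictors(p, m):
    # A fresh random sample of m of the p predictors, drawn at every split
    return rng.choice(p, size=m, replace=False)

p, m = 9, 3                      # m = sqrt(p), the usual default
# Fraction of splits that never even consider predictor 0 (the "strong" one)
frac_without_strongest = np.mean([0 not in candidate_predictors(p, m)
                                  for _ in range(10000)])
print(frac_without_strongest)    # close to (p - m) / p = 2/3
```

On average, (*p* − *m*)/*p* of the splits ignore the strongest predictor entirely, which is what breaks the correlation between the bagged trees.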

Like bagging, boosting is a general approach that can be applied to many different statistical learning methods. Boosting involves building many trees, but each tree is grown sequentially. This means that each tree is grown using information from a previously grown tree. Additionally, boosting does not involve bootstrap sampling. Each tree is fit on a modified version of the original dataset. Boosting works as follows:

The boosting approach learns slowly. If we set the parameter *d* to be small, we fit small trees to the residuals and slowly improve the function in areas where it does not perform well. The shrinkage parameter *λ* slows the process down even further, allowing more and different shaped trees to attack the residuals.

In boosting, unlike in bagging, the construction of each tree depends strongly on the trees already grown. Additionally, boosting could potentially overfit the data if the number of trees B is too large. Therefore, we use cross-validation to select B. Typically, *λ* is between 0.01 and 0.001, and *d*=1 is usually sufficient.
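The slow-learning loop can be sketched directly: fit a small tree to the current residuals, add a shrunken copy of it to the model, and repeat. The base learner here is a depth-1 stump (*d* = 1) on simulated 1-D data, which keeps the sketch self-contained:

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_stump(X, y):
    """Depth-1 regression tree (d = 1), the base learner."""
    best = None
    for s in np.unique(X):
        l, r = y[X < s], y[X >= s]
        if len(l) == 0 or len(r) == 0:
            continue
        rss = ((l - l.mean())**2).sum() + ((r - r.mean())**2).sum()
        if best is None or rss < best[0]:
            best = (rss, s, l.mean(), r.mean())
    _, s, lm, rm = best
    return lambda x: np.where(x < s, lm, rm)

def boost(X, y, B=100, lam=0.1):
    """Fit trees sequentially to the residuals, each shrunk by lambda."""
    pred = np.zeros_like(y)
    trees = []
    for _ in range(B):
        resid = y - pred                 # each new tree attacks what is left over
        tree = fit_stump(X, resid)
        pred += lam * tree(X)            # learn slowly
        trees.append(tree)
    return lambda x: lam * sum(t(x) for t in trees)

X = rng.uniform(0, 10, 200)
y = np.sin(X) + rng.normal(0, 0.1, 200)
model = boost(X, y)
print(np.mean((model(X) - y) ** 2))      # training MSE shrinks as B grows
```

Because each stump depends on the residuals left by the previous ones, the trees cannot be fit in parallel the way bagged trees can.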

*Originally published at* https://www.bijenpatel.com *on August 8, 2020.*


```
library(ISLR)
library(MASS)
library(ggplot2)
library(gridExtra) # For side-by-side ggplots
library(e1071)
library(caret)
library(tree) # For decision trees
# We will work with the Carseats data
# It contains data on sales of carseats for 400 different store locations
head(Carseats)
# We will build a tree that predicts whether or not a location had "high" sales
# We define "high" as sales exceeding 8000 units
# Create a vector to indicate whether or not a location had "high" sales
# tree() requires a factor response, so encode High as a factor
High = as.factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
# Merge the vector to the entire Carseats data
Carseats = data.frame(Carseats, High)
# The tree function is used to fit a decision tree
Carseats_tree = tree(High ~ . -Sales, data=Carseats)
# Use the summary function to see the predictors used, # of terminal nodes, and training error
summary(Carseats_tree)
# Visualize the tree model
plot(Carseats_tree)
text(Carseats_tree, pretty=0) # pretty=0 shows category names instead of letters
# To estimate test error, we fit a tree to training data and predict on test data
set.seed(2)
train = sample(1:400, 200)
Carseats_train = Carseats[train,]
Carseats_test = Carseats[-train,]
Carseats_tree = tree(High ~ . -Sales, data=Carseats_train)
Carseats_tree_predictions = predict(Carseats_tree, Carseats_test, type="class")
# Use a confusion matrix to determine the accuracy of the tree model
confusionMatrix(Carseats_tree_predictions, Carseats_test$High, positive="Yes")
```

```
# Will pruning the tree lead to improved accuracy?
# The cv.tree function performs cross-validation to determine the optimal tree complexity
set.seed(3)
Carseats_tree_cv = cv.tree(Carseats_tree, FUN=prune.misclass)
# FUN = prune.misclass performs pruning through the misclassification error rate
# The object contains different terminal node values, their error rate, and cost complexity parameter
Carseats_tree_cv
# Create a dataframe of the values from the cv.tree function
Carseats_tree_cv_df = data.frame(Nodes=Carseats_tree_cv$size,
Error=Carseats_tree_cv$dev,
Alpha=Carseats_tree_cv$k)
# Plot the number of terminal nodes, and their corresponding errors and alpha parameters
Carseats_tree_cv_error = ggplot(Carseats_tree_cv_df, aes(x=Nodes, y=Error)) + geom_line() + geom_point()
Carseats_tree_cv_alpha = ggplot(Carseats_tree_cv_df, aes(x=Nodes, y=Alpha)) + geom_line() + geom_point()
# Show the plots side-by-side with the grid.arrange function from gridExtra package
grid.arrange(Carseats_tree_cv_error, Carseats_tree_cv_alpha, ncol=2)
# A tree with 9 terminal nodes results in the lowest error
# This also corresponds to alpha value of 1.75
# Finally, prune the tree with prune.misclass function and specify 9 terminal nodes
Carseats_tree_pruned = prune.misclass(Carseats_tree, best=9)
# Plot the pruned tree
plot(Carseats_tree_pruned)
text(Carseats_tree_pruned, pretty=0)
# Use the pruned tree to make predictions, and compare the accuracy to the non-pruned tree
Carseats_tree_pruned_predictions = predict(Carseats_tree_pruned, Carseats_test, type="class")
confusionMatrix(Carseats_tree_pruned_predictions, Carseats_test$High, positive="Yes")
# 77% accuracy for the pruned tree versus 71.5% for the non-pruned tree
# Pruning results in a better model
```

```
# We will work with the Boston data, which has data on median house values
head(Boston)
# We will build a tree that predicts median house values
# First, create training and test datasets
set.seed(1)
train = sample(1:nrow(Boston), nrow(Boston)/2)
Boston_train = Boston[train,]
Boston_test = Boston[-train,]
# Fit a tree to the training data
Boston_tree = tree(medv ~ ., Boston_train)
# See the predictors used, number of terminal nodes, and error
summary(Boston_tree)
# Plot the tree
plot(Boston_tree)
text(Boston_tree, pretty=0)
# Perform cross validation to determine optimal tree complexity
Boston_tree_cv = cv.tree(Boston_tree)
# Create a dataframe of the values from cross-validation
Boston_tree_cv_df = data.frame(Nodes=Boston_tree_cv$size, Error=Boston_tree_cv$dev, Alpha=Boston_tree_cv$k)
# Plot the number of terminal nodes, and their corresponding errors and alpha parameters
Boston_tree_cv_error = ggplot(Boston_tree_cv_df, aes(x=Nodes, y=Error)) + geom_line() + geom_point()
Boston_tree_cv_alpha = ggplot(Boston_tree_cv_df, aes(x=Nodes, y=Alpha)) + geom_line() + geom_point()
grid.arrange(Boston_tree_cv_error, Boston_tree_cv_alpha, ncol=2)
# Cross-validation indicates that a tree with 8 terminal nodes is best
# However, we could choose to use 7, as the error for 7 is essentially the same as for 8
# This will result in a simpler tree
Boston_pruned_tree = prune.tree(Boston_tree, best=7)
# Plot the final pruned tree
plot(Boston_pruned_tree)
text(Boston_pruned_tree, pretty=0)
# Use the pruned tree to make predictions on the test data, and determine the test MSE
Boston_tree_predictions = predict(Boston_pruned_tree, Boston_test)
mean((Boston_tree_predictions - Boston_test$medv)^2)
```

```
library(randomForest) # The randomForest package is used for bagging and random forest
# We continue working with the Boston data
# The randomForest function is used to perform both bagging and random forest
# Bagging is a special case of random forest, where all predictors are used
set.seed(1)
Boston_bag = randomForest(medv ~ ., data=Boston_train, mtry=13, importance=TRUE)
# Use the bagged tree to make predictions on the test data
Boston_bag_predictions = predict(Boston_bag, Boston_test)
# Determine the test MSE
mean((Boston_bag_predictions - Boston_test$medv)^2)
```

```
# We continue working with the Boston data
# For random forest, we simply specify a smaller number of predictors in mtry
# Typically, m=(p/3) is used for regression
# Typically, m=sqrt(p) is used for classification
set.seed(1)
Boston_rf = randomForest(medv ~ ., data=Boston_train, mtry=round(13/3), importance=TRUE)
# Use the bagged tree to make predictions on the test data
Boston_rf_predictions = predict(Boston_rf, Boston_test)
# Determine the test MSE and compare to the result from bagging
mean((Boston_rf_predictions - Boston_test$medv)^2)
# Lower test MSE than result from bagging
# The importance function can be used to see the importance of each variable
# The first column (%IncMSE) shows how much the error increases when the variable's values are randomly permuted
# The second column (IncNodePurity) shows the total decrease in node purity from splits on the variable
importance(Boston_rf)
# Quick plot of the data from the importance function
varImpPlot(Boston_rf)
# lstat and rm are the most important variables
```

```
library(gbm) # For boosting
# We continue working with the Boston data
# The gbm function is used to perform boosting
# For regression problems, the distribution is set to gaussian
# For classification problems, the distribution is set to bernoulli
# n.trees is used to specify the number of trees
# interaction.depth is used to limit the depth of each tree
set.seed(1)
Boston_boost = gbm(medv ~ ., data=Boston_train, distribution="gaussian", n.trees=5000, interaction.depth=4)
# The summary function shows the relative influence statistics for each variable
summary(Boston_boost)
# Plot the marginal effect of variables on the response after integrating out the other variables
plot(Boston_boost, i="lstat")
plot(Boston_boost, i="rm")
# Use the boosted model to predict on the test data
Boston_boost_predictions = predict(Boston_boost, Boston_test, n.trees=5000, interaction.depth=4)
# Determine the test MSE
mean((Boston_boost_predictions - Boston_test$medv)^2)
# Boosting results in a test MSE that is slightly better than random forest
```

This is a summary of chapter 7 of the *Introduction to Statistical Learning* textbook.

Linear models are advantageous when it comes to their interpretability. However, their capabilities are limited, especially in scenarios where the linear assumption is poor. Ridge, lasso, and principal components regression improve upon the least squares regression model by reducing the variance of the coefficient estimates. However, these models are still linear, and will perform poorly in nonlinear settings. We can move beyond linearity through methods such as polynomial regression, step functions, splines, local regression, and generalized additive models.

Polynomial regression is the simplest method of extending linear regression. It involves adding extra predictors, which are just the original predictors raised to some exponent. A polynomial regression model may look like the following:

A basic polynomial regression model was introduced in the linear regression section, where the relationship between automobile MPG and horsepower was modeled with quadratic regression.

For large degree (*d*) values, polynomial regression can produce extremely non-linear curves. However, it is unusual to use degrees greater than 3 or 4, because the resulting curve becomes too flexible and can take on strange shapes, especially near the boundaries of the predictor *X*.

In polynomial regression, the individual coefficient values are not very important. Instead, we look at the overall fit of the model to assess the relationship between the predictor and response.

Additionally, it is fairly easy to plot a 95% confidence interval for a polynomial regression model. Least squares returns variance estimates for each coefficient, as well as covariances between pairs of coefficients. The estimated pointwise standard error of the fit at some predictor value is the square root of the estimated variance of the fitted value at that point. Plotting the fitted polynomial curve plus and minus twice the pointwise standard error at each possible *X* value results in the approximate 95% confidence interval.
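A minimal Python sketch of that pointwise band, on simulated data: fit a degree-2 polynomial by least squares, recover the coefficient covariance matrix, and form fit ± 2·SE at each point.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 100)
y = 0.5 * x - 0.05 * x**2 + rng.normal(0, 0.5, 100)   # simulated quadratic data

# Degree-2 polynomial regression via least squares on the design matrix
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = np.sum(resid ** 2) / (len(y) - X.shape[1])
cov = sigma2 * np.linalg.inv(X.T @ X)    # coefficient variances and covariances

# Pointwise SE at x0 is sqrt(x0' Cov x0); fit +/- 2*SE gives the ~95% band
se = np.sqrt(np.einsum('ij,jk,ik->i', X, cov, X))
fit = X @ beta
lower, upper = fit - 2 * se, fit + 2 * se
print(beta)
```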

Polynomial regression imposes a global structure on the non-linear function of *X*. But what if we don’t want all of our *X* values to follow a global structure? Step functions can be used to break the range of *X* into bins. Then, a different constant is fit to each bin. The bins are created through simple cutpoints as follows:

*I* is an indicator function that returns a 1 if the condition is true, and 0 if it is false.

After creating the cutpoints, least squares regression is used to fit a linear model that uses the cutpoints as predictors.

Note that *C*₀(*X*) is excluded as a predictor because it is redundant with the intercept. Additionally, for a given value of *X*, only one of the *C* values can be non-zero.

An example of a step function with cutpoints at 10 and 15 is shown in the following graph:

The step function is also known as a piecewise constant regression model.
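A piecewise constant fit with cutpoints at 10 and 15 (matching the graph above) reduces to computing the mean response within each bin, which is equivalent to least squares on the indicator predictors. A sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 20, 300)
# Simulated step-shaped data with true levels 1, 3, and 2
y = np.where(x < 10, 1.0, np.where(x < 15, 3.0, 2.0)) + rng.normal(0, 0.2, 300)

cuts = [10, 15]                       # cutpoints c_1 and c_2
bins = np.digitize(x, cuts)           # bin index 0, 1, or 2 for each observation
# Piecewise-constant fit: a separate constant (the bin mean) per bin,
# equivalent to least squares on the indicators C_1(X), C_2(X)
means = np.array([y[bins == b].mean() for b in range(len(cuts) + 1)])

def step_fn(x_new):
    return means[np.digitize(x_new, cuts)]

print(means)    # roughly the true levels 1, 3, 2
```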

Polynomial and piecewise constant models are special cases of a basis function approach. The basis function approach utilizes a family of functions or transformations that can be applied to a predictor. The model takes on the following form:

The basis functions (*b*) are fixed and known prior to running the model. In polynomial and piecewise constant regression, the basis functions are as follows:

In the next few sections, we’ll discuss splines, which are very commonly used in the basis function approach.

Regression splines build upon polynomial regression and piecewise constant regression.

In polynomial regression, a single polynomial function is fit over the entire range of *X* values. Piecewise polynomial regression instead fits separate polynomial functions over different ranges of *X*. The points at which the functions change are called knots, and using more knots leads to a more flexible model. For example, a piecewise polynomial regression model with one knot takes on the following form:

An example of a piecewise polynomial regression model with one knot at *X* = 12 can be seen in the following graph:

As can be seen from the chart above, piecewise polynomial models are discontinuous. While discontinuity can sometimes be desired, we usually want a continuous model. To solve this, we can introduce constraints, which result in continuous and smooth models.

In order to produce a piecewise polynomial model that is continuous, we need to introduce a continuity constraint. In other words, the separate polynomial models must meet at the knots. Introducing the continuity constraint on the previous chart might result in something like the following:

However, the point at which the functions join looks a bit unnatural. This can be addressed by introducing another constraint that results in a smooth join. Specifically, this constraint requires that the derivatives up to degree (*d*-1) be continuous at each knot. For example, if we have a cubic polynomial (*d*=3), then the first and second derivatives must be continuous at the knots. A model that is continuous and smooth is known as a spline. Introducing the smoothness constraint on the previous chart might result in the following spline:
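One standard way to build such a spline is the truncated power basis: alongside *x*, *x*², and *x*³, add one term (*x* − *ξ*)³₊ per knot *ξ*. With this basis, continuity of the fit and of its first and second derivatives at each knot is automatic. A sketch with a single knot at *X* = 12 (matching the earlier example) on simulated data:

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power basis for a cubic spline: 1, x, x^2, x^3, plus one
    (x - knot)^3_+ column per knot."""
    cols = [x, x**2, x**3]
    cols += [np.maximum(0.0, x - k) ** 3 for k in knots]
    return np.column_stack([np.ones_like(x)] + cols)

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 20, 200))
y = np.sin(x / 3.0) + rng.normal(0, 0.2, 200)    # simulated smooth data

B = cubic_spline_basis(x, knots=[12.0])          # one knot at x = 12
beta, *_ = np.linalg.lstsq(B, y, rcond=None)     # ordinary least squares fit
fit = B @ beta
print(np.mean((fit - y) ** 2))
```

Fitting the spline is then just least squares on the basis columns, exactly the basis function approach described above.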

Splines have the potential to have high variance at the outer range of the predictor values, which causes wide confidence intervals at the outer boundaries. This can be addressed through a boundary constraint, which requires the spline function to be linear at the boundaries (before the first knot, and after the last knot). A spline with this additional constraint is called a natural spline.

There are many methods that can be used to choose the number of knots to use, and their locations.

One option is to place more knots in the regions where we feel that the function might vary the most, and to place fewer knots where it will be more stable.

The first option could work well, but it is more common to simply place knots at uniform percentiles of the data. For example, if we choose to use three knots, then they would be placed at the 25th, 50th, and 75th percentiles of the data. But how do we know how many knots should be used? One option is to simply try out a different number of knots and analyze the fitted curves. Another more objective option is to use cross-validation to choose the number of knots that results in the lowest test error.

Cross-validation can also be used to choose the knot locations themselves, in addition to the number of knots.

Regression splines often produce better results than polynomial regression models. This is because polynomial regression requires the use of a high-degree model to produce a very flexible fit. High-degree models usually lead to highly inaccurate predictions at certain *X* values. Splines produce flexible fits by introducing knots and separate low-degree functions, which ultimately results in better and stable predictions.

Smoothing splines are an alternative approach to fitting a smooth curve over a dataset. Smoothing splines find a function *g*(*X*) that minimizes the following:

The above criterion should look similar to the one that ridge regression and lasso regression aim to minimize. The first part is simply the residual sum of squares. The second part is a penalty term that encourages *g*(*X*) to be smooth: *g*⁽*m*⁾(*t*) is the *m*-*th* derivative of the function, which measures its roughness at some point, so the integral measures the total roughness of the entire function. The variable *m* is known as the penalty order, and takes on a value of *m*=1 for a linear smoothing spline, *m*=2 for a cubic smoothing spline, *m*=3 for a quintic (fifth-order) smoothing spline, and *m*=4 for a septic (seventh-order) smoothing spline. If the function is very wiggly at some point, then the *m*-*th* derivative will be large there. Finally, *λ* is a tuning parameter that ultimately controls the bias-variance trade-off of the smoothing spline: larger values of *λ* result in a smoother fit.
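The roughness-penalty idea can be demonstrated with a discrete analogue rather than an actual smoothing spline: on a grid, minimize ||*y* − *g*||² + *λ*||*D*²*g*||², where *D*² takes second differences (a Whittaker-style smoother). This is a simplified stand-in for the integral penalty, but it shows exactly how larger *λ* trades fidelity for smoothness:

```python
import numpy as np

def penalized_smoother(y, lam):
    """Minimize ||y - g||^2 + lam * ||D2 g||^2 over g on a grid;
    the closed-form solution is g = (I + lam * D2' D2)^{-1} y."""
    n = len(y)
    D2 = np.diff(np.eye(n), n=2, axis=0)      # second-difference operator
    return np.linalg.solve(np.eye(n) + lam * D2.T @ D2, y)

rng = np.random.default_rng(6)
x = np.linspace(0, 4 * np.pi, 200)
y = np.sin(x) + rng.normal(0, 0.3, 200)       # simulated noisy curve

rough = penalized_smoother(y, lam=0.1)
smooth = penalized_smoother(y, lam=100.0)

def roughness(g):
    # Total squared second difference: the discrete analogue of the integral
    return np.sum(np.diff(g, n=2) ** 2)

print(roughness(rough) > roughness(smooth))   # larger lambda -> smoother fit
```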

Smoothing splines have some special properties:

- Smoothing splines are piecewise cubic polynomials with knots at each of the unique *X* values in the dataset.
- Smoothing splines are linear at the boundaries (before the first knot, and after the last knot).

The combination of the above two properties means that smoothing splines are natural cubic splines. However, the result is not the same natural cubic spline you would get from fitting a basis of piecewise cubic polynomials with knots at each unique *X* value. Instead, it is a shrunken version of that natural cubic spline, because of the presence of the penalty term.

So, how is the *λ* parameter chosen? As it turns out, the most efficient method is through leave-one-out cross-validation (LOOCV). In fact, LOOCV essentially has the same computational cost as a single fit of a smoothing spline.
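In base R, `smooth.spline()` can select *λ* by LOOCV directly via `cv = TRUE` (simulated data for illustration):

```
set.seed(1)
x = runif(200, 0, 10)
y = sin(x) + rnorm(200, sd = 0.3)

# cv = TRUE selects lambda by leave-one-out cross-validation
# (the default, cv = FALSE, uses generalized cross-validation instead)
fit = smooth.spline(x, y, cv = TRUE)
fit$lambda  # the chosen tuning parameter
fit$df      # the corresponding effective degrees of freedom
```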

Local regression is yet another approach to fitting flexible non-linear functions. It involves determining a fit at some target point *X*₀ using only the nearby training observations. The process of fitting a local regression model is as follows:

- Gather the fraction *s* = *k*/*n* of training points whose *x*ᵢ values are closest to the target point *X*₀.
- Assign a weight *K*ᵢ₀ = *K*(*X*ᵢ, *X*₀) to each point in this neighborhood, so that the point furthest from *X*₀ has a weight of zero and the closest point has the highest weight. All points other than these *k* nearest neighbors get a weight of zero.
- Fit a weighted least squares regression of the *y*ᵢ on the *x*ᵢ using the *K*ᵢ₀ weights, by finding the *β*₀ and *β*₁ that minimize the following:
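In standard notation, the weighted criterion is:

```
\sum_{i=1}^{n} K_{i0} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
```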

There are many important choices to be made when performing local regression, such as how to define the weighting function *K*, and whether to fit a linear, quadratic, or other type of regression model. However, the most important choice is the choice of the span (*s*) in step one. The span controls the flexibility of the non-linear fit. The smaller the value of the span, the more local and wiggly it will be. As in most cases, cross-validation is a very useful method for determining the value of the span.

Local regression can be generalized in many different ways. For example, in a setting where we have multiple predictors, one useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another. This is known as a varying coefficient model. Additionally, local regression generalizes naturally when we want to fit models that are local in multiple predictors instead of just one. However, local regression can perform poorly if we’re trying to fit models that are local in more than 3 or 4 predictors because there will generally be very few training observations close to the target point *X*₀. *K*-Nearest Neighbors regression suffers from this same problem.

What if we wanted to flexibly predict *Y* on the basis of several different *X* predictors? Generalized additive models (GAMs) allow for non-linear functions of each predictor, while maintaining additivity. Additivity simply refers to the ability to add together each predictor’s contribution. GAMs can be used in either the regression or classification setting.

Recall that a multiple linear regression takes on the following form:
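Written out:

```
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon
```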

To allow for non-linear relationships between each predictor and the response, we can replace each linear component *β*ⱼ*X*ⱼ with a non-linear function *f*ⱼ(*X*ⱼ). This results in a GAM, which looks as follows:
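That is:

```
Y = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p) + \epsilon
```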

So far, we’ve discussed different methods for fitting functions to a single predictor, such as step functions, basis functions, and splines. The beauty of GAMs is that these methods can be used as building blocks for fitting an additive model. For example, assume that we wanted to predict salary and had year, age, and education as predictors. The GAM would look as follows:

We could fit natural splines for the year and age predictors, and a step function for the education predictor. The plots of the relationship between each of these predictors and the response may look like the following:

Fitting a GAM with smoothing splines is not simple because least squares cannot be used. However, there is an approach known as backfitting, which can be used to fit GAMs with smoothing splines. Backfitting fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed.
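The backfitting loop can be sketched in a few lines of base R. This is a toy illustration on simulated data of the idea just described, not the actual algorithm used inside the `gam` package:

```
set.seed(1)
n = 300
x1 = runif(n)
x2 = runif(n)
y = sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.2)

beta0 = mean(y)
f1 = rep(0, n)
f2 = rep(0, n)

for (iter in 1:20) {
  # Update f1 holding f2 fixed: smooth the partial residual against x1
  f1 = predict(smooth.spline(x1, y - beta0 - f2, df = 5), x1)$y
  f1 = f1 - mean(f1)  # center each function for identifiability
  # Update f2 holding f1 fixed: smooth the partial residual against x2
  f2 = predict(smooth.spline(x2, y - beta0 - f1, df = 5), x2)$y
  f2 = f2 - mean(f2)
}

fitted_y = beta0 + f1 + f2
```

Each pass smooths one function against the residual left over by the others, and the fits typically stabilize after a handful of iterations.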

The advantages of GAMs are the following:

- GAMs allow us to fit non-linear functions for each predictor. This allows us to easily model non-linear relationships without having to manually try different transformations on each predictor.
- The non-linear fits can potentially result in more accurate predictions.
- Because GAMs are additive, we can still examine the effect of each predictor on the response, while holding all of the other predictors fixed. This means that GAMs maintain interpretability.
- The smoothness of a function for a predictor can be summarized via degrees of freedom.

The main disadvantage of GAMs is that:

- GAMs are restricted to be additive. This means that variable interactions could be missed. However, interaction terms can be included by adding interaction predictors or interaction functions.

*Originally published at* https://www.bijenpatel.com *on August 7, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(MASS)
library(ISLR)
library(ggplot2)
# Working with the Wage dataset
head(Wage)
# Fit a fourth-degree polynomial model to predict wage with age
# Note that this fits an orthogonal polynomial
fit = lm(wage~poly(age, 4), data=Wage)
coef(summary(fit))
# To fit a regular polynomial, we can use the raw=TRUE argument
# This doesn't affect the model in a meaningful way
fit2 = lm(wage~poly(age, 4, raw=TRUE), data=Wage)
# Create a grid of values for age, which we use to make predictions with
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
# Get predictions and standard error bands
preds = predict(fit, newdata=list(age=age.grid), se=TRUE)
preds = data.frame(preds)
preds$age = age.grid
se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
se.bands = data.frame(se.bands)
se.bands$age = age.grid
# Plot the data, the model fit, and the standard error bands with ggplot
ggplot(data=Wage, aes(x=age, y=wage)) +
geom_point() +
labs(title="Degree-4 Polynomial", x="Age", y="Wage") +
geom_line(data=preds, aes(x=age, y=fit), colour="blue", size=1.5) +
geom_line(data=se.bands, aes(x=age, y=X1), colour="orange", size=0.75) +
geom_line(data=se.bands, aes(x=age, y=X2), colour="orange", size=0.75)
# Perform ANOVA to compare models with different degrees
# The more complex model is compared to the simpler model
fit.1 = lm(wage~age, data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
# Quadratic model is better than linear model (p-value: 0)
# Cubic model is better than quadratic model (p-value: 0)
# Quartic model is better than cubic model (p-value: 0.05)
# Quintic polynomial is NOT better than quartic model (p-value: 0.37)
# Note that we also could have used cross-validation to choose a model
# Fit a polynomial logistic regression model to predict if someone makes more than $100,000
logistic.fit = glm(I(wage>100)~poly(age, 4), data=Wage, family=binomial)
# Make predictions for all of the ages
logistic.preds = predict(logistic.fit, newdata=list(age=age.grid), se=T, type="response")
logistic.preds = data.frame(logistic.preds)
logistic.preds$age = age.grid
# Fit a simple step function using the cut function
# Cut automatically chooses cutpoints, but cutpoints can be set manually too
step.fit = lm(wage~cut(age, 4), data=Wage)
coef(summary(step.fit))
```

```
library(splines)
# We continue working with the Wage dataset
# Use the basis function bs() to fit a piecewise polynomial spline
# We use pre-specified knots at ages 25, 40, 65
# By default, a cubic spline is produced
fit = lm(wage~bs(age, knots=c(25, 40, 65)), data=Wage)
# Get predictions
preds = predict(fit, newdata=list(age=age.grid), se=T)
preds = data.frame(preds)
preds$age = age.grid
# Get standard error bands
se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
se.bands = data.frame(se.bands)
se.bands$age = age.grid
# Plot the data, the model fit, and the standard error bands with ggplot
ggplot(data=Wage, aes(x=age, y=wage)) +
geom_point() +
labs(title="Piecewise Polynomial Spline", x="Age", y="Wage") +
geom_line(data=preds, aes(x=age, y=fit), colour="blue", size=1.5) +
geom_line(data=se.bands, aes(x=age, y=X1), colour="orange", size=0.75) +
geom_line(data=se.bands, aes(x=age, y=X2), colour="orange", size=0.75)
# We could also use the df option in the bs() function to use knots at uniform quantiles
# This will set the knots at the 25th, 50th, and 75th percentiles of the age data
fit = lm(wage~bs(age, df=6), data=Wage)
# We could fit a natural spline instead of using basis functions
fit = lm(wage~ns(age, df=4), data=Wage)
preds = predict(fit, newdata=list(age=age.grid), se=T)
preds = data.frame(preds)
preds$age = age.grid
ggplot(data=Wage, aes(x=age, y=wage)) +
geom_point() +
labs(title="Natural Spline", x="Age", y="Wage") +
geom_line(data=preds, aes(x=age, y=fit), colour="blue", size=1.5)
# We can also use a smoothing spline
fit = smooth.spline(Wage$age, Wage$wage, df=6)
# Lastly, we can use local regression
# We will use a span value of 0.5
fit = loess(wage~age, span=0.5, data=Wage)
```

```
library(gam)
# We continue working with the Wage dataset
# Fit a GAM
# Use a smoothing spline for the year and age variables
# Year spline with 4 df, and age with 5 df
gam.m3 = gam(wage~s(year,4) + s(age, 5) + education, data=Wage)
# Fit a GAM model without the year variable
gam.m1 = gam(wage~s(age, 5) + education, data=Wage)
# Fit a GAM model with a linear year variable instead of a spline
gam.m2 = gam(wage~year + s(age, 5) + education, data=Wage)
# Perform ANOVA tests to compare the different GAM models
anova(gam.m1, gam.m2, gam.m3, test="F")
# The second GAM model is better than the first
# The third GAM model is not better than the second
# Make prediction with the GAM model
preds = predict(gam.m2, newdata=Wage)
# We can also use local regression in GAMs
# Use local regression for the age variable, with a span of 0.7
gam.lo = gam(wage~s(year, df=4) + lo(age, span=0.7) + education, data=Wage)
# We could also use local regression to create an interaction term for the GAM
gam.lo.i = gam(wage~lo(year, age, span=0.5) + education, data=Wage)
# Lastly, we can also perform logistic regression with GAM
gam.lr = gam(I(wage>100)~year + s(age, df=5) + education, family=binomial, data=Wage)
```

This is a summary of chapter 6 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

Linear regression can be improved by replacing plain least squares with some alternative fitting procedures. These alternative fitting procedures may yield better prediction accuracy and model interpretability.

There are three classes of alternative methods to least squares: subset selection, shrinkage, and dimension reduction.

Subset selection involves identifying a subset of the p predictors that are believed to be related to the response, and fitting a least squares model to the reduced set of predictors.

Shrinkage involves fitting a model involving all p predictors. However, the estimated coefficients are shrunken towards zero relative to the least squares estimates. Shrinkage reduces variance, and may perform variable selection.

Dimension reduction involves projecting all of the p predictors into an *M*-dimensional subspace where *M*<*p*. M different “linear combinations” of all of the *p* predictors are computed, and are used as predictors to fit a linear regression model through least squares.

When the number of observations n is not much greater than the number of predictors *p*, least squares regression will tend to have higher variance. In the case that the number of predictors *p* is greater than the number of observations *n*, least squares cannot be used at all.

By shrinking or constraining the estimated coefficients through alternative methods, variance can be substantially reduced at a negligible increase to bias.

Additionally, the least squares approach is highly unlikely to yield any coefficient estimates that are exactly zero. There are alternative approaches that automatically perform feature selection for excluding irrelevant variables from a linear regression model, thus resulting in a model that is easier to interpret.

There are two main types of subset selection methods: best subset selection and stepwise model selection.

Best subset selection involves fitting a separate least squares regression model for each possible combination of the *p* predictors. For example, assume we had a credit balance dataset that looked like the following:

The best subset selection process begins by fitting all models that contain only one predictor:

Next, all possible two-predictor models are fit:

The process is continued for all possible three, four, and five predictor models.

After all possible models are fit, the best one, two, three, four, and five predictor models are chosen based on some criteria, such as the largest *R*² value. For the credit balance dataset, these models may be as follows:

Lastly, the overall best model is chosen from the remaining models. The best model is chosen through cross-validation or some other method that chooses the model with the lowest measure of test error. In this final step, a model cannot be chosen based on *R*², because *R*² always increases when more predictors are added. The plot of the K-Fold CV test errors for the five models looks as follows:

The model with four predictors has the lowest test error, and thus would be chosen as the best overall model.
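The search itself can be sketched in base R on simulated data. In practice the `regsubsets()` function from the `leaps` package does this efficiently; the brute-force version below is only meant to make the procedure concrete:

```
set.seed(1)
n = 100
p = 5
X = matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("X", 1:p)))
y = 2 * X[, 1] - 3 * X[, 3] + rnorm(n)
dat = data.frame(y, X)

# For each model size k, fit all choose(p, k) models and keep the lowest RSS
best_rss = sapply(1:p, function(k) {
  combos = combn(paste0("X", 1:p), k, simplify = FALSE)
  min(sapply(combos, function(vars) {
    deviance(lm(reformulate(vars, "y"), data = dat))  # deviance of lm = RSS
  }))
})

# RSS (and hence R^2) always improves with model size, so the final choice
# among the best model of each size must use CV, Cp, BIC, or adjusted R^2
best_rss
```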

There are three different types of stepwise selection: forward, backward, and hybrid. Compared to best subset selection, stepwise selection is computationally more feasible. Additionally, it helps avoid the issue of overfitting and high variance.

Forward stepwise selection begins with a model containing no predictors. Predictors are added to the model one at a time, based on the predictor that gives the greatest additional improvement to the fit of the model.

After using forward stepwise selection to come up with the best one, two, three, four, and five predictor models, cross-validation is used to choose the best model.

The disadvantage of forward stepwise selection is that it is not guaranteed to find the best possible model, since the final model depends on the first predictor added to the null model. The credit balance dataset is a good example of this phenomenon. Best subset selection and forward stepwise selection both choose the same one-, two-, and three-predictor models. However, they differ in the fourth model.

Best subset selection replaces the “Rating” predictor with the “Cards” predictor.

Since forward stepwise selection begins by choosing “Rating” as the first predictor, the predictor is kept:

Backward stepwise selection begins with a model that contains all of the predictors. Predictors are removed from the model one at a time, based on the predictor that is least useful towards the fit of the model.

After using backward stepwise selection to come up with the best one, two, three, and four predictor models, cross-validation is used to choose the best model.

Backward stepwise selection has the same disadvantage as forward stepwise selection in that it is not guaranteed to find the best possible model.

Hybrid stepwise selection is a combination of forward and backward selection. We begin with a null model that contains no predictors. Then, variables are added one by one, exactly as done in forward selection. However, if at any point the *p*-value for some variable rises above a chosen threshold, then it is removed from the model, as done in backward selection.

In order to select the best model with respect to the test error, the test error needs to be estimated through one of two methods:

- An indirect estimate through some kind of mathematical adjustment to the training error, which accounts for bias due to overfitting.
- A direct estimate through a method such as cross-validation.

There are several methods of indirectly estimating the test error: *Cp*, *AIC*, *BIC*, and adjusted-*R*².

*Cp*

*Cp* is an estimate of test error that adds a penalty to the training error, based on the number of predictors and the estimate of the variance of error. The model with the lowest *Cp* is chosen as the best model.

*AIC*

Similar to *Cp*, *AIC* is another estimate of test error that adds a penalty to the training error, based on the number of predictors and the estimate of the variance of error. However, it is defined for a larger class of models fit by maximum likelihood. The model with the lowest *AIC* is chosen as the best model.

*BIC*

*BIC* is an estimate of test error that is similar to *AIC*, but places a heavier penalty on models with a large number of predictors. Therefore, *BIC* tends to select models with fewer predictors. The model with the lowest *BIC* is chosen as the best model.

**Adjusted-R²**

Adjusted-*R*² is a measure that penalizes the *R*² value for the inclusion of unnecessary predictors. The model with the highest adjusted-*R*² is chosen as the best model.

The model with the absolute lowest test error doesn’t always have to be chosen. For example, what if we had a case where the models with anywhere from 3 to 10 predictors all had similar test errors?

A model could be chosen based on the one-standard-error rule. This rule involves selecting the model with the least number of predictors for which the estimated test error is within one standard error of the model with the absolute minimum error. This is done for simplicity and ease of interpretability.
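As a concrete sketch, with hypothetical cross-validation results (the numbers below are made up for illustration; `cv.glmnet` in the `glmnet` package reports the analogous choice as `lambda.1se`):

```
# Hypothetical CV errors and standard errors for models with 1-7 predictors
cv_error = c(105, 98, 92, 90.5, 90.2, 90.0, 90.3)
cv_se    = c(3.0, 2.8, 2.5, 2.4, 2.4, 2.3, 2.4)

best = which.min(cv_error)                 # model with the minimum CV error
threshold = cv_error[best] + cv_se[best]   # one standard error above the minimum
chosen = min(which(cv_error <= threshold)) # smallest model within one SE
```

Here the minimum error is at six predictors, but the three-predictor model is within one standard error of it, so the simpler three-predictor model would be chosen.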

Shrinkage methods involve fitting a model with all of the available predictors, and shrinking the coefficients towards zero. There are two main shrinkage methods:

- Ridge Regression
- Lasso Regression

Recall that least squares regression minimizes RSS to estimate coefficients. The coefficients are unbiased, meaning that least squares doesn’t take variable significance into consideration when determining the coefficient values.

Instead, ridge regression minimizes the following:
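In standard notation, the ridge criterion is:

```
\sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
  + \lambda \sum_{j=1}^{p} \beta_j^2
  = \mathrm{RSS} + \lambda \sum_{j=1}^{p} \beta_j^2
```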

The second term is called a shrinkage penalty, and is the term that shrinks the coefficient estimates towards zero. *λ* is a tuning parameter that controls the relative impact of the penalty term on the regression model. The coefficient estimates that come from ridge regression are biased because variable significance is taken into consideration.

Performing ridge regression with different values of *λ* on the credit balance dataset would result in the following sets of standardized coefficients:

In short, different values of *λ* will produce different sets of coefficient estimates. Therefore, choosing the proper *λ* value is critical, and is usually chosen through cross-validation.

Additionally, in ridge regression, coefficient estimates significantly depend on the scaling of predictors. Therefore, it is best practice to perform ridge regression after standardizing the predictors, so that they are on the same scale.

So, when is it appropriate to use ridge regression over least squares regression?

The advantage of ridge regression is rooted in the bias-variance tradeoff. As *λ* increases, the flexibility of the ridge regression model decreases, resulting in a decrease in variance, but an increase in bias.

Therefore, ridge regression should be used when least squares regression will have high variance. This happens when the number of predictors is almost as large as the number of observations.

If we’re unsure whether or not ridge regression would result in a better predictive model than least squares, we could simply run both methods and compare the test errors to each other. Unless ridge regression provides a significant advantage, least squares should be used for model simplicity and interpretability.

The disadvantage of ridge regression is that all predictors are included in the final model. Even though the penalty term shrinks the coefficient estimates towards zero, it does not set them to exactly zero.

This might not matter in terms of prediction accuracy, but it creates a challenge when it comes to model interpretation because of the large number of predictors.

Lasso regression is an alternative to ridge regression, which overcomes this disadvantage. Lasso regression minimizes the following:
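In standard notation, the lasso criterion replaces the squared penalty with an absolute-value penalty:

```
\mathrm{RSS} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```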

Similar to ridge regression, lasso regression will shrink coefficient estimates towards zero. However, the penalty is a bit different in that it forces some of the coefficient estimates to be exactly zero when the tuning parameter *λ* is large enough. Again, choosing the appropriate *λ* value is critical.

This effectively means that lasso regression performs variable selection, and makes it easier to interpret the final model.

As in ridge regression, it is best practice to perform lasso regression after standardizing the predictors.

In general, lasso regression is expected to perform better than ridge regression when the response *Y* is expected to be a function of only a few of the predictors.

In general, ridge regression is expected to perform better than lasso regression when the response is expected to be a function of a large number of predictors.

Cross-validation should be used to compare both methods and choose the best model.

As mentioned previously, choosing the proper value for the tuning parameter is crucial for coming up with the best model.

Cross-validation is a simple method of choosing the appropriate *λ* value. First, create a grid of different *λ* values, and determine the cross-validation test error for each value. Finally, choose the value that resulted in the lowest error.

When we have a dataset with a large number of predictors, dimension reduction methods can be used to summarize the dataset with a smaller number of representative predictors (dimensions) that collectively explain most of the variability in the data. Each of the dimensions is some linear combination of all of the predictors. For example, if we had a dataset of 5 predictors, the first dimension *Z*₁ would be as follows:
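In symbols:

```
Z_1 = \phi_{11} X_1 + \phi_{21} X_2 + \phi_{31} X_3 + \phi_{41} X_4 + \phi_{51} X_5
```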

And a final regression model might look as follows:
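For *M* dimensions:

```
Y = \theta_0 + \theta_1 Z_1 + \theta_2 Z_2 + \cdots + \theta_M Z_M + \epsilon
```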

Note that we will always have fewer dimensions than the number of predictors.

The *ϕ* values for the dimensions are known as loading values. The values are subject to the constraint that the squares of the *ϕ* values in a dimension must sum to one. For our above example, this means that:
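```
\phi_{11}^2 + \phi_{21}^2 + \phi_{31}^2 + \phi_{41}^2 + \phi_{51}^2 = 1
```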

The *ϕ* values in a dimension make up a “loading vector” that defines a “direction” to explain the predictors. But how exactly do we come up with the *ϕ* values? They are determined through either of the two most common dimension reduction methods:

- Principal Components Analysis
- Partial Least Squares

Principal components analysis is the most popular dimension reduction method. The method involves determining different principal component directions of the data.

The first principal component direction *Z*₁ defines the direction along which the data varies the most. In other words, it is a linear combination of all of the predictors, such that it explains most of the variance in the predictors.

The second principal component direction *Z*₂ defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with the first principal component, *Z*₁.

The third principal component direction *Z*₃ defines another direction along which the data varies the most, but is subject to the constraint that it must be uncorrelated with both of the previous principal components, *Z*₁ and *Z*₂.

And so on and so forth for additional principal component directions.

Dimension reduction is best explained with an example. Assume that we have a dataset of different baseball players, which consists of their statistics in 1986, their years in the league, and their salaries in the following year (1987).

Our goal is to perform principal components regression to come up with a model that predicts salaries.

First, we need to perform dimension reduction by transforming our 7 different predictors into a smaller number of principal components to use for regression.

It is important to note that prior to performing principal components analysis, each predictor should be standardized to ensure that all of the predictors are on the same scale. The absence of standardization will cause the predictors with high variance to play a larger role in the final principal components obtained.
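With base R's `prcomp()`, standardization is a single argument; a quick illustration on the built-in `USArrests` data (rather than the baseball data discussed here):

```
# scale. = TRUE standardizes each predictor before the components are computed
pr = prcomp(USArrests, scale. = TRUE)

pr$rotation   # the loading vectors (phi values), one column per component
summary(pr)   # proportion of variance explained by each component

# Each loading vector has squared loadings summing to one
colSums(pr$rotation^2)
```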

Performing principal components analysis would result in the following *ϕ* values for the first three principal components:

A plot of the loading values of the first two principal components would look as follows:

How do we interpret these principal components?

If we take a look at the first principal component, we can see that there is approximately an equal weight placed on each of the six baseball statistic predictors, and much less weight placed on the years that a player has been in the league. This means that the first principal component roughly corresponds to a player’s level of production.

On the other hand, in the second principal component, we can see that there is a large weight placed on the number of years that a player has been in the league, and much less weight placed on the baseball statistics. This means that the second principal component roughly corresponds to how long a player has been in the league.

In the third principal component, we can see that there is more weight placed on three specific baseball statistics: home runs, RBIs, and walks. This means that the third principal component roughly corresponds to a player’s batting ability.

Performing principal components analysis also tells us the percent of variation in the data that is explained by each of the components. The first principal component from the baseball data explains 67% of the variation in the predictors. The second principal component explains 15%. The third principal component explains 9%. Therefore, together, these three principal components explain 91% of the variation in the data.

This helps explain the key idea of principal components analysis, which is that a small number of principal components are sufficient to explain most of the variability in the data. Through principal components analysis, we’ve reduced the dimensions of our dataset from seven to three.

The number of principal components to use in regression can be chosen in one of two ways. The first method involves simply using the number of components that explain a large amount of the variation in the data. For example, in the baseball data, the first three principal components explain 91% of variation in the data, so using just the first three is a valid option. The second method involves choosing the number of principal components that results in the regression model with the lowest test error. Typically, both methods should result in the same or similar final models with test errors that do not greatly differ from one another.

Remember that in principal components analysis, the response variable is not used to determine the principal components. Therefore, in principal components regression, we are making the assumption that the directions in which the predictors show the most variation are also the directions that are associated with the response variable.

The assumption is not guaranteed to be true, but it often turns out to be a reasonable assumption to make. When the assumption holds, principal components regression will result in a better model than least squares regression due to mitigation of overfitting.

In general, principal components regression will perform well when the first few principal components are sufficient to capture most of the variation in the predictors, as well as the relationship with the response variable.

However, principal components regression is not a feature selection method because each of the principal components is a linear combination of all of original predictors in the dataset. If performing feature selection is important, then another method such as stepwise selection or lasso regression should be used.

In principal components regression, the directions that best represent the predictors are identified in an unsupervised way since the response variable is not used to help determine the directions. Therefore, there is no guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

Partial least squares regression is a supervised alternative to principal components regression. In other words, partial least squares regression attempts to find directions that help explain both the response and the predictors.

Partial least squares works by first standardizing the predictors and response, and then determining the first direction by setting each *ϕ* value equal to the coefficients from simple linear regression of the response onto each of the predictors. In doing so, partial least squares places the highest weight on the variables most strongly related to the response.

To identify the second direction, each of the original variables is adjusted by regressing each variable onto the first direction and taking the residuals. These residuals represent the remaining information that isn’t explained by the first direction. The second direction is then determined by using the new residual data, in the same method that the first direction was determined based on the original data. This iterative approach is repeated to find multiple directions, which are then used to fit a linear model to predict the response.
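These two steps can be sketched directly in base R on simulated, standardized data. This is an illustration of the idea only; the `plsr()` function in the `pls` package implements the full algorithm:

```
set.seed(1)
n = 100
p = 4
X = scale(matrix(rnorm(n * p), n, p))
y = as.vector(scale(X[, 1] + 0.5 * X[, 2] + rnorm(n)))

# First direction: each phi is the slope from a simple regression of y on X_j
phi1 = apply(X, 2, function(xj) coef(lm(y ~ xj))[2])
z1 = X %*% phi1

# Residualize each predictor on z1, then repeat to get the second direction
X2 = apply(X, 2, function(xj) resid(lm(xj ~ z1)))
phi2 = apply(X2, 2, function(xj) coef(lm(y ~ xj))[2])
z2 = X2 %*% phi2

# Final linear model on the PLS directions
fit = lm(y ~ z1 + z2)
```

By construction, every column of `X2` is orthogonal to `z1`, so the successive directions are uncorrelated.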

The number of directions to use in partial least squares regression should be determined through cross-validation.

However, partial least squares regression usually does not perform better than ridge regression or principal components regression. This is because supervised dimension reduction can reduce bias, but also has the potential to increase variance.

Today, it is common to be able to collect hundreds or even thousands of predictors for our data. The high dimensional setting refers to situations where the number of predictors exceeds the number of observations we have available. In the high dimensional setting, least squares cannot be used because it will result in coefficient estimates that are a perfect fit to the data, such that the residuals are zero. This is where subset selection, ridge regression, lasso, principal components regression, and partial least squares should be used instead.

However, the interpretation of model results is a bit different. For example, assume that we are trying to predict blood pressure from hundreds of predictors. A forward stepwise selection model might indicate that 20 of the predictors lead to a good predictive model. It would actually be improper to conclude that these 20 predictors are the best to predict blood pressure. Since we have hundreds of predictors, a different dataset might actually result in a totally different predictive model. Therefore, we must indicate that we have identified one of the many possible models, and it must be further validated on independent datasets.

Additionally, SSE, *p*-values, *R*², and other traditional measures of model fit should never be used in the high dimensional setting. Instead, report the results of the model on an independent test dataset, or cross-validation errors.

*Originally published at* https://www.bijenpatel.com *on August 6, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(ISLR)
library(MASS)
library(ggplot2)
library(dplyr)
library(glmnet) # For ridge regression and lasso models
# We will work with the Hitters dataset
head(Hitters)
# First, we create a dataframe of only the predictors
# The model.matrix function will automatically transform qualitative variables into dummy variables
Hitters_predictors = model.matrix(Salary ~., data=Hitters)[,-1]
# Next, we create a vector of only the responses
Hitters_responses = Hitters$Salary
# Next, we further split the datasets into training and test subsets
set.seed(1)
train = sample(1:nrow(Hitters), nrow(Hitters)/2)
Hitters_train_predictors = Hitters_predictors[train,]
Hitters_test_predictors = Hitters_predictors[-train,]
Hitters_train_responses = Hitters_responses[train]
Hitters_test_responses = Hitters_responses[-train]
# Next, we use cv.glmnet function to perform ridge regression with 10-fold cross-validation
# alpha = 0 specifies ridge regression
set.seed(1)
Hitters_ridge_cv = cv.glmnet(Hitters_train_predictors, Hitters_train_responses, alpha=0, nfolds=10)
# The lambda.min component of the object contains the best lambda value to use
lambda_min = Hitters_ridge_cv$lambda.min
# Finally, we fit a ridge regression model to the training data with the minimum lambda value
# And then determine the test error
Hitters_ridge = glmnet(Hitters_train_predictors, Hitters_train_responses, alpha=0, lambda=lambda_min)
Hitters_ridge_predictions = predict(Hitters_ridge, s=lambda_min, newx=Hitters_test_predictors)
mean((Hitters_ridge_predictions - Hitters_test_responses)^2)
# See the coefficients of the fitted ridge regression model
coef(Hitters_ridge)
```

```
# We continue working with the Hitters data
# To fit a lasso model instead of a ridge regression model, we use alpha = 1 instead of alpha = 0
# The same methodology is used to determine a lambda value and the test MSE
set.seed(1)
# Perform lasso with 10-fold cross-validation
Hitters_lasso_cv = cv.glmnet(Hitters_train_predictors, Hitters_train_responses, alpha=1, nfolds=10)
lambda_min_lasso = Hitters_lasso_cv$lambda.min
# Fit the lasso model with the minimum lambda value
# And then determine the test error
Hitters_lasso = glmnet(Hitters_train_predictors, Hitters_train_responses, alpha=1, lambda=lambda_min_lasso)
Hitters_lasso_predictions = predict(Hitters_lasso, s=lambda_min_lasso, newx=Hitters_test_predictors)
mean((Hitters_lasso_predictions - Hitters_test_responses)^2)
# See the coefficients of the fitted lasso model
coef(Hitters_lasso)
```

```
library(pls) # The pls library is used for PCR
# We continue working with the Hitters data
# For PCR, we create full training and test datasets (predictors/responses are not separated)
Hitters_train = Hitters[train,]
Hitters_test = Hitters[-train,]
# The pcr function is used to perform PCR
# We set scale=TRUE to standardize the variables
# We set the validation type to cross-validation (the function performs 10-Fold CV)
set.seed(1)
Hitters_pcr = pcr(Salary ~., data=Hitters_train, scale=TRUE, validation="CV")
# The summary of the model shows the CV errors for models with different # of principal components
# It also shows the percent of variance explained (PVE) for models with different # of components
summary(Hitters_pcr)
# validationplot provides a quick visual of the # of principal components that result in lowest MSE
validationplot(Hitters_pcr, val.type="MSEP")
# Use the PCR model with 7 principal components to make predictions on the test data
Hitters_pcr_predictions = predict(Hitters_pcr, Hitters_test, ncomp=7)
# Determine the test MSE
mean((Hitters_pcr_predictions - Hitters_test$Salary)^2)
```

```
library(pls) # The pls library is used for PLS regression
# We continue working with the Hitters data
# The plsr function is used to perform PLS regression
set.seed(1)
Hitters_pls = plsr(Salary ~., data=Hitters_train, scale=TRUE, validation="CV")
# The summary output of PLS has a very similar format to the output of PCR
summary(Hitters_pls)
# validationplot provides a quick visual of the # of PLS components that result in lowest MSE
validationplot(Hitters_pls, val.type="MSEP")
# Use the PLS model with 2 PLS components to make predictions on the test data
Hitters_pls_predictions = predict(Hitters_pls, Hitters_test, ncomp=2)
# Determine the test MSE
mean((Hitters_pls_predictions - Hitters_test$Salary)^2)
```

This is a summary of chapter 5 of the *Introduction to Statistical Learning* textbook.

Resampling methods involve repeatedly drawing samples from a training dataset and refitting a statistical model on each of the samples in order to obtain additional information about the fitted model.

For example, to estimate the variability of a linear regression model, we can repeatedly draw different samples from the training data, fit a linear regression model to each new sample, and then examine the extent to which the fits differ.

There are two resampling methods that are most common:

Cross validation can be used to estimate the test error of a specific statistical learning method, or to select the appropriate level of flexibility of a method.

The bootstrap can be used to measure the accuracy of a parameter estimate, or of a statistical learning method.

There is a difference between the training error and the test error. The test error is what we are much more interested in measuring. However, we do not always have a large test dataset available to estimate the test error.

Many techniques exist to estimate the test error with the training data. These techniques fall into one of two categories:

- Applying a mathematical adjustment to the training error
- Holding out a subset of the training data from the fitting process and applying the model to the held-out data

The Validation Set Approach is a simple strategy for estimating the test error.

The available training data is randomly divided into a training set and test (hold-out) set. A model is fit to the training set, and is used to make predictions on the data in the test set. The error from the predictions on the test set serves as the estimate for the test error.

In regression, the measure of error is usually the Mean Squared Error (MSE). In classification, the measure of error is usually the misclassification rate.

To be sure about which model to fit, the Validation Set Approach process could be repeated multiple times to allow for differences in random data assignment between the training and test sets.
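The approach can be sketched in a few lines of Python (simulated data and a hand-rolled least squares fit, purely for illustration; the book's own examples use R):

```python
import random

random.seed(1)

# Made-up data: y = 2x + noise
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]

# Randomly divide the observations into a training set and a hold-out test set
random.shuffle(data)
train, test = data[:50], data[50:]

def fit_simple_lm(points):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b1 = (sum((x - mx) * (y - my) for x, y in points)
          / sum((x - mx) ** 2 for x, _ in points))
    return my - b1 * mx, b1

# Fit on the training half only
b0, b1 = fit_simple_lm(train)

# MSE on the held-out half serves as the estimate of the test error
test_mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in test) / len(test)
print(round(test_mse, 2))
```

Re-running with a different random split gives a different `test_mse`, which is exactly the variability drawback discussed next.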

There are two drawbacks associated with the Validation Set Approach:

- The validation estimate of the test error can be highly variable, depending on precisely which observations are randomly assigned to the training set and test set.
- Since only a subset of the observations are used to fit the model, the validation set error may overestimate the test error because statistical methods tend to perform worse when trained on fewer observations.

Cross-validation is a refinement to the Validation Set Approach that addresses these drawbacks.

Leave-One-Out Cross-Validation (LOOCV) is closely related to the Validation Set Approach, but attempts to address the drawbacks.

Similar to the Validation Set Approach, LOOCV involves splitting a full dataset into separate training and test sets. However, only a single observation is included in the test set, and all of the remaining observations are assigned to the training set.

A model is fit to the training set, and a prediction is made for the single excluded observation in the test set. The test error is determined for the prediction. Then, the LOOCV procedure is repeated to individually exclude each observation from the full dataset. Finally, the LOOCV estimate for the test error is simply the average of all of the individual test errors.
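As a sketch (again with simulated data and a hand-rolled simple linear regression, not the book's code), LOOCV is just a loop that holds out one observation at a time:

```python
import random

random.seed(0)

# Made-up data: y = 1.5x + noise
data = [(x, 1.5 * x + random.gauss(0, 1)) for x in range(30)]

def fit_simple_lm(points):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b1 = (sum((x - mx) * (y - my) for x, y in points)
          / sum((x - mx) ** 2 for x, _ in points))
    return my - b1 * mx, b1

errors = []
for i in range(len(data)):
    x_out, y_out = data[i]            # the single held-out observation
    rest = data[:i] + data[i + 1:]    # train on the remaining n - 1
    b0, b1 = fit_simple_lm(rest)
    errors.append((y_out - (b0 + b1 * x_out)) ** 2)

# The LOOCV estimate is simply the average of the n individual test errors
loocv_mse = sum(errors) / len(errors)
print(round(loocv_mse, 2))
```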

LOOCV has two advantages over the Validation Set Approach:

- LOOCV is not highly variable because it will yield identical results each time it is performed.
- LOOCV has far less bias because almost the entire dataset is being used to fit a model. Therefore, LOOCV does not overestimate the test error as much as the Validation Set Approach.

However, LOOCV has one potentially major disadvantage: it can be very computationally expensive, because the model must be fit *n* separate times.

K-Fold Cross-Validation (K-Fold CV) is an alternative to LOOCV, and is typically the preferred method.

It involves randomly dividing observations into *K* groups (folds) of approximately the same size.

The first fold is treated as the test set, and a model is fit to the remaining folds. The fitted model is used to make predictions on the observations in the first fold, and the test error is measured. The K-Fold CV procedure is repeated *K* times so that a different fold is treated as the test set each time. Finally, the K-Fold CV estimate of the test error is simply the average of the *K* test measures.

LOOCV is basically a special case of K-Fold CV, in which *K* is set equal to *n*.
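The K-Fold procedure can be sketched the same way (simulated data and a hand-rolled linear fit, purely for illustration):

```python
import random

random.seed(0)

# Made-up data: y = 1.5x + noise
data = [(x, 1.5 * x + random.gauss(0, 1)) for x in range(30)]

def fit_simple_lm(points):
    """Closed-form least squares for y = b0 + b1 * x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    b1 = (sum((x - mx) * (y - my) for x, y in points)
          / sum((x - mx) ** 2 for x, _ in points))
    return my - b1 * mx, b1

K = 5
random.shuffle(data)
folds = [data[i::K] for i in range(K)]  # K groups of roughly equal size

fold_mses = []
for k in range(K):
    test = folds[k]  # fold k is treated as the test set
    train = [pt for j in range(K) if j != k for pt in folds[j]]
    b0, b1 = fit_simple_lm(train)
    fold_mses.append(sum((y - (b0 + b1 * x)) ** 2 for x, y in test) / len(test))

# The K-Fold CV estimate is the average of the K test measures
cv_mse = sum(fold_mses) / K
print(round(cv_mse, 2))
```

Setting `K = len(data)` makes every fold a single observation, which recovers LOOCV.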

K-Fold CV has two advantages over LOOCV:

- K-Fold CV is much less computationally expensive.
- K-Fold CV is advantageous when it comes to the Bias-Variance tradeoff.

The K-Fold CV test error estimates tend to have lower variance than the LOOCV estimates because the K fitted models overlap less, and are therefore less correlated with one another, than the *n* nearly identical models fit during LOOCV. Overall, K-Fold CV provides a more accurate measure of the test error.

Typically, *K* = 5 or *K* = 10 is chosen for K-Fold CV because these have been shown to empirically yield test error estimates that suffer neither from excessively high bias nor variance.

The bootstrap is a widely applicable and extremely powerful statistical tool that can be used to quantify the uncertainty associated with an estimator or statistical learning method.

The point of the bootstrap can be illustrated with an example where we are trying to figure out the best investment allocation under a simple model.

Assume that we want to invest $10,000 into two stocks that yield X and Y, respectively:

- Stock A: X Return (random quantity)
- Stock B: Y Return (random quantity)
- Stock A and Stock B are not the same, and have different means and variances for their returns

We will invest *α* into A, and 1-*α* into B. Additionally, *α* will be chosen to minimize the total risk of the investment, since there is variability associated with the returns. The value of *α* that minimizes risk is:

*α* = (σ²(Y) - σ(X,Y)) / (σ²(X) + σ²(Y) - 2σ(X,Y))

where σ²(X) and σ²(Y) are the variances of the two returns, and σ(X,Y) is their covariance.

The variances are unknown, but could be estimated through past data of returns for the investments. Plugging the variances into the equation would let us know the value for *α*.

However, since we only have one dataset, we can only determine one value for *α*. But what if we wanted to estimate the accuracy of the estimate for *α*? To estimate a variance for the *α* value itself, we would need hundreds of datasets. However, we usually cannot simply generate new samples from the original population. The bootstrap approach allows us to *emulate* the process of obtaining new sample datasets so that we can estimate variability without *actually* generating new sample datasets. Instead of obtaining new independent datasets from the population, the bootstrap method obtains distinct datasets by repeatedly sampling observations from the original dataset.

For example, assume that we have a dataset of 15 observations for the returns of A and B. The bootstrap method randomly selects 15 observations from the dataset to produce a new bootstrapped dataset. The selection is done with replacement, meaning that it is possible for the same observation to occur more than one time in the bootstrapped dataset. The bootstrapping procedure is repeated a large number of times (usually 1000) so that 1000 bootstrapped datasets are generated. These datasets would give us 1000 values for *α*, which would help us ultimately determine the variance for *α*.
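This procedure can be sketched in Python. The helper below uses the standard minimum-risk allocation formula α = (σ²(Y) - σ(X,Y)) / (σ²(X) + σ²(Y) - 2σ(X,Y)), and the 15 "observed" return pairs are simulated rather than real:

```python
import random

random.seed(0)

# Made-up sample: 15 observed (X, Y) return pairs
returns = [(random.gauss(0.05, 0.10), random.gauss(0.03, 0.05)) for _ in range(15)]

def alpha(sample):
    """Minimum-risk allocation: (var_y - cov) / (var_x + var_y - 2*cov)."""
    n = len(sample)
    mx = sum(x for x, _ in sample) / n
    my = sum(y for _, y in sample) / n
    var_x = sum((x - mx) ** 2 for x, _ in sample) / (n - 1)
    var_y = sum((y - my) ** 2 for _, y in sample) / (n - 1)
    cov = sum((x - mx) * (y - my) for x, y in sample) / (n - 1)
    return (var_y - cov) / (var_x + var_y - 2 * cov)

# Each bootstrap dataset: 15 draws from the original 15, WITH replacement
boot_alphas = [alpha(random.choices(returns, k=len(returns))) for _ in range(1000)]

# The spread of the 1000 alpha values estimates the variability of alpha itself
mean_alpha = sum(boot_alphas) / len(boot_alphas)
se_alpha = (sum((a - mean_alpha) ** 2 for a in boot_alphas)
            / (len(boot_alphas) - 1)) ** 0.5
print(round(se_alpha, 3))
```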

*Originally published at* https://www.bijenpatel.com *on August 5, 2020.*


```
library(ISLR)
library(MASS)
library(ggplot2)
library(dplyr)
# We will work with the Auto dataset
head(Auto)
# The set.seed function allows reproducibility of results any time that random numbers are being generated
# It takes an arbitrary integer value
set.seed(1)
# The Validation Set approach involves randomly splitting the data into two halves
nrow(Auto) # Determine how many rows of data we have
train = sample(392, 196) # Randomly choose which 196 rows to use for the training data
Auto_train = Auto[train,] # Create a training set
Auto_test = Auto[-train,] # Create a test set
# Fit a simple linear model to the training data
# We are attempting to predict a vehicle's Miles per Gallon with Horsepower as the predictor
Auto_lm_1 = lm(mpg ~ horsepower, data=Auto_train)
# Use the fitted model to predict on the held-out test data
Auto_test_predictions = predict(Auto_lm_1, Auto_test)
# Determine the Mean Squared Error (MSE) of the test data, which is a measure of the test error
Auto_test_mse = data.frame(mpg=Auto_test$mpg, predicted_mpg=Auto_test_predictions)
Auto_test_mse$sq_error = (Auto_test_mse$mpg - Auto_test_mse$predicted_mpg)^2
Auto_lm_1_mse = mean(Auto_test_mse$sq_error)
# Fit a polynomial regression model instead, which seems to be a better choice
Auto_lm_2 = lm(mpg ~ horsepower + I(horsepower^2), data=Auto_train)
Auto_test_predictions_2 = predict(Auto_lm_2, Auto_test)
Auto_test_mse_2 = data.frame(mpg=Auto_test$mpg, predicted_mpg=Auto_test_predictions_2)
Auto_test_mse_2$sq_error = (Auto_test_mse_2$mpg - Auto_test_mse_2$predicted_mpg)^2
Auto_lm_2_mse = mean(Auto_test_mse_2$sq_error)
# Running same models with different random training/test splits results in different errors
set.seed(2)
train = sample(392, 196)
# Repeat above models to see that MSE will be slightly different
# This is because variance increases due to only using half the data to fit the model
```

```
library(boot) # For the cv.glm function
# We continue working with the Auto data
# We use the glm function instead of the lm function to fit a linear regression model
# This is because a glm object can be used with the cv.glm function from the boot library
# We fit the model to the entire Auto data because cv.glm will perform the LOOCV for us
Auto_glm_1 = glm(mpg ~ horsepower, data=Auto)
# Perform LOOCV and determine the test error
Auto_glm_1_cv = cv.glm(data=Auto, glmfit=Auto_glm_1) # Perform LOOCV
Auto_glm_1_cv$delta # Test error is stored in the delta component
# There are two values
# The first value is the standard CV estimate
# The second value is a bias-corrected CV estimate
# What if we wanted to get LOOCV test error estimates for five different polynomial models?
# Easily fit five different polynomial models from degree 1 to 5 and compare their test errors
Auto_loocv_errors = rep(0, 5) # Create initial vector for the errors
# The for loop goes through all five degrees and stores their errors
for (i in 1:5){
Auto_glm = glm(mpg ~ poly(horsepower, i), data=Auto)
Auto_loocv_errors[i] = cv.glm(data=Auto, glmfit=Auto_glm)$delta[1]
}
# Plot the different model degrees and their errors with ggplot
Auto_loocv_errors = data.frame(Degree = c(1, 2, 3, 4, 5), Error = Auto_loocv_errors)
Auto_loocv_errors_plot = ggplot(Auto_loocv_errors, aes(x=Degree, y=Error)) +
geom_line() +
labs(title="Auto LOOCV Errors", subtitle="Polynomial Models (Degrees 1 to 5)", x="Degree", y="LOOCV Error") +
geom_line(colour="red")
plot(Auto_loocv_errors_plot)
```

```
# What if we wanted K-Fold test error estimates instead of LOOCV?
# We can easily repurpose the previous for loop for K-Fold CV
# We will perform 10-Fold CV on the Auto data
Auto_tenfold_errors = rep(0, 5) # Create initial vector for the errors
for (i in 1:5){
Auto_glm = glm(mpg ~ poly(horsepower, i), data=Auto)
Auto_tenfold_errors[i] = cv.glm(data=Auto, glmfit=Auto_glm, K=10)$delta[1]
}
# Simply change the previous for loop by adding K=10
# Plot the different model degrees and their K-Fold errors with ggplot
Auto_tenfold_errors = data.frame(Degree = c(1, 2, 3, 4, 5), Error = Auto_tenfold_errors)
Auto_tenfold_errors_plot = ggplot(Auto_tenfold_errors, aes(x=Degree, y=Error)) +
geom_line() +
labs(title="Auto 10-Fold Errors", subtitle="Polynomial Models (Degrees 1 to 5)", x="Degree", y="10-Fold Error") +
geom_line(colour="red")
plot(Auto_tenfold_errors_plot)
```

```
library(boot) # The boot library is also for bootstrap
# Bootstrap can be used to assess the variability of coefficient estimates of a model
# First, we create a simple function to get coefficient values from a linear regression model
coef_function = function(data, index)
return(lm(mpg ~ horsepower + I(horsepower^2), data=data, subset=index)$coefficients)
# Test the function by fitting a linear model to the entire Auto dataset
coef_function(Auto, 1:392)
# We specified data=Auto and index=1:392
# Now, we test the function with a dataset through the bootstrap resampling method
# Bootstrap creates a new dataset by randomly sampling from the original data WITH replacement
set.seed(1)
coef_function(Auto, sample(392, 392, replace=T))
# So far, that was only one result from one bootstrap dataset
# What if we wanted to get 1000 results?
# Use the boot function to get 1000 bootstrap estimates for the coefficients and standard errors
boot(Auto, coef_function, 1000)
# Compare results of the bootstrap estimates to the standard quadratic model on the original data
summary(lm(mpg ~ horsepower + I(horsepower^2), data=Auto))$coef
```

This is a summary of chapter 4 of the *Introduction to Statistical Learning* textbook.

Qualitative variables, such as gender, are known as categorical variables. Predicting qualitative responses is known as classification.

Some real world examples of classification include determining whether or not a banking transaction is fraudulent, or determining whether or not an individual will default on credit card debt.

The three most widely used classifiers, which are covered in this post, are:

- Logistic Regression
- Linear Discriminant Analysis
- K-Nearest Neighbors

There are also more advanced classifiers, which are covered later:

- Generalized Additive Models
- Trees
- Random Forests
- Boosting
- Support Vector Machines

Logistic regression models the probability that the response *Y* belongs to a particular category.

For example, assume that we have data on whether or not someone defaulted on their credit. The data includes one predictor for the credit balance that someone had.

Logistic regression would model the probability of default, given credit balance:

Additionally, a probability threshold can be chosen for the classification.

For example, if we choose a probability threshold of 50%, then we would indicate any observation with a probability of 50% or more as “default.”

However, we could also choose a more conservative probability threshold, such as 10%. In this case, any observation with a probability of 10% or more would be indicated as “default.”

The logistic function is used to model the relationship between the probability (*Y*) and some predictor (*X*) because the function falls between 0 and 1 for all values of *X*. The logistic function has the form:

*p*(*X*) = *e*^(*β*₀ + *β*₁*X*) / (1 + *e*^(*β*₀ + *β*₁*X*))

The logistic function always produces an S-shaped curve.
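A tiny sketch (with made-up coefficients *β*₀ = -3 and *β*₁ = 0.05) confirms both properties:

```python
import math

def logistic(x, b0=-3.0, b1=0.05):
    """p(X) = e^(b0 + b1*x) / (1 + e^(b0 + b1*x)), hypothetical coefficients."""
    t = b0 + b1 * x
    return math.exp(t) / (1 + math.exp(t))

probs = [logistic(x) for x in range(0, 201, 25)]

# Probabilities are bounded in (0, 1), and increase monotonically when b1 > 0,
# tracing out the S-shaped curve
assert all(0 < p < 1 for p in probs)
assert probs == sorted(probs)
```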

Additionally, the logistic function can also be rewritten as a logit function:

log(*p*(*X*) / (1 - *p*(*X*))) = *β*₀ + *β*₁*X*

The logistic regression model for Credit Default data may look like the chart below.

This equation can be interpreted as a one unit increase in *X* changing the log-odds or logit (left side of equation) by *β*₁.

The fraction inside the log() is known as the odds. In the context of the Credit Default data, the odds would indicate the ratio of the probability of defaulting versus the probability of not defaulting. For example, on average, 1 in 5 people with an odds of 1/4 will default. On the other hand, on average, 9 out of 10 people with an odds of 9/1 will default.

So, alternatively, the logit function can also be interpreted as a one-unit increase in *X* multiplying the odds by *e*^*β*₁.

*β*₁ cannot be interpreted as a specific change in value for the probability. The only conclusion that can be made is that if the coefficient is positive, then an increase in *X* will increase the probability, whereas if the coefficient is negative, then an increase in *X* will decrease the probability.

The coefficients are estimated through the maximum likelihood method.

In the context of the Credit Default data, the maximum likelihood estimate essentially attempts to find *β*₀ and *β*₁ such that plugging these estimates into the logistic function results in a number close to 1 for individuals who defaulted, and a number close to 0 for individuals who did not default.

In other words, maximum likelihood chooses coefficients such that the predicted probability of each observation in the data corresponds as closely as possible to the actual observed status.
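One way to build intuition for maximum likelihood is to fit a small logistic regression numerically. This sketch uses simulated data and plain gradient ascent on the log-likelihood (R's `glm` actually uses a Newton-type algorithm, so this is an illustration of the principle, not the book's method):

```python
import math
import random

random.seed(0)

# Simulate 500 observations from a known logistic model
true_b0, true_b1 = 0.5, 1.5
data = []
for _ in range(500):
    x = random.uniform(-3, 3)
    p = 1 / (1 + math.exp(-(true_b0 + true_b1 * x)))
    data.append((x, 1 if random.random() < p else 0))

# Gradient ascent: repeatedly nudge (b0, b1) uphill on the log-likelihood
b0, b1 = 0.0, 0.0
lr = 0.1
for _ in range(1000):
    g0 = g1 = 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(b0 + b1 * x)))
        g0 += y - p          # derivative of the log-likelihood w.r.t. b0
        g1 += (y - p) * x    # derivative w.r.t. b1
    b0 += lr * g0 / len(data)
    b1 += lr * g1 / len(data)

# The estimates should land near the true coefficients (0.5, 1.5)
print(round(b0, 2), round(b1, 2))
```

The fitted coefficients make the predicted probabilities close to 1 for the observations with y = 1 and close to 0 for those with y = 0, which is exactly the maximum likelihood criterion described above.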

So, how do we determine whether or not there truly is a relationship between the probability of a class and some predictor?

Similar to the linear regression setting, we conduct a hypothesis test:

*z*-statistic and *p*-value

In logistic regression, we have a *z*-statistic instead of the *t*-statistic that we had in linear regression. However, they are essentially the same.

The *z*-statistic measures the number of standard deviations that *β*₁ is away from 0.

The *z*-statistic allows us to determine the *p*-value, which ultimately helps determine whether or not the coefficient is non-zero.

The *p*-value indicates how likely it would be to observe the association between the probability of a class and the predictor *X* purely by random chance, as opposed to there being a true relationship between them.

Typically, we want *p*-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists.

What if our dataset had multiple predictors? For example, let’s expand our Credit Default dataset to include two additional predictors: student status and income.

Similar to how the simple linear regression model was extended to multiple linear regression, the logistic regression model is extended in a related fashion:

The interpretation of the coefficients remains nearly the same. However, when interpreting one of the coefficients, we have to indicate that the values of the other predictors remain fixed.

What if we had to classify observations in more than two classes? In the Credit Default data, we only had two classes: Default and No Default.

For example, assume that we had a medical dataset and had to classify medical conditions as either a stroke, drug overdose, or seizure.

The two-class logistic regression models have multiple-class extensions, but are not used often.

Discriminant analysis is the popular approach for multiple-class classification.

Linear discriminant analysis is an alternative approach to classification that models the distribution of the predictors separately in each of the response classes (*Y*), and then uses Bayes’ theorem to flip these around into estimates of the probability that an observation belongs to each class.

When classes are well separated in a dataset, logistic regression parameter estimates are unstable, whereas linear discriminant analysis does not have this problem.

If the number of observations in a dataset is small, and the distribution of the predictors is approximately normal, then linear discriminant analysis will typically be more stable.

Additionally, as mentioned previously, linear discriminant analysis is the popular approach for scenarios in which we have more than two classes in the response.

Assume that we have a qualitative response variable (*Y*) that can take on *K* distinct class values.

*π*ₖ represents the prior probability that a randomly chosen observation comes from the *k*th class.

Bayes’ theorem states that:

This is the posterior probability that an observation belongs to some class *K*.

The Bayes’ classifier has the lowest possible error rate out of all classifiers because it classifies an observation to the class for which the *Pr*(*Y*=*k*|*X*=*x*) is largest.

Estimating *π* is easy if we have random sample data from a population. We simply determine the fraction of observations that belong to some class *K*.

However, estimating *f*ₖ(*X*), the density of *X* within the *k*th class, is more challenging unless we assume simple forms for the density.

Suppose that *f*(*X*) is a normal distribution.

*μ* represents the class-specific mean.

*σ* represents the class-specific variance. However, we further assume that all classes have variances that are equal.

The Bayes’ classifier assigns observations to the class for which the following discriminant is largest:

*δ*ₖ(*x*) = *x* · *μ*ₖ / *σ*² - *μ*ₖ² / (2*σ*²) + log(*π*ₖ)

The above equation is obtained through some mathematical simplification of the Bayes probability formula.

However, the Bayes’ classifier can only be determined if we know that *X* is drawn from a normal distribution, and know all of the parameters involved, which does not happen in real situations. Additionally, even if we were sure that *X* was drawn from a normal distribution, we would still have to estimate the parameters.

Linear discriminant analysis approximates the Bayes’ classifier by using these estimates in the previous equation:

- *n* represents the total number of observations
- *n*ₖ represents the number of observations in the *k*th class
- *K* represents the total number of classes

Linear discriminant analysis assumes a normal distribution and common variance among the classes.
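These plug-in estimates can be sketched for one predictor and two classes. The numbers below are made up, and the discriminant is the standard one-predictor form *δ*ₖ(*x*) = *x*·*μ*ₖ/*σ*² - *μ*ₖ²/(2*σ*²) + log(*π*ₖ):

```python
import math

# Toy training data: one predictor, two classes (hypothetical values)
class_data = {
    "No Default": [1.0, 1.2, 0.8, 1.1, 0.9, 1.3],
    "Default":    [3.0, 2.8, 3.2, 3.1],
}

n_total = sum(len(v) for v in class_data.values())
K = len(class_data)

# Plug-in estimates: class means, pooled (common) variance, and priors
mu = {k: sum(v) / len(v) for k, v in class_data.items()}
pooled_var = sum(
    sum((x - mu[k]) ** 2 for x in v) for k, v in class_data.items()
) / (n_total - K)
prior = {k: len(v) / n_total for k, v in class_data.items()}

def discriminant(x, k):
    # delta_k(x) = x*mu_k/sigma^2 - mu_k^2/(2*sigma^2) + log(pi_k)
    return x * mu[k] / pooled_var - mu[k] ** 2 / (2 * pooled_var) + math.log(prior[k])

def classify(x):
    # Assign the class whose discriminant is largest
    return max(class_data, key=lambda k: discriminant(x, k))

print(classify(1.0))  # "No Default"
print(classify(3.0))  # "Default"
```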

Linear discriminant analysis can be extended to allow for multiple predictors. In this scenario, we assume that the predictors come from a multivariate normal distribution with class-specific means and a common covariance matrix. Additionally, we assume that each predictor follows a one-dimensional normal distribution with some correlation between each pair of predictors.

For example, linear discriminant analysis could be used on the Credit Default dataset with the multiple predictors.

Fitting a linear discriminant model to the full Credit Default data in the ISLR R package results in an overall training error rate of 2.75%. This may seem low at first, but there are a couple of key things to keep in mind:

- Training error rates will usually be lower than test error rates.
- The Credit Default dataset is skewed. Only 3.33% of people in the data defaulted. Therefore, a simple and useless classifier that always predicts that each individual will not default will have an error rate of 3.33%.

For these reasons, it is often of interest to look at a confusion matrix because binary classifiers can make two types of errors:

- Incorrectly assign an individual who defaults to the No Default category.
- Incorrectly assign an individual who does not default to the Default category.

A confusion matrix is a convenient way to display this information, and looks as follows for the linear discriminant model (50% probability threshold) fit to the full Credit Default data:

The confusion matrix shows that out of the 333 individuals who defaulted, 252 were missed by linear discriminant analysis, which is a 75.7% error rate. This is known as a class-specific error rate.

Linear discriminant analysis does a poor job of classifying customers who default because it is trying to approximate the Bayes’ classifier, which has the lowest **total** error rate instead of class-specific error rate.

The probability threshold for determining defaults could be lowered to improve the model. Lowering the probability threshold to 20% results in the following confusion matrix:

Now, linear discriminant analysis correctly predicts 195 individuals who defaulted, out of the 333 total. This is an improvement over the previous error rate. However, 235 individuals who do not default are classified as defaulters, compared to only 23 previously. This is the tradeoff that results from lowering the probability threshold.

The threshold value to use in a real situation should be based on domain and industry knowledge, such as information about the costs associated with defaults.

The ROC curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

The overall performance of a classifier summarized over all possible thresholds is given by the area under the curve (AUC). An ideal ROC curve will hug the top left corner, so the larger the AUC, the better the classifier. A classifier that performs no better than chance will have an AUC of 0.50.

Below is an example of an ideal ROC curve (blue) versus an ROC curve that indicates that the model performs no better than chance (black).

ROC curves are useful for comparing different classifiers because they account for all possible probability threshold values.

The *Y* axis indicates the True Positive Rate, which is also known as the sensitivity. In the context of the Credit Default data, it represents the fraction of defaulters who are correctly classified.

The *X* axis indicates the False Positive Rate, which equals one minus the specificity. In the context of the Credit Default data, it represents the fraction of non-defaulters who are incorrectly classified as defaulters.

If the AUC of one model is clearly higher than that of the others, it is likely the best model to use.
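One way to see what the AUC measures: it equals the probability that a randomly chosen positive observation is ranked above a randomly chosen negative one (ties count half). A sketch with made-up predicted probabilities:

```python
# Hypothetical predicted default probabilities and true labels (1 = default)
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1,   1,   0,   1,   0,    0,   0,   0]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

# AUC = fraction of (positive, negative) pairs ranked correctly
pairs = [(p, n) for p in pos for n in neg]
auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)
print(auc)  # 14/15, about 0.933
```

A classifier that ranks every positive above every negative scores 1.0; one that ranks at random scores about 0.50.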

Linear discriminant analysis assumes that observations within each class are drawn from a multivariate normal distribution with class-specific means and a common covariance matrix for all of the classes.

Quadratic discriminant analysis assumes that each class has its own covariance matrix. In other words, quadratic discriminant analysis relaxes the assumption of the common covariance matrix.

Which method is better for classification? LDA or QDA? The answer lies in the bias-variance tradeoff.

LDA is a less flexible classifier, meaning it has lower variance than QDA. However, if the assumption of the common covariance matrix is badly off, then LDA could suffer from high bias.

In general, LDA tends to be a better classifier than QDA if there are relatively few observations in the training data because reducing variance is crucial in this case.

In general, QDA is recommended over LDA if the training data is large, meaning that the variance of the classifier is not a major concern. QDA is also recommended over LDA if the assumption of the common covariance matrix is flawed.

K-Nearest Neighbors (KNN) is a popular nonparametric classifier method.

Given a positive integer *K* and some test observation, the KNN classifier identifies the *K* points in the training data that are closest to the test observation. These closest *K* points are represented by *N*₀. Then, it estimates the conditional probability for a class as the fraction of points in *N*₀ that represent that specific class. Lastly, KNN will apply the Bayes’ rule and classify the test observation to the class with the largest probability.

However, the choice of the *K* value is very important. Lower values are more flexible, whereas higher values are less flexible but have more bias. Similar to the regression setting, a bias-variance tradeoff exists.
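The whole KNN procedure fits in a few lines of Python (toy two-dimensional data and Euclidean distance; the points and labels are made up):

```python
import math
from collections import Counter

# Tiny toy training set: (x1, x2) points with class labels
train = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.8, 1.1), "A"),
    ((3.0, 3.0), "B"), ((3.2, 2.9), "B"), ((2.8, 3.1), "B"),
]

def knn_classify(x, k=3):
    # Identify the K training points closest to the test observation
    neighbors = sorted(train, key=lambda t: math.dist(x, t[0]))[:k]
    # Estimate each class's probability as its fraction among the K neighbors,
    # then assign the class with the largest estimated probability
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))  # "A"
print(knn_classify((3.1, 3.0)))  # "B"
```

Raising `k` toward the size of the training set smooths the decision boundary (less flexible, more bias); `k = 1` gives the most flexible, highest-variance fit.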

There are four main classification methods: logistic regression, LDA, QDA, and KNN.

Logistic regression and LDA are closely connected. They both produce linear decision boundaries. However, LDA may provide improvement over logistic regression when the assumption of the normal distribution with common covariance for classes holds. Additionally, LDA may be better when the classes are well separated. On the other hand, logistic regression outperforms LDA when the normal distribution assumption is not met.

KNN is a completely non-parametric approach. There are no assumptions made about the shape of the decision boundary. KNN will outperform both logistic regression and LDA when the decision boundary is highly nonlinear. However, KNN does not indicate which predictors are important.

QDA serves as a compromise between the nonparametric KNN method and the linear LDA and logistic regression methods.

*Originally published at* https://www.bijenpatel.com *on August 4, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(MASS) # For model functions
library(ISLR) # For datasets
library(ggplot2) # For plotting
library(dplyr) # For easy data manipulation functions
library(caret) # For confusion matrix function
library(e1071) # Requirement for caret library
# Working with the Stock Market dataset to predict direction of stock market movement (up/down)
head(Smarket)
summary(Smarket)
# Fit a logistic regression model to the data
# The direction of the stock market is the response
# The lags (% change in market on previous day, 2 days ago, etc.) and trade volume are the predictors
Smarket_logistic_1 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family="binomial", data=Smarket)
summary(Smarket_logistic_1)
# Use the contrasts function to determine which direction is indicated as "1"
contrasts(Smarket$Direction)
## Up
## Down 0
## Up 1
# Use the names function to determine the objects in the logistic model
names(Smarket_logistic_1)
# Get confusion matrix to determine accuracy of the logistic model
Smarket_predictions_1 = data.frame(Direction=Smarket_logistic_1$fitted.values)
Smarket_predictions_1 = mutate(Smarket_predictions_1, Direction = ifelse(Direction >= 0.50, "Up", "Down"))
Smarket_predictions_1$Direction = factor(Smarket_predictions_1$Direction, levels=c("Down", "Up"), ordered=TRUE)
confusionMatrix(Smarket_predictions_1$Direction, Smarket$Direction, positive="Up")
# Instead of fitting logistic model to the entire data, split the data into training/test sets
Smarket_train = filter(Smarket, Year <= 2004)
Smarket_test = filter(Smarket, Year == 2005)
Smarket_logistic_2 = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family="binomial", data=Smarket_train)
# Use the model to make predictions on the test data
Smarket_test_predictions = predict(Smarket_logistic_2, Smarket_test, type="response")
# Get confusion matrix to determine accuracy on the test dataset
Smarket_test_predictions = data.frame(Direction=Smarket_test_predictions)
Smarket_test_predictions = mutate(Smarket_test_predictions, Direction = ifelse(Direction >= 0.50, "Up", "Down"))
Smarket_test_predictions$Direction = factor(Smarket_test_predictions$Direction, levels=c("Down", "Up"), ordered=TRUE)
confusionMatrix(Smarket_test_predictions$Direction, Smarket_test$Direction)
# Fit a better logistic model that only considers the most important predictors (Lag1, Lag2)
Smarket_logistic_3 = glm(Direction ~ Lag1 + Lag2, family="binomial", data=Smarket_train)
# Repeat the procedure to obtain a confusion matrix for the test data ...
```

```
# Continuing to work with the Stock Market dataset
# The lda function is used to fit a linear discriminant model
Smarket_lda = lda(Direction ~ Lag1 + Lag2, data=Smarket_train)
Smarket_lda # View the prior probabilities, group means, and coefficients
# Plot of the linear discriminants of each observation in the training data, separated by class
plot(Smarket_lda)
# Use the model to make predictions on the test dataset
Smarket_lda_predictions = predict(Smarket_lda, Smarket_test)
# View the confusion matrix to assess accuracy
confusionMatrix(Smarket_lda_predictions$class, Smarket_test$Direction, positive="Up")
# Notice that a predicted probability >= 50% actually corresponds to "Down" in LDA
Smarket_lda_predictions$posterior[1:20]
Smarket_lda_predictions$class[1:20]
```

```
# Continue to work with the Stock Market dataset
# The qda function is used to fit a quadratic discriminant model
Smarket_qda = qda(Direction ~ Lag1 + Lag2, data=Smarket_train)
Smarket_qda # View the prior probabilities and group means
# Use the model to make predictions on the test dataset
Smarket_qda_predictions = predict(Smarket_qda, Smarket_test)
# View the confusion matrix to assess accuracy
confusionMatrix(Smarket_qda_predictions$class, Smarket_test$Direction, positive="Up")
# Notice that a predicted probability >= 50% actually corresponds to "Down" in QDA
Smarket_qda_predictions$posterior[1:20]
Smarket_qda_predictions$class[1:20]
```

```
# Continue to work with the Stock Market dataset
library(class) # The class library is used for KNN
# Before performing KNN, separate dataframes are made for the predictors and responses
Smarket_train_predictors = data.frame(Lag1=Smarket_train$Lag1, Lag2=Smarket_train$Lag2)
Smarket_test_predictors = data.frame(Lag1=Smarket_test$Lag1, Lag2=Smarket_test$Lag2)
Smarket_train_response = Smarket_train$Direction
Smarket_test_response = Smarket_test$Direction
# Perform KNN with K=3
set.seed(1)
Smarket_predictions_knn = knn(Smarket_train_predictors,
Smarket_test_predictors,
Smarket_train_response,
k=3)
# See a confusion matrix to assess the accuracy of KNN
confusionMatrix(Smarket_predictions_knn, Smarket_test_response)
# Next, we will use KNN on the Caravan data to predict whether or not someone will purchase caravan insurance
# The data contains demographic data on individuals, and whether or not insurance was bought
# Before performing KNN, predictors should be scaled to have mean 0 and standard deviation 1
Caravan_scaled = scale(Caravan[,-86])
# We will designate the first 1000 rows as the test data and the remaining as training data
# Create separate datasets for the predictors and responses
Caravan_test_predictors = Caravan_scaled[1:1000,]
Caravan_train_predictors = Caravan_scaled[1001:5822,]
Caravan_test_response = Caravan[1:1000, 86]
Caravan_train_response = Caravan[1001:5822, 86]
# Perform KNN
set.seed(1)
Caravan_knn_predictions = knn(Caravan_train_predictors,
Caravan_test_predictors,
Caravan_train_response,
k=5)
# Assess accuracy
confusionMatrix(Caravan_knn_predictions, Caravan_test_response, positive="Yes")
# A 26.67% success rate among predicted purchasers is better than the 6% rate from random guessing
```

This is a summary of chapter 3 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

Simple linear regression assumes a linear relationship between the predictor (*X*) and the response (*Y*). A simple linear regression model takes the following form:

- *y-hat* — represents the predicted value
- *β*₀ — represents a coefficient known as the intercept
- *β*₁ — represents a coefficient known as the slope
- *X* — represents the value of the predictor

For example, we could build a simple linear regression model from the following statistician salary dataset:

The simple linear regression model could be written as follows:

The best estimates for the coefficients (*β*₀, *β*₁) are obtained by finding the regression line that fits the training dataset points as closely as possible. This line can be obtained by minimizing the least squares criteria.

What does it mean to minimize the least squares criteria? Let’s use the example of the regression model for the statistician salaries.

The difference between the actual salary value and the predicted salary is known as the residual (*e*). The residual sum of squares (RSS) is defined as:

The least squares criteria chooses the *β* coefficient values that minimize the RSS.
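For simple linear regression, minimizing the RSS has a closed-form solution. A quick base-R sketch with made-up data, checked against `lm()`:

```r
# Closed-form least squares for simple regression (made-up data):
#   beta1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   beta0 = ybar - beta1 * xbar
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 4.3, 5.9, 8.2, 9.8)
beta1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0 <- mean(y) - beta1 * mean(x)
c(beta0, beta1)
coef(lm(y ~ x))  # lm() minimizes the same RSS, so the coefficients match
```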

For our statistician salary dataset, the linear regression model determined through the least squares criteria is as follows:

- *β*₀ is $70,545
- *β*₁ is $2,576

This final regression model can be visualized by the orange line below:

How do we interpret the coefficients of a simple linear regression model in plain English? In general, we say that:

- If the predictor (*X*) were 0, the prediction (*Y*) would be *β*₀, on average.
- For every one increase in the predictor, the prediction changes by *β*₁, on average.

Using the example of the final statistician salary regression model, we would conclude that:

- If a statistician had 0 years of experience, he/she would have an entry-level salary of $70,545, on average.
- For every one additional year of experience that a statistician has, his/her salary increases by $2,576, on average.

The true population regression line represents the “true” relationship between *X* and *Y*. However, we never know the true relationship, so we use least squares regression to estimate it with the data that we have available.

For example, assume that the true population regression line for statistician salaries was represented by the black line below. The least squares regression line, represented by the orange line, is close to the true population regression line, but not exactly the same.

So, how do we estimate how accurate the least squares regression line is as an estimate of the true population regression line?

We compute the standard error of the coefficients and determine the confidence interval.

The standard error is a measure of the accuracy of an estimate. Knowing how to mathematically calculate the standard error is not important, as programs like R will determine them easily.

Standard errors are used to compute confidence intervals, which provide an estimate of how accurate the least squares regression line is.

The most commonly used confidence interval is the 95% confidence interval.

The confidence interval is generally interpreted as follows:

- There is a 95% probability that the interval contains the true population value of the coefficient.

For example, for the statistician salary regression model, the confidence intervals are as follows:

- *β*₀ = [67852, 72281]
- *β*₁ = [1989, 3417]

In the context of the statistician salaries, these confidence intervals are interpreted as follows:

- In the absence of any years of experience, the salary of an entry-level statistician will fall between $67,852 and $72,281.
- For each one additional year of experience, a statistician’s salary will increase between $1,989 and $3,417.

So, how do we determine whether or not there truly is a relationship between *X* and *Y*? In other words, how do we know that *X* is actually a good predictor for *Y*?

We use the standard errors to perform hypothesis tests on the coefficients.

The most common hypothesis test involves testing the null hypothesis versus the alternative hypothesis:

- Null Hypothesis: No relationship between *X* and *Y*
- Alternative Hypothesis: Some relationship between *X* and *Y*

*t*-statistic and *p*-value

So, how do we determine if *β*₁ is non-zero? We use the estimated value of the coefficient and its standard error to determine the *t*-statistic: *t* = *β*₁ / SE(*β*₁).

The *t*-statistic measures the number of standard deviations that *β*₁ is away from 0.

The *t*-statistic allows us to determine something known as the *p*-value, which ultimately helps determine whether or not the coefficient is non-zero.

The *p*-value indicates how likely it is to observe a meaningful association between *X* and *Y* purely by random chance, as opposed to there being a true relationship between *X* and *Y*.

Typically, we want *p*-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists between *X* and *Y*.
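A short base-R sketch of this calculation on simulated data: the *t*-statistic is just the estimate divided by its standard error, and the *p*-value comes from the *t* distribution.

```r
# t-statistic = estimate / standard error; p-value from the t distribution
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)                   # simulated data
fit <- summary(lm(y ~ x))
est <- fit$coefficients["x", "Estimate"]
se  <- fit$coefficients["x", "Std. Error"]
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = 20 - 2)  # two-sided test of beta1 = 0
c(t_stat, p_val)                             # matches summary()'s "t value" and "Pr(>|t|)"
```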

There are 2 main assessments for how well a model fits the data: RSE and *R*².

**Residual Standard Error (RSE)**

The RSE is a measure of the standard deviation of the random error term (*ϵ*).

In other words, it is the average amount that the actual response will deviate from the true regression line. It is a measure of the lack of fit of a model.

The value of RSE and whether or not it is acceptable will depend on the context of the problem.

**R-Squared (R²)**

*R*² measures the proportion of variability in *Y* that can be explained by using *X*. It is a proportion that is calculated as *R*² = (TSS − RSS)/TSS = 1 − RSS/TSS.

TSS is a measure of variability that is already inherent in the response variable before regression is performed.

RSS is a measure of variability in the response variable after regression is performed.

The final statistician salary regression model has an *R*² of 0.90, meaning that 90% of the variability in the salaries of statisticians is explained by using years of experience as a predictor.
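The calculation is easy to verify by hand in R. The sketch below uses simulated data rather than the salary example:

```r
# R^2 = 1 - RSS/TSS, computed by hand and compared with summary()
set.seed(1)
x <- 1:30
y <- 5 + 1.5 * x + rnorm(30, sd = 3)  # simulated data
fit <- lm(y ~ x)
rss <- sum(residuals(fit)^2)          # variability remaining after regression
tss <- sum((y - mean(y))^2)           # variability inherent in y
1 - rss / tss                         # equals summary(fit)$r.squared
```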

Simple linear regression is useful for prediction if there is only one predictor. But what if we had multiple predictors? Multiple linear regression allows for multiple predictors, and takes the following form:

For example, let’s take the statistician salary dataset, add a new predictor for college GPA, and add 10 new data points.

The multiple linear regression model for the dataset would take the form:

The multiple linear regression model would fit a plane to the dataset. The dataset is represented below as a 3D scatter plot with an *X*, *Y*, and *Z* axis.

In multiple linear regression, we’re interested in a few specific questions:

- Is at least one of the predictors useful in predicting the response?
- Do all of the predictors help explain *Y*, or only a few of them?
- How well does the model fit the data?
- How accurate is our prediction?

Similar to simple linear regression, the coefficient estimates in multiple linear regression are chosen based on the same least squares approach that minimizes RSS.

The interpretation of the coefficients is also very similar to the simple linear regression setting, with one key difference (indicated in bold). In general, we say that:

- If all of the predictors were 0, the prediction (*Y*) would be *β*₀, on average.
- For every one increase in some predictor *X*ⱼ, the prediction changes by *β*ⱼ, on average, **holding all of the other predictors constant**.

So, how do we determine whether or not there truly is a relationship between the _X_s and *Y*? In other words, how do we know that the _X_s are actually good predictors for *Y*?

Similar to the simple linear regression setting, we perform a hypothesis test. The null and alternative hypotheses are slightly different:

- Null Hypothesis: No relationship between the _X_s and *Y*
- Alternative Hypothesis: At least one predictor has a relationship to the response

*F*-statistic and *p*-value

So, how do we determine if at least one *β*ⱼ is non-zero? In simple regression, we determined the *t*-statistic. In multiple regression, we determine the *F*-statistic instead.

When there is no relationship between the response and predictors, we generally expect the *F*-statistic to be close to 1.

If there is a relationship, we generally expect the *F*-statistic to be greater than 1.

Similar to the *t*-statistic, the *F*-statistic also allows us to determine the *p*-value, which ultimately helps decide whether or not a relationship exists.

The *p*-value is essentially interpreted in the same way that it is interpreted in simple regression.

Typically, we want *p*-values less than 5% or 1% to reject the null hypothesis. In other words, rejecting the null hypothesis means that we are declaring that some relationship exists between the _X_s and *Y*.
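A base-R sketch of the *F*-statistic calculation on simulated data, checked against the value that `summary()` reports:

```r
# F = ((TSS - RSS) / p) / (RSS / (n - p - 1)), on simulated data
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- as.vector(1 + X %*% c(2, 0, -1) + rnorm(n))
fit <- lm(y ~ X)
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
f_stat <- ((tss - rss) / p) / (rss / (n - p - 1))
f_stat  # matches summary(fit)$fstatistic["value"]
```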

*t*-statistics in Multiple Linear Regression

In multiple linear regression, we will receive outputs that indicate the *t*-statistic and *p*-values for each of the different coefficients.

However, we have to use the overall *F*-statistic instead of the individual coefficient *p*-values. This is because when the number of predictors is large (e.g. p=100), about 5% of the coefficients will have low *p*-values less than 5% just by chance. Therefore, in this scenario, choosing whether or not to reject the null hypothesis based on the individual *p*-values would be flawed.

So, after concluding that at least one predictor is related to the response, how do we determine which specific predictors are significant?

This process is called variable selection, and there are three approaches: forward selection, backward selection, and mixed selection.

**Forward Selection**

Assume that we had a dataset of credit card balance and 10 predictors.

Forward selection begins with a null model with no predictors:

Then, 10 different simple linear regression models are built for each of the predictors:

The predictor that results in the lowest RSS is then added to the initial null model. Assume that the Limit variable is the variable that results in the lowest RSS. The forward selection model would then become:

Then, a second predictor is added to this new model, which will result in building 9 different multiple linear regression models for the remaining predictors:

The second predictor that results in the lowest RSS is then added to the model. Assume that the model with the Income variable resulted in the lowest RSS. The forward selection model would then become:

This process of adding predictors is continued until some statistical stopping rule is satisfied.

**Backward Selection**

Backward selection begins with a model with all predictors:

Then, the variable with the largest *p*-value is removed, and the new model is fit.

Again, the variable with the largest *p*-value is removed, and the new model is fit.

This process is continued until some statistical stopping rule is satisfied, such as all variables in the model having low *p*-values less than 5%.

**Mixed Selection**

Mixed selection is a combination of forward and backward selection. We begin with a null model that contains no predictors. Then, variables are added one by one, exactly as done in forward selection. However, if at any point the *p*-value for some variable rises above a chosen threshold, then it is removed from the model, as done in backward selection.
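Base R's `step()` function implements these strategies, though it uses AIC rather than raw RSS or *p*-values as its stopping rule. A sketch on the built-in `mtcars` data (the credit card dataset from the example is not a built-in dataset):

```r
# Forward selection with step() on mtcars; AIC is the stopping rule here
null_model <- lm(mpg ~ 1, data = mtcars)           # start from the null model
forward <- step(null_model, scope = ~ wt + hp + disp + cyl,
                direction = "forward", trace = 0)  # add predictors one by one
coef(forward)
# direction = "both" gives mixed selection: variables added by forward
# steps can later be dropped again
```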

Similar to the simple regression setting, RSE and *R*² are used to determine how well the model fits the data.

**RSE**

In multiple linear regression, RSE is calculated as RSE = √(RSS / (*n* − *p* − 1)).

**R-Squared (R²)**

*R*² is interpreted in the same manner that it is interpreted in simple regression.

However, in multiple linear regression, adding more predictors to the model will always result in an increase in *R*², even if the new predictors are only weakly associated with the response.

Therefore, it is important to look at the magnitude at which *R*² changes when adding or removing a variable. A small change will generally indicate an insignificant variable, whereas a large change will generally indicate a significant variable.

How do we estimate how accurate the actual predictions are? Confidence intervals and prediction intervals can help assess prediction accuracy.

**Confidence Intervals**

Confidence intervals are determined through the *β* coefficient estimates and their inaccuracy through the standard errors.

This means that confidence intervals only account for reducible error.

**Prediction Intervals**

Reducible error isn’t the only type of error that is present in regression modeling.

Even if we knew the true values of the *β* coefficients, we would not be able to predict the response variable perfectly because of random error *ϵ* in the model, which is an irreducible error.

Prediction intervals go a step further than confidence intervals by accounting for both reducible and irreducible error. This means that prediction intervals will always be wider than confidence intervals.
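This can be seen directly in R, since `predict()` returns both interval types. A sketch on simulated data:

```r
# Prediction intervals are wider than confidence intervals at the same level,
# because they also account for the irreducible error
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 2)  # simulated data
fit <- lm(y ~ x)
new_data <- data.frame(x = 25)
ci <- predict(fit, new_data, interval = "confidence", level = 0.95)
pi <- predict(fit, new_data, interval = "prediction", level = 0.95)
ci; pi  # same fitted value; the prediction interval is wider
```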

It is possible to have qualitative predictors in regression models. For example, assume that we had a predictor that indicated gender.

For qualitative variables with only two levels, we could simply create a “dummy” variable that takes on two values:

We’d simply use that logic to create a new column for the dummy variable in the data, and use that for regression purposes:

But what if the qualitative variable had more than two levels? For example, assume we had a predictor that indicated ethnicity:

In this case, a single dummy variable cannot represent all of the possible values. In this situation, we create multiple dummy variables:

For qualitative variables with multiple levels, there will always be one fewer dummy variable than the total number of levels. In this example, we have three ethnicity levels, so we create two dummy variables.

The new variables for regression purposes would be represented as follows:
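In R, `model.matrix()` makes this encoding concrete: a factor with three levels produces two dummy columns, with the omitted level acting as the baseline. A small sketch:

```r
# R creates dummy variables automatically from factors;
# model.matrix() shows the encoding
ethnicity <- factor(c("Asian", "Caucasian", "African American", "Asian"))
model.matrix(~ ethnicity)
# Three levels produce two dummy columns; the omitted level
# ("African American", first alphabetically) is the baseline
```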

Regression models provide interpretable results and work well, but make highly restrictive assumptions that are often violated in practice.

Two of the most important assumptions state that the relationship between the predictors and response are additive and linear.

The additive assumption means that the effect of changes in some predictor *X*ⱼ on the response is independent of the values of the other predictors.

The linear assumption states that the change in the response *Y* due to a one-unit change in some predictor *X*ⱼ is constant, regardless of the value of *X*ⱼ.

Assume that we had an advertising dataset of money spent on TV ads, money spent on radio ads, and product sales.

The multiple linear regression model for the data would have the form:

This model states that the effect on sales of a $1 increase in TV advertising spend is *β*₁, on average, regardless of the amount of money spent on radio ads.

However, it is possible that spending money on radio ads actually increases the effectiveness of TV ads, thus increasing sales further. This is known as an interaction effect or a synergy effect. The interaction effect can be taken into account by including an interaction term:

The interaction term relaxes the additive assumption. Now, every $1 increase in TV ad spend increases sales by *β*₁ + *β*₃(Radio).

Sometimes it is possible for the interaction term to have a low *p*-value, yet the main terms no longer have a low *p*-value. The hierarchical principle states that if an interaction term is included, then the main terms should also be included, even if the *p*-values for the main terms are not significant.

Interaction terms are also possible for qualitative variables, as well as a combination of qualitative and quantitative variables.

In some cases, the true relationship between the predictors and response may be non-linear. A simple way to extend the linear model is through polynomial regression.

For example, for automobiles, there is a curved relationship between miles per gallon and horsepower.

A quadratic model of the following form would be a great fit to the data:

When fitting linear models, there are six potential problems that may occur: non-linearity of data, correlation of error terms, nonconstant variance of error terms, outliers, high leverage data points, and collinearity.

Residual plots are a useful graphical tool for the identification of non-linearity.

In simple linear regression, the residuals are plotted against the predictor.

In multiple linear regression, the residuals are plotted against the predicted values.

If there is some kind of pattern in the residual plot, then it is an indication of potential non-linearity.

Non-linear transformations, such as log-transformation, of the predictors could be a simple method to solving the issue.

For example, take a look at the below residual graphs, which represent different types of fits for the automobile data mentioned previously. The graph on the left represents what the residuals look like if a simple linear model is fit to the data. Clearly, there is a curved pattern in the residual plot, indicating non-linearity. The graph on the right represents what the residuals look like if a quadratic model is fit to the data. Fitting a quadratic model seems to resolve the issue, as a pattern doesn’t exist in the plot.

Proper linear models should have residual terms that are uncorrelated. This means that the sign (positive or negative) of some residual should provide no information about the sign of the next residual.

If the error terms are correlated, we may have an unwarranted sense of confidence in the linear model.

To determine if correlated errors exist, we plot residuals in order of observation number. If the errors are uncorrelated, there should not be a pattern. If the errors are correlated, then we may see tracking in the graph. Tracking is where adjacent residuals have similar signs.

For example, take a look at the below graphs. The graph on the left represents a scenario in which residuals are not correlated. In other words, just because one residual is positive doesn’t seem to indicate that the next residual will be positive. The graph on the right represents a scenario in which the residuals are correlated. There are 15 residuals in a row that are all positive.

Proper linear models should also have residual terms that have a constant variance. The standard errors, confidence intervals, and hypothesis tests associated with the model rely on this assumption.

Nonconstant variance in errors is known as heteroscedasticity. It is identified as the presence of a funnel shape in the residual plot.

One solution to nonconstant variance is to transform the response using a concave function, such as *log*(*Y*) or the square root of *Y*. Another solution is to use weighted least squares instead of ordinary least squares.

The graphs below represent the difference between constant and nonconstant variance. The residual plot on the right has a funnel shape, indicating nonconstant variance.

Depending on how many outliers are present and their magnitude, they could either have a minor or major impact on the fit of the linear model. However, even if the impact is small, they could cause other issues, such as impacting the confidence intervals, *p*-values, and *R*².

Outliers are identified through various methods. The most common is studentized residuals, where each residual is divided by its estimated standard error. Studentized residuals greater than 3 in absolute value are possible outliers.

These outliers can be removed from the data to come up with a better linear model. However, it is also possible that outliers indicate some kind of model deficiency, so caution should be taken before removing the outliers.

The red data point below represents an example of an outlier that would greatly impact the slope of a linear regression model.

Observations with high leverage have an unusual value compared to the other observation values. For example, you might have a dataset of *X* values between 0 and 10, and just one other data point with a value of 20. The value of 20 is a high leverage data point.

High leverage is determined through the leverage statistic. The leverage statistic is always between 1/*n* and 1.

The average leverage is defined as (*p* + 1)/*n*.

If the leverage statistic of a data point is greatly higher than the average leverage, then we have reason to suspect high leverage.
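A base-R sketch of this check, using a made-up dataset with one deliberately extreme *x* value:

```r
# Flag high-leverage points by comparing hatvalues() to the average leverage
set.seed(1)
x <- c(1:10, 20)                # the value 20 is far from the rest
y <- 2 * x + rnorm(11)
fit <- lm(y ~ x)
lev <- hatvalues(fit)           # leverage statistic for each observation
avg_lev <- (1 + 1) / length(x)  # (p + 1)/n with p = 1 predictor
which(lev > 3 * avg_lev)        # only the 11th point is flagged
```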

The red data point below represents an example of a high leverage data point that would impact the linear regression fit.

Collinearity refers to the situation in which 2 or more predictor variables are closely related. Collinearity makes it difficult to separate out the individual effects of collinear variables on the response. It also reduces the accuracy of the estimates of the regression coefficients by causing the coefficient standard errors to grow, thus reducing the credibility of hypothesis testing.

A simple way to detect collinearity is to look at the correlation matrix of the predictors.

However, not all collinearity can be detected through the correlation matrix. It is possible for collinearity to exist between multiple variables instead of pairs of variables, which is known as multicollinearity.

The better way to assess collinearity is through the Variance Inflation Factor (VIF). The VIF of a coefficient is its variance when fitting the full model divided by its variance when fitting a model with that predictor alone. The smallest possible VIF is 1. In general, a VIF that exceeds 5 or 10 may indicate a collinearity problem.
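The VIF can also be computed by hand as 1/(1 − *R*²ⱼ), where *R*²ⱼ comes from regressing predictor *j* on all of the others (the `vif()` function in the `car` package automates this). A sketch with simulated collinear data:

```r
# VIF by hand: regress one predictor on the others, then 1 / (1 - R^2)
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)  # x2 is nearly collinear with x1
x3 <- rnorm(100)
r2_x2  <- summary(lm(x2 ~ x1 + x3))$r.squared
vif_x2 <- 1 / (1 - r2_x2)
vif_x2  # far above the 5-10 rule of thumb, signaling collinearity
```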

One way to solve the issue of collinearity is to simply drop one of the predictors from the linear model. Another solution is to combine collinear variables into one variable.

The chart below demonstrates an example of collinearity. As we know, an individual’s credit limit is directly related to their credit rating. A model built from a dataset that includes both of these predictors should use only one of them for regression purposes, to avoid the issue of collinearity.

*Originally published at* https://www.bijenpatel.com *on August 3, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
library(MASS) # For model functions
library(ISLR) # For datasets
library(ggplot2) # For plotting
# Working with the Boston dataset to predict median house values
head(Boston)
# Fit a simple linear regression model
# Median house value is the response (Y)
# Percentage of low income households in the neighborhood is the predictor
Boston_lm = lm(medv ~ lstat, data=Boston)
# View the fitted simple linear regression model
summary(Boston_lm)
# View all of the objects stored in the model, and get one of them, such as the coefficients
names(Boston_lm)
Boston_lm$coefficients
# 95% confidence interval of the coefficients
confint(Boston_lm, level=0.95)
# Use the model to predict house values for specific lstat values
lstat_predict = c(5, 10, 15)
lstat_predict = data.frame(lstat=lstat_predict)
predict(Boston_lm, lstat_predict) # Predictions
predict(Boston_lm, lstat_predict, interval="confidence", level=0.95)
predict(Boston_lm, lstat_predict, interval="prediction", level=0.95)
# Use ggplot to create a residual plot of the model
Boston_lm_pred_resid = data.frame(Prediction=Boston_lm$fitted.values, Residual=Boston_lm$residuals)
Boston_resid_plot = ggplot(Boston_lm_pred_resid, aes(x=Prediction, y=Residual)) +
geom_point() +
labs(title="Boston Residual Plot", x="House Value Prediction", y="Residual")
plot(Boston_resid_plot)
# Use ggplot to create a studentized residual plot of the model (for outlier detection)
Boston_lm_pred_Rstudent = data.frame(Prediction=Boston_lm$fitted.values, Rstudent=rstudent(Boston_lm))
Boston_Rstudent_plot = ggplot(Boston_lm_pred_Rstudent, aes(x=Prediction, y=Rstudent)) +
geom_point() +
labs(title="Boston Rstudent Plot", x="House Value Prediction", y="Rstudent")
plot(Boston_Rstudent_plot)
# Determine leverage statistics for the lstat values (for high leverage detection)
Boston_leverage = hatvalues(Boston_lm)
head(order(-Boston_leverage), 10)
```

```
# Multiple linear regression model with two predictors
# Median house value is the response (Y)
# Percentage of low income households in the neighborhood is the first predictor (X1)
# Percentage of houses in the neighborhood built before 1940 is the second predictor (X2)
Boston_lm_mult_1 = lm(medv ~ lstat + age, data=Boston)
summary(Boston_lm_mult_1)
## Coefficients:
## (Intercept) lstat age
## 33.22276 -1.03207 0.03454
# Multiple linear regression model with all predictors
# Median house value is the response (Y)
# Every variable in the Boston dataset is a predictor (X)
Boston_lm_mult_2 = lm(medv ~ ., data=Boston)
# Multiple linear regression model with all predictors except specified (age)
Boston_lm_mult_3 = lm(medv ~ . -age, data=Boston)
# Multiple linear regression model with an interaction term
Boston_lm_mult_4 = lm(medv ~ crim + lstat:age, data=Boston)
# The colon ":" will include an (lstat)(age) interaction term
Boston_lm_mult_5 = lm(medv ~ crim + lstat*age, data=Boston)
# The asterisk "*" will include an (lstat)(age) interaction term
# It will also include the terms by themselves, without having to specify separately
# Multiple linear regression model with nonlinear transformation
Boston_lm_mult_6 = lm(medv ~ lstat + I(lstat^2), data=Boston)
# Multiple linear regression model with polynomial terms up to the 5th degree (easier method)
Boston_lm_mult_7 = lm(medv ~ poly(lstat, 5), data=Boston)
# Multiple linear regression model with log transformation
Boston_lm_mult_8 = lm(medv ~ log(rm), data=Boston)
# ANOVA test to compare two nested regression models (linear vs quadratic lstat)
anova(Boston_lm, Boston_lm_mult_6)
# Null Hypothesis: Both models fit the data equally well
# Alternative Hypothesis: The second model is superior to the first
# 95% confidence interval for the coefficients
confint(Boston_lm_mult_1, level=0.95)
```

```
# The Carseats data has a qualitative variable for the quality of shelf location
# It takes on one of three values: Bad, Medium, Good
# R automatically generates dummy variables ShelveLocGood and ShelveLocMedium
# Multiple linear regression model to predict carseat sales
Carseats_lm_mult = lm(Sales ~ ., data=Carseats)
summary(Carseats_lm_mult)
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## ...
## ShelveLocGood 4.8501827 0.1531100 31.678 < 2e-16 ***
## ShelveLocMedium 1.9567148 0.1261056 15.516 < 2e-16 ***
## ...
```

This is a summary of chapter 2 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

Assume that we have an advertising dataset that consists of TV advertising spend, radio advertising spend, newspaper advertising spend, and product sales.

We could build a multiple linear regression model to predict sales by using the different types of advertising spend as predictors.

- Product sales would be the response variable (*Y*)
- The different advertising spends would be the predictors (*X*₁, *X*₂, *X*₃)

The model could be written in the following general form:

*Y* = *f*(*X*) + *e*

The symbol *f* represents the systematic information that *X* provides about *Y*. Statistical learning refers to a set of approaches for estimating *f*.

There are two main reasons for estimating *f*: prediction and inference.

**Prediction**

Prediction refers to any scenario in which we want to come up with an estimate for the response variable *Y*.

**Inference**

Inference involves understanding the relationship between *X* and *Y*, as opposed to predicting *Y*. For example, we may ask the following questions:

- Which predictors *X* are associated with the response variable *Y*?
- What is the relationship between the predictor and response?
- What type of model best explains the relationship?

There are two main methods of estimating *f*: parametric and nonparametric.

**Parametric**

Parametric methods involve the following two-step model-based approach:

- Make an assumption about the functional form (linear, lognormal, etc.) of *f*
- After selecting a model, use training data to fit or train the model

**Nonparametric**

On the other hand, nonparametric methods do not make explicit assumptions about the functional form of *f*.

Instead, they attempt to get as close to the data points as possible, without being too rough or too smooth. They have the potential to accurately fit a wider range of possible shapes for *f*. Spline models are common examples of nonparametric methods.

**Parametric vs Nonparametric**

Nonparametric methods have the potential to be more accurate due to being able to fit a wider range of shapes. However, they are more likely to suffer from the issue of overfitting to the data.

Additionally, parametric methods are much more interpretable than nonparametric methods. For example, it is much easier to explain the results of a linear regression model than it is to explain the results of a spline model.

In statistical learning, we’re not only interested in the type of model to fit to our data, but also how well the model fits the data.

One way to determine how well a model fits is by comparing the predictions to the actual observed data.

In regression, the most commonly used measure of model accuracy is the mean squared error (MSE):

MSE = (1/*n*) Σ (*yᵢ* − *ŷᵢ*)²

- *n* — the number of observations in the data
- *yᵢ* — the actual response value in the data
- *ŷᵢ* (y-hat) — the predicted response value
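The MSE computation is simple enough to sketch directly. Here is a minimal Python illustration (the post promises Python equivalents of its R examples; the function name and data here are made up for illustration):

```python
# Mean squared error: average of the squared differences between
# actual responses (y) and predicted responses (y-hat)
def mse(y_actual, y_predicted):
    n = len(y_actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_actual, y_predicted)) / n

# Three observations with squared errors 1, 0, and 4
print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))  # (1 + 0 + 4) / 3
```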

However, we should usually be interested in the error of the test dataset instead of the training dataset. This is because we usually want the model that best predicts the future, and not the past. Additionally, more flexible models will reduce training error, but may not necessarily reduce test error. Therefore, the test error should be of higher concern.

In general, to minimize the expected test error, a model that has low variance and low bias should be chosen.

Bias refers to the error introduced by approximating a complicated real-world problem with a much simpler model. More flexible models have less bias because they impose fewer assumptions on the shape of *f*.

Variance refers to how much the estimated model would change if it were fit to a different training dataset. More flexible models have higher variance.

The relationship among bias, variance, and the test error is known as the bias-variance tradeoff. The challenge lies in finding the method in which both bias and variance are low.

In classification, the most commonly used measure of model accuracy is the error rate, the proportion of mistakes made:

Error Rate = (1/*n*) Σ *I*(*yᵢ* ≠ *ŷᵢ*)

where *I* is an indicator that equals 1 when the predicted class *ŷᵢ* differs from the actual class *yᵢ*, and 0 otherwise.

However, again, we should usually be interested in the error rate of the test dataset instead of the training dataset.
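The error rate is just as easy to compute as the MSE. A minimal Python sketch (illustrative names and labels, not from the book's labs):

```python
# Error rate: the fraction of observations whose predicted class
# differs from the actual class
def error_rate(y_actual, y_predicted):
    mistakes = sum(1 for y, y_hat in zip(y_actual, y_predicted) if y != y_hat)
    return mistakes / len(y_actual)

# Two of the four predictions are wrong
print(error_rate(["Up", "Down", "Up", "Up"], ["Up", "Up", "Up", "Down"]))  # 0.5
```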

The test error rate is minimized, on average, by a very simple classifier known as the Bayes’ classifier, which assigns each observation to its most likely class, given its predictor values.

For example, assume that we knew for a fact that 70% of all people who make more than $100,000 per year were STEM graduates.

In a two-class setting, the Bayes’ classifier predicts one class if its conditional probability is greater than 50%, and the other class otherwise. The Bayes’ decision boundary is the set of points at which the probability is exactly 50%, and the classifier’s predictions are determined by this boundary.

For example, if we were given a test dataset of just salary values, we’d simply assign any salaries greater than $100,000 as STEM graduates, and salary values less than $100,000 as non-STEM graduates.
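The threshold rule from this hypothetical example can be written as a one-line decision function. This is only a sketch of the text's made-up scenario (the boundary, labels, and 70% probability are all from the hypothetical, not real data):

```python
# Hypothetical Bayes decision rule: above the $100,000 boundary,
# P(STEM | salary) = 0.7 > 0.5, so predict "STEM"; otherwise "non-STEM"
def bayes_classify(salary, boundary=100_000):
    return "STEM" if salary > boundary else "non-STEM"

print(bayes_classify(120_000))  # "STEM"
print(bayes_classify(60_000))   # "non-STEM"
```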

The Bayes’ classifier produces the lowest possible test error rate, called the Bayes’ error rate.

In theory, we would always like to predict classifications using the Bayes’ classifier. However, we do not always know the probability of a class, given some predictor value. We have to estimate this probability, and then classify the data based on the estimated probability.

The K-Nearest Neighbors (KNN) classifier is a popular method of estimating conditional probability. Given a positive integer *K* and some test observation, the KNN classifier identifies the *K* points in the training data that are closest to the test observation. These closest *K* points are represented by *N*₀. Then, it estimates the conditional probability for a class as the fraction of points in *N*₀ that represent that specific class. Lastly, KNN will apply the Bayes’ rule and classify the test observation to the class with the largest probability.
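The KNN procedure described above — find the *K* closest training points, estimate each class probability as its fraction among them, and pick the largest — can be sketched in a few lines of Python. This is a toy illustration with a single 1-D predictor and made-up data, not the book's implementation:

```python
from collections import Counter

# Classify one test observation x0 using its K nearest training points
def knn_classify(train_x, train_y, x0, k):
    # Identify the K training points closest to x0 (the set N0)
    neighbors = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - x0))[:k]
    # Estimate each class's conditional probability as its fraction within N0
    counts = Counter(label for _, label in neighbors)
    # Apply the Bayes rule: predict the class with the largest probability
    return counts.most_common(1)[0][0]

train_x = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(train_x, train_y, 2.5, k=3))   # "A"
print(knn_classify(train_x, train_y, 10.5, k=3))  # "B"
```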

However, the choice of the *K* value is very important. Lower values of *K* are more flexible (low bias, high variance), whereas higher values are less flexible and have more bias. Similar to the regression setting, a bias-variance tradeoff exists.

*Originally published at* https://www.bijenpatel.com/ *on August 2, 2020.*

I will be releasing the equivalent Python code for these examples soon. Subscribe to get notified!

```
# Assign a vector to a new object named "data_vector"
data_vector = c(1, 2, 3, 4)
# List of all objects
ls()
# Remove an object
rm(data_vector)
# Create two vectors of 50 random numbers
# with mean of 0 and standard deviation of 1
random_1 = rnorm(50, mean=0, sd=1)
random_2 = rnorm(50, mean=0, sd=1)
# Create a dataframe from the vectors
# with columns labeled as X and Y
data_frame = data.frame(X=random_1, Y=random_2)
# Get the dimensions (rows, columns) of a dataframe
dim(data_frame)
## 50 2
# Get the class types of columns in a dataframe
sapply(data_frame, class)
## X Y
## "numeric" "numeric"
# Easily omit any NA values from a dataframe
data_frame = na.omit(data_frame)
# See the first 5 rows of the dataframe
head(data_frame, 5)
## X Y
## 1 1.3402318 -0.2318012
## 2 -1.8688186 1.0121503
## 3 2.9939211 -1.7843108
## 4 -0.9833264 -1.0518947
## 5 -1.2800747 -0.4674771
# Get a specific column (column X) from the dataframe
data_frame$X
## 1.3402318 -1.8688186 2.9939211 -0.9833264 -1.2800747
# Get the mean, variance, or standard deviation
mean(random_1)
## -0.1205189
var(random_1)
## 1.096008
sd(random_1)
## 1.046904
```

```
# Load the ggplot2 package
library(ggplot2)
# Create a scatter plot using the ggplot2 package
# Reference columns by name inside aes() rather than with data_frame$
plot_1 = ggplot(data_frame, aes(x=X, y=Y)) +
geom_point(aes(col=X)) +
coord_cartesian(xlim=c(-3, 3), ylim=c(-3, 3)) +
labs(title="Random X and Random Y", subtitle="Random Numbers", y="Y Random", x="X Random") +
scale_x_continuous(breaks=seq(-3, 3, 0.5)) +
scale_y_continuous(breaks=seq(-3, 3, 0.5))
# Visualize the ggplot
plot(plot_1)
```

This is a summary of chapter 1 of the *Introduction to Statistical Learning* textbook. I’ve written a 10-part guide that covers the entire book. The guide can be read at my website, or here at Hashnode. Subscribe to stay up to date on my latest Data Science & Engineering guides!

Statistical learning simply refers to the broad set of tools that are available for understanding data. There are two main types of statistical learning: supervised and unsupervised.

Supervised learning involves building statistical models to predict outputs (*Y*) from inputs (*X*). For example, assume that we have a salary dataset for statisticians. The dataset consists of the experience level and salary for 10 different statisticians.

We could build a simple linear regression model to predict the salary of statisticians by using experience level as a predictor. This is an example of supervised learning, where we have supervising outputs (salary values) that guide us in developing a statistical model to determine the relationship between experience level and salary.

In general, there are two main types of supervised learning: regression and classification.

Predicting a quantitative output is known as a regression problem. For example, predicting someone’s salary is a regression problem.

Predicting a qualitative output is known as a classification problem. For example, predicting whether a stock will go up or down is a classification problem.

Unsupervised learning involves building statistical models to determine relationships from inputs (*X*). There are no supervising outputs. For example, assume that we have a customer dataset. The dataset consists of the annual salary and annual spend on Amazon for 10 different individuals.

We could use a statistical clustering algorithm to group customers by their purchasing behavior. This is an example of unsupervised learning, where we **do not** have supervising outputs that already inform us which customers are low spenders, average spenders, or high spenders. Instead, we have to come up with the determination ourselves.

In general, there are two main types of unsupervised learning: clustering and association.

Determining groupings is known as a clustering problem. For example, grouping customers together based on purchasing behavior is a clustering problem.

Determining rules that describe large portions of a dataset is known as an association problem. For example, determining that people who buy *X* also buy *Y* is an association problem. A modern real-world example of this is Amazon’s “frequently bought together” product recommendations.

*Originally published at* https://www.bijenpatel.com *on August 1, 2020.*