asb: head /dev/brain > /dev/www

My home, musings, and wanderings on the world wide web.

EDA: Plotting least squares fit line in R

I have recently started reading ISLR (An Introduction to Statistical Learning with Applications in R) and am finding the plots in the book very useful.

A visualization aid one often uses for exploratory data analysis is a scatter plot of the response variable against a potential predictor. Overlaying the ordinary least squares fit line on this scatter plot gives a readily accessible visual representation of the linear association (if any) between the predictor and the response.
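As a minimal sketch of the idea, the snippet below simulates a small dataset, draws the scatter plot, and overlays the fitted line with `abline()`. The variable names and the true slope of 2 are made up purely for illustration.

```r
## Simulated data: purely illustrative, not taken from ISLR.
set.seed(1)
predictor = runif(50, min = 0, max = 10)
response = 1 + 2 * predictor + rnorm(50)

## Fit the simple linear regression by ordinary least squares.
fit = lm(response ~ predictor)

## Scatter plot of the response against the predictor, with the fit overlaid.
plot(x = predictor, y = response, col = "red")
abline(fit, col = "blue", lwd = 2)
```

Passing the fitted `lm` object to `abline()` draws the line from its intercept and slope, which is the same approach the function in this post relies on.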

Below is a simple R snippet that draws such plots for any data frame with a numeric response variable. The function plots only the numeric (or integer) predictors and skips the rest. It also crudely adjusts the plot layout to the number of predictors: up to three share a single row, and more are arranged in two columns.

Plotting OLS fit of features against the response
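plotLeastSqFit = function(df, responseVar) {
    stopifnot(is.data.frame(df), responseVar %in% colnames(df), is.numeric(df[[responseVar]]))
    ## Numeric (including integer) columns other than the response are the predictors.
    areNumeric = setdiff(colnames(df)[sapply(df, is.numeric)], responseVar)
    ## Fail early with a clear message if there is nothing to plot.
    stopifnot(length(areNumeric) > 0)
    ## Crude layout: up to 3 plots share one row, otherwise use 2 columns.
    if (length(areNumeric) <= 3) {
        mfRow = c(1, length(areNumeric))
    } else {
        mfRow = c(ceiling(length(areNumeric)/2), 2)
    }
    ## Restore the caller's graphical parameters when the function exits.
    oldPar = par(mfrow = mfRow)
    on.exit(par(oldPar))
    ## invisible() keeps the list of NULLs returned by lapply() off the console.
    invisible(lapply(X = areNumeric, FUN = function(x) {
        plot(y = df[[responseVar]], x = df[[x]], col = "red", lwd = 1.5, ylab = responseVar,
            xlab = x, main = sprintf("LS fit of %s against %s", responseVar,
                x))
        ## Backticks let the formula cope with column names that contain spaces.
        abline(lm(as.formula(sprintf("`%s` ~ `%s`", responseVar, x)), data = df), col = "blue",
            lwd = 2)
    }))
}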
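To see what the column selection and the layout rule do, here is a small sketch on a toy data frame. The data frame `toy` and the helper `chooseLayout` are hypothetical names introduced only for this illustration; the helper simply repeats the layout rule used in `plotLeastSqFit`.

```r
## Toy data frame (hypothetical): two numeric predictors, one factor, one response.
toy = data.frame(y = c(1.2, 2.3, 3.1, 4.8),
                 a = c(1L, 2L, 3L, 4L),
                 b = c(0.5, 0.1, 0.9, 0.3),
                 g = factor(c("u", "v", "u", "v")))

## is.numeric() is TRUE for integer and double columns, FALSE for factors.
numericCols = colnames(toy)[sapply(toy, is.numeric)]
predictors = setdiff(numericCols, "y")
print(predictors)                         # "a" "b"

## Same rule as in plotLeastSqFit: one row for up to 3 plots, else 2 columns.
chooseLayout = function(n) {
    if (n <= 3) c(1, n) else c(ceiling(n/2), 2)
}
print(chooseLayout(length(predictors)))   # 1 2
print(chooseLayout(5))                    # 3 2
```

With an odd number of predictors above three, the last cell of the grid stays empty. Base R also offers `grDevices::n2mfrow()` as a ready-made way to choose a grid for a given number of plots, if a different arrangement is preferred.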

Here are sample plots from this function for two datasets: mtcars, which ships with base R, and the Advertising dataset from ISLR.

For the mtcars dataset
## mtcars ships with base R (package "datasets"), so no extra library is needed.
data(mtcars)
## Choose only a few columns.
plotLeastSqFit(df = mtcars[c("mpg", "cyl", "hp", "wt")], responseVar = "mpg")

Figure: least squares fit of mpg against cyl, hp, and wt for the mtcars dataset.
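The slopes of the fitted lines for mtcars can also be checked numerically. The following sketch refits the same three simple regressions and extracts each slope; the name `slopes` is introduced here only for illustration.

```r
data(mtcars)
predictors = c("cyl", "hp", "wt")

## Fit mpg against each predictor separately and keep the slope coefficient.
slopes = sapply(predictors, function(x) {
    fit = lm(as.formula(paste("mpg ~", x)), data = mtcars)
    unname(coef(fit)[2])
})
print(slopes)
```

All three slopes are negative, which matches the downward-sloping lines in the plots: mpg falls as cylinders, horsepower, or weight increase.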

For the Advertising dataset, Ch. 2, ISLR
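Advertising = read.csv("http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv")
## Drop the column X, which only holds the row indices from the CSV.
Advertising[["X"]] = NULL
## Add a uniform random column as a predictor that is unrelated to Sales.
## Note: set.seed() coerces pi to the integer 3.
set.seed(pi)
Advertising[["random"]] = runif(nrow(Advertising))
plotLeastSqFit(df = Advertising, responseVar = "Sales")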

Figure: least squares fit of Sales against each numeric predictor, including the added random column, for the Advertising dataset (Ch. 2, ISLR).