So, continuing with the beautiful plots in ISLR, here is a discussion I had on SO today about how to plot decision boundaries or arbitrary non-linear curves. The discussion on SE that was linked to in the answer was even more useful. Plus, I picked up two new functions today: curve and contour.
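As a rough illustration of the idea (the toy data, the quadratic logistic model, and all object names below are assumptions of mine, not the snippet from the SO answer), contour can draw a decision boundary by evaluating a fitted model over a grid, while curve handles any boundary that can be written as $y = f(x)$:

```r
# Sketch: draw a non-linear decision boundary by evaluating a fitted
# classifier on a grid and handing the result to contour()
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- as.integer(x1^2 + x2^2 + rnorm(n, sd = 0.5) > 1.5)
fit <- glm(y ~ poly(x1, 2) + poly(x2, 2), family = binomial)

# evaluate the fitted probability over a grid and draw the p = 0.5 contour
g1 <- seq(min(x1), max(x1), length.out = 100)
g2 <- seq(min(x2), max(x2), length.out = 100)
grid <- expand.grid(x1 = g1, x2 = g2)
p <- matrix(predict(fit, newdata = grid, type = "response"), nrow = length(g1))

plot(x1, x2, col = ifelse(y == 1, "red", "blue"), pch = 20)
contour(g1, g2, p, levels = 0.5, add = TRUE, lwd = 2)

# curve() is handy when the boundary can be written as y = f(x); here the true
# boundary is (roughly) the circle x1^2 + x2^2 = 1.5, so its upper half is:
curve(sqrt(1.5 - x^2), from = -1.2, to = 1.2, add = TRUE, lty = 2)
```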
ISLR: Notes - Chapter 2
Non-parametric methods seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly. Non-parametric approaches completely avoid the danger of the chosen functional form being too far from the true $f$. The disadvantage of non-parametric methods is that they need a large set of observations to obtain an accurate estimate of $f$. Therefore, the informational requirements of non-parametric methods are larger.
Contrast this to parametric methods. Parametric methods essentially extrapolate information from one region of the domain to another. This is because global regularities are assumed in the functional form. A non-parametric method however has to trace the surface $f$ in all regions of the domain to be valid.
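To make the contrast concrete, here is a small sketch (the simulated data and the smoothing parameter are arbitrary choices of mine) comparing a parametric fit, which assumes a global functional form, with a non-parametric one, which has to trace $f$ locally:

```r
# Parametric vs non-parametric fit on the same simulated data
set.seed(1)
x <- sort(runif(200, 0, 10))
y <- sin(x) + rnorm(200, sd = 0.3)

plot(x, y, pch = 20, col = "grey")
# parametric: a global linear form is assumed, so relatively few points suffice
abline(lm(y ~ x), col = "blue", lwd = 2)
# non-parametric: loess traces the surface locally and needs data everywhere
lines(x, predict(loess(y ~ x, span = 0.3)), col = "red", lwd = 2)
legend("topright", legend = c("linear (parametric)", "loess (non-parametric)"),
       col = c("blue", "red"), lwd = 2)
```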
EDA: Plotting least squares fit line in R
I have recently started reading ISLR and am finding the plots in the book very useful.
A visualization aid one often uses for exploratory data analysis is a scatter plot of the response variable against a potential predictor. Overlaying the ordinary least squares fit line on this scatter provides a readily accessible visual representation of the effect of the predictor on the response (if any).
Following is a simple snippet that I wrote in R to plot such graphs for an arbitrary dataset with a numeric response variable. Note that the function only attempts the plots for predictors that are numeric (or integer). It also attempts a crude adjustment of the plot layout according to the number of predictors.
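A minimal sketch of what such a function could look like (the name plot_numeric_predictors and the grid-layout heuristic are illustrative, not necessarily the original snippet):

```r
# Sketch: scatter plots of a numeric response against each numeric predictor,
# with the OLS fit line overlaid
plot_numeric_predictors <- function(data, response) {
  y <- data[[response]]
  stopifnot(is.numeric(y))
  # keep only numeric/integer predictors, excluding the response itself
  predictors <- setdiff(names(data)[sapply(data, is.numeric)], response)
  n <- length(predictors)
  if (n == 0) return(invisible(NULL))
  # crude layout: a roughly square grid of panels
  nrow_panels <- ceiling(sqrt(n))
  ncol_panels <- ceiling(n / nrow_panels)
  op <- par(mfrow = c(nrow_panels, ncol_panels))
  on.exit(par(op))
  for (p in predictors) {
    plot(data[[p]], y, xlab = p, ylab = response, pch = 20)
    abline(lm(y ~ data[[p]]), col = "red", lwd = 2)
  }
}

# Example usage with a built-in dataset
plot_numeric_predictors(mtcars, "mpg")
```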
R: Finding identifier variables
I had never expected such a problem, much less a solution, to exist until I was asked to solve it yesterday. The problem statement: given a dataset and a list of candidate variables, find which minimal combination of them, if any, is a valid identifier for the observations in the dataset, i.e. uniquely identifies each row.
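One way to attack this, sketched below with an illustrative function name (find_identifier), is to test candidate combinations in increasing order of size and return the first one under which no two rows share the same values:

```r
# Sketch: return the smallest combination of candidate variables that uniquely
# identifies every observation, or NULL if none does
find_identifier <- function(data, candidates) {
  for (k in seq_along(candidates)) {
    for (vars in combn(candidates, k, simplify = FALSE)) {
      # a valid identifier has no duplicated combination of values
      if (!any(duplicated(data[vars]))) return(vars)
    }
  }
  NULL
}

# Example usage (whether any subset works depends entirely on the data)
find_identifier(mtcars, c("cyl", "gear", "wt", "qsec"))
```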
The corrected AIC
Only today did I discover that the Akaike Information Criterion is valid only asymptotically and that there exists a correction (in fact, a strongly recommended correction) for finite samples. Here is a quick copy-paste from Wikipedia.
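In the usual notation, for a model with $k$ estimated parameters fit to $n$ observations, the corrected criterion is

$$\mathrm{AIC}_c = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1},$$

so the extra penalty vanishes as $n \to \infty$ and $\mathrm{AIC}_c$ converges to the ordinary AIC. The usual recommendation is to prefer $\mathrm{AIC}_c$ whenever $n/k$ is small (a threshold of less than about 40 is often quoted).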