Modern Nonparametric Regression

May 20, 2019
LOWESS, penalized least squares and the cubic spline.
In regression, we also face the problem of underfitting / overfitting.
The red line is severe oversmoothing, and the blue line is severe undersmoothing.
R code for the visual:
```r
library(tidyverse)
library(ggpubr)

set.seed(42)
df <- tibble(
  x = seq(-3, 3, by = 0.1),
  y = dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)
)

ggline(df, x = "x", y = "y", color = "#0073C2", numeric.x.axis = TRUE) +
  geom_smooth(color = "#EFC000", method = "loess", se = FALSE) +
  geom_abline(slope = 0, intercept = mean(df$y), color = "#CD534C")
```
Here under-smoothing is interpolation, or connecting the dots. It gives a perfect fit to the data but is useless for anything else. Over-smoothing is ignoring $x$ altogether and fitting $\bar{y}$ as the predicted value everywhere. It's useless for inference. The question is: can we use the data to learn about the relationship between $x$ and $y$?
We'll focus on a single explanatory variable. The data is $(x_1, y_1), \dots, (x_n, y_n)$. A simple linear model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

relates $x$ and $y$ in a linear fashion. In standard parametric approaches, we assume $\epsilon_i \sim N(0, \sigma^2)$.
We don't want to make the normality (or any other distributional) assumption on $\epsilon_i$. We also don't want to restrict ourselves to a linear fit, which won't always be appropriate. The more general form is

$$y_i = f(x_i) + \epsilon_i,$$

assuming $E(\epsilon_i) = 0$. The function $f$ is unspecified, and we want to learn it from the data. We're being model-free and distribution-free.
A popular method is called "locally weighted scatterplot smoothing" - LOWESS, or LOESS for "locally estimated scatterplot smoothing". The basic idea is to fit a linear (or quadratic) polynomial at each data point using "weighted least squares" (parameter estimation with different data points weighted differently), with weights that decrease as we move away from the $x_i$ at which we are fitting. Points more than a specified distance away (controlled by the span) are assigned a zero weight. The weight function is maximum at the $x_i$ value at which the fit is being made, and is symmetric about that point.
Things to consider include:
  • $d$: degree of polynomial to be fitted at each point (usually 1 or 2).
  • $s$: span, i.e. the width of the neighborhood.
  • $w(\cdot)$: the weight function.
With these in hand, a localized (in the neighborhood), weighted least squares procedure is used to get an estimate of $f$ at $x_i$, call it $\hat{f}(x_i)$. The window slides across each $x_i$, and fits a (weighted) linear ($d = 1$) or quadratic ($d = 2$) polynomial in each window.
R code for the figure:
```r
library(tidyverse)
library(ggpubr)

set.seed(42)
df <- tibble(
  x = seq(-3, 3, by = 0.1),
  y = dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1),
  point.type = ifelse(x > -1.5 & x < -0.5, "Neighborhood", "Distant")
)

ggscatter(df, "x", "y", color = "point.type", palette = "jco") +
  geom_smooth(data = subset(df, x > -1.5 & x < -0.5), method = "lm")
```
Blue points are within the neighborhood of the target point $x_0 = -1$. An OLS regression is fit on these points.
We're looking for the general pattern, not (necessarily) local structure. This guides us in the choice of bandwidth and of $d$ (less critical). Often, within a small neighborhood, functions will be roughly linear (or quadratic), so $d = 1$ is usually good enough. Note that if the function is very unsmooth (lots of quick ups and downs), this approach won't work well.
Choice of neighborhood size (bandwidth / span) is a bit of trial and error, as it depends on $f$. The R default for lowess() includes 2/3 of the data in each sliding window, which will often oversmooth. As the span increases, so does the smoothing of the curve.
For the weighting kernel (not as crucial), a popular choice is the tricube kernel

$$w(u) = (1 - |u|^3)^3 \text{ for } |u| < 1, \qquad w(u) = 0 \text{ otherwise.}$$

This gives the weights for weighted least squares. If the farthest points to be included in the "local" neighborhood are at a distance $\Delta(x_i)$ from $x_i$, then $x_j$ (in the neighborhood) gets weight

$$w_j = \left(1 - \left|\frac{x_j - x_i}{\Delta(x_i)}\right|^3\right)^3.$$
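As a concrete sketch of one step of this procedure (the target point, span, and variable names here are illustrative choices, not part of any R default):

```r
# One local weighted fit at a target point x0 -- illustrative sketch.
set.seed(42)
x <- seq(-3, 3, by = 0.1)
y <- dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)

x0   <- -1                       # point at which we fit
span <- 0.5                      # half-width of the neighborhood (arbitrary here)
dist <- abs(x - x0)              # distances from x0

# Tricube weights: positive inside the neighborhood, exactly 0 outside.
w <- ifelse(dist < span, (1 - (dist / span)^3)^3, 0)

# Weighted least squares; points with weight 0 don't influence the fit.
fit   <- lm(y ~ x, weights = w)
f_hat <- predict(fit, newdata = data.frame(x = x0))  # local estimate of f(x0)
```

Sliding `x0` across every data point and collecting the resulting `f_hat` values traces out the LOWESS curve.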
This is done at each data point, so we get $n$ total fits. Another possibility is to robustify the procedure by down-weighting possible outliers. In R, the lowess() function takes the following default values:
  • Window of 2/3 of the data, controlled by the argument f, the smoother span.
  • 3 iterations of robustification, controlled by the argument iter.
There’s also a formula-based loess() function which is pretty similar, but has some different defaults. Note that ggplot2::geom_smooth() uses this version of LOESS.
  • The span is 0.75 by default.
  • The degree of the polynomial, degree, is 2 by default.
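Both interfaces are simple to call with their defaults; a minimal comparison on the simulated data from above:

```r
# Compare the two built-in LOWESS/LOESS interfaces on the same data.
set.seed(42)
x <- seq(-3, 3, by = 0.1)
y <- dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)

fit1 <- lowess(x, y)       # defaults: f = 2/3, iter = 3; returns a list of (x, y)
fit2 <- loess(y ~ x)       # defaults: span = 0.75, degree = 2; a model object

plot(x, y)
lines(fit1, col = "red")                 # lowess output is plotted directly
lines(x, predict(fit2), col = "blue")    # loess fits come from predict()
```

Note the different output types: lowess() returns the smoothed coordinates, while loess() returns a model object that supports predict().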

Penalized least squares

Recall our original problem: we had a generic model of

$$y_i = f(x_i) + \epsilon_i.$$

Suppose, in analogy to simple linear regression, we estimate $f$ by choosing $\hat{f}$ to minimize the sum of squares

$$\sum_{i=1}^{n} (y_i - f(x_i))^2.$$

In the linear case, $f(x_i)$ would be $\beta_0 + \beta_1 x_i$. Minimizing the sum of squares over all linear functions yields the usual least squares estimator, which is possibly an oversmooth. Minimizing over all functions yields the "connect the dots" estimator of exact interpolation, which is definitely an undersmooth.
Lowess uses the local weighting to balance between these two extremes. A different approach to getting this balance is to minimize a penalized sum of squares

$$\sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda J(f),$$

where $\lambda \ge 0$ is the tuning parameter, and $J(f)$ is the penalty function for roughness. A common choice for $J(f)$ is

$$J(f) = \int \left(f''(x)\right)^2 \, dx,$$

which is the integral of the squared second derivative of $f$; it measures the curvature of $f$ at $x$. $\lambda$ controls the amount of smoothing - the larger it is, the more smoothing is enforced.
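To see why this penalty rewards smoothness, we can approximate $J(f)$ numerically for two candidate functions (the functions and grid here are my own illustration):

```r
# Numerically approximate J(f) = integral of f''(x)^2 via finite differences.
xs <- seq(0, 1, length.out = 1000)
h  <- diff(xs)[1]                 # grid spacing

f_line   <- 2 * xs + 1            # a straight line: zero curvature everywhere
f_wiggly <- sin(10 * xs)          # a wiggly function: large curvature

J <- function(f_vals, h) {
  d2 <- diff(f_vals, differences = 2) / h^2   # finite-difference f''
  sum(d2^2) * h                               # Riemann sum of (f'')^2
}

J(f_line, h)     # essentially 0: a line incurs no roughness penalty
J(f_wiggly, h)   # large: the penalty term heavily discourages this fit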

Cubic spline

We want the "optimal" solution to this penalized problem. Let $t_1 < t_2 < \dots < t_k$ be a set of ordered points, or knots, contained in some interval $[a, b]$. A cubic spline is defined as a continuous function $f$ such that:
  1. $f$ is a cubic polynomial over each subinterval $[t_j, t_{j+1}]$;
  2. $f$ has continuous first and second derivatives at the knots, which ensures smoothness of the curve at the knots.
In general, an $m$th-order spline is a piecewise polynomial of degree $m$ with $m - 1$ continuous derivatives at the knots. The simplest spline has degree 0 and is a step function. A spline of degree 1 is called a linear spline.
A spline that is linear beyond the boundary knots ($t_1$ and $t_k$) is called a natural spline. In mathematical terms, the second derivatives of the spline polynomials are set to 0 at the endpoints of the interval $[a, b]$.
Cubic splines ($m = 3$) are the most commonly used and solve our penalized problem. No discontinuities (or jumps) at the knots means that the pieces connect smoothly, and edge effects beyond $t_1$ and $t_k$ are handled through the linear fit. The function $\hat{f}$ that minimizes the penalized sum of squares with penalty $J(f) = \int (f''(x))^2 \, dx$ is a natural cubic spline with knots at the data points. This estimator is called a smoothing spline.
What does $\hat{f}$ look like? We can construct a basis for the set of cubic splines with knots $t_1, \dots, t_k$:

$$h_1(x) = 1,\quad h_2(x) = x,\quad h_3(x) = x^2,\quad h_4(x) = x^3,\quad h_{j+4}(x) = (x - t_j)_+^3,\ j = 1, \dots, k,$$

where $(u)_+ = u$ if $u > 0$ and $0$ otherwise. The set $\{h_1, \dots, h_{k+4}\}$ is a basis for the set of cubic splines at these knots, called the truncated power basis. This means any cubic spline with these same knots can be written as

$$f(x) = \sum_{j=1}^{k+4} \beta_j h_j(x),$$

where each $h_j$ is known, and we just need to estimate the coefficients $\beta_j$, e.g. by least squares. There are also other sets of basis functions.
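Because the basis reduces spline fitting to ordinary least squares, we can do it directly with lm(); a sketch with arbitrarily chosen knot locations:

```r
# Fit a cubic spline by least squares on the truncated power basis.
set.seed(42)
x <- seq(-3, 3, by = 0.1)
y <- dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)

knots <- c(-1.5, 0, 1.5)               # illustrative knot locations
pos   <- function(u) pmax(u, 0)        # the (u)_+ truncated power operation

# Columns: x, x^2, x^3, and one truncated cubic per knot
# (the intercept supplies the constant basis function).
X <- cbind(x, x^2, x^3,
           sapply(knots, function(t) pos(x - t)^3))

fit   <- lm(y ~ X)     # least squares estimates of the basis coefficients
f_hat <- fitted(fit)   # the fitted cubic spline at the data points
```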
In R, there’s a library splines.
  • smooth.spline() (in base R's stats package) places knots at each data point and fits smoothing splines.
  • ns() is for general natural splines, and we need to specify the number and locations of knots, which is not an easy question to answer!
  • bs() generates a basis function set for splines which we can then use for other things. This also requires specifying the number and locations of knots.


  1. $\lambda$, the tuning parameter, controls the roughness of the fit, so it also somewhat shrinks the basis functions.
  2. We need to be aware of and account for edge effects. The fit isn't going to look as good at the ends.
  3. $\lambda$ controls the bias-variance tradeoff in the spline setting:
      • $\lambda = 0$: the penalty term has no effect, giving exact interpolation: low bias, high variance.
      • $\lambda \to \infty$: a perfectly smooth (linear) fit, a line that passes as close to the data as possible: high bias, low variance.
      • Intermediate values of $\lambda$ balance the two, and we get a reasonably smooth function that passes near the data.
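The tradeoff is easy to see by forcing the smoothing parameter to its extremes; in smooth.spline() the penalty is controlled through the spar argument (a monotone transform of $\lambda$), and the spar values below are illustrative:

```r
# Effect of the smoothing parameter on a smoothing spline.
set.seed(42)
x <- seq(-3, 3, by = 0.1)
y <- dnorm(x, 0, 1) + rnorm(length(x), 0, 0.1)

rough    <- smooth.spline(x, y, spar = 0.2)   # weak penalty: chases the points
too_flat <- smooth.spline(x, y, spar = 1.2)   # heavy penalty: nearly a line
auto     <- smooth.spline(x, y)               # penalty chosen by (generalized) cross-validation

plot(x, y)
lines(rough,    col = "blue")    # undersmoothed
lines(too_flat, col = "red")     # oversmoothed
lines(auto,     col = "green")   # data-driven compromise
```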