Density Estimation

Density Estimation

May 17, 2019
Wanna know more about histograms and density plots?
We want to learn about the distribution of the population from which the data were drawn. More specifically, we want to formally estimate the shape of the distribution, i.e. get a “reliable” visual representation such as a histogram.


Subintervals of the histogram are called bins. The width of the interval is called binwidth. Small binwidth leads to more bins and shows local details, which may or may not be meaningful. Large binwidth shows a smoother, large-scale picture, but we may lose interesting information. Here are some histograms of a random sample of size 200 generated from N(0, 1)1. We have a tradeoff between competing goals.
R code for generating the figure.
library(ggpubr) library(patchwork) set.seed(42) dat <- data.frame(x = rnorm(200, 0, 1)) p1 <- gghistogram(dat, x = "x", bins = 10, title = "10 Bins") p2 <- gghistogram(dat, x = "x", bins = 20, title = "20 Bins") p3 <- gghistogram(dat, x = "x", bins = 50, title = "50 Bins") p4 <- gghistogram(dat, x = "x", bins = 100, title = "100 Bins") (p1 + p2) / (p3 + p4)
A shoutout for the R package patchwork!
Histograms of the same sample with different numbers of bins.
Histograms of the same sample with different numbers of bins.
The histogram has some drawbacks:
  1. Histograms are not smooth even if the underlying distribution is continuous - binning discretizes the result.
  1. Histograms are sensitive to the choice of the class interval.
  1. (related) Histograms depend on the choice of endpoints of the intervals. Both 2 and 3 are about the visual appearance.

Kernel density estimation

A smoother approach which gets around some of the drawbacks of the histogram is called kernel density estimation. It gives a continuous estimate of the distribution. It also removes dependence on endpoints, but the choice of binwidth (drawback 2) has an analogous issue here.


For any in a local neighborhood of each data value , fitting is controlled in a way that depends on the distance of from . Close-by points are weighted more. As the distance increases, the weight decreases.
The weights are determined by a function called the kernel, which has an associated bandwidth. The kernel, , determines how weights are calculated, and the bandwidth, or , determines scaling, or how near/far points are considered “close” enough to matter.
Let denote the kernel density estimator of , the PDF of the underlying distribution. is defined as
where is the sample size, is the kernel, and is the observed data.
To be called a kernel, has to satisfy certain properties: is a smooth function such that
The first two constraints make it a density, and the second set of constraints ensures it has mean 0 and has a variance.

Commonly used kernels

Here we give four commonly used kernels. For more graphical representations, see this Wikipedia page.
The boxcar kernel can be expressed as a uniform distribution. The tricube kernel is narrower than the previous ones. In all the formulae above, the indicator function is given by
A visualization of the four kernels is given below.
library(tidyr) library(ggplot2) indicator <- function(x) { ifelse(abs(x) <= 1, 1, 0) } gaussian.kernel <- function(x) { exp(-x^2/2) / sqrt(2 * pi) } epanechnikov.kernel <- function(x) { 0.75 * (1 - x^2) * indicator(x) } tricube.kernel <- function(x) { 70 * (1 - abs(x)^3)^3 * indicator(x) / 81 } x <- seq(-1.5, 1.5, 0.01) dat <- data.frame( x = x, Boxcar = 0.5 * indicator(x), Gaussian = gaussian.kernel(x), Epanechnikov = epanechnikov.kernel(x), Tricube = tricube.kernel(x) ) %>% gather(key = "Kernel", value = "Density", -x) ggline(dat, x = "x", y = "Density", color = "Kernel", palette = "jco", plot_type = "l", numeric.x.axis = T)
Four commonly used kernel functions.
Four commonly used kernel functions.

Bandwidth selection

In practice, the choice of the kernel isn’t that important, but the choice of is crucial: it controls the smoothness of the kernel density estimator. When is small, is more “wiggly” and shows local features. When is large, is “smoother”, same as the binwidth in histograms.
R code for the visual:
library(ggplot2) set.seed(42) x <- seq(-3, 3, 0.01) true.density <- 1/sqrt(2 * pi) * exp(-x^2 / 2) random.x <- rnorm(length(x), 0, 1) dat <- data.frame(random.x = random.x) ggplot(dat, aes(random.x))+ geom_density(bw = 0.02, color = "#CD534C")+ geom_density(bw = 0.2, color = "#0073C2")+ geom_density(bw = 2, color = "#EFC000")+ geom_line(aes(x = x, y = true.density), color = "#404040", size = 1)+ xlab("x")+ ylab("Density")+ theme_minimal()
Kernel density estimation of  with sample size 200.
Kernel density estimation of with sample size 200.
We have a random sample of size 200 generated from shown by the black line. The red, blue and yellow lines are kernel density estimations with bandwidth 0.02, 0.2 and 2, respectively. We can see a bandwidth of 0.2 gives a relatively close approximation of the true density, and the other two choices are either undersmoothed or oversmoothed.
Many criteria and methods have been proposed to choose the value for :
  1. Define a criterion for a “successful” smooth. Try a range of values and see which is the best according to that criterion.
  1. Use cross-validation to pick . Split the data set into pieces, and try to fit on each piece. Use this to get predictions, and pick the that does well over the cross-validation.
  1. A general rule of thumb takes where , is the sample size, is the sample standard deviation, and is the interquartile range .
According to Wassesman in his book “All of Nonparametric Statistics”, the main challenge in smoothing is picking the right amount. When we over-smooth ( is too large), the result is biased, but the variance is small. When we under-smooth, the bias is small but the variance is large. This is called the bias-variance tradeoff.
We want to actually minimize the squared error risk (more commonly known as the mean square error), which is bias + variance, so that we get a balance between the two aspects.