We want to learn about the distribution of the population from which the data were drawn. More specifically, we want to formally **estimate the shape** of the distribution, i.e. get a "reliable" visual representation such as a histogram.

### Histogram

Subintervals of the histogram are called **bins**. The width of each interval is called the **binwidth**. A small binwidth leads to more bins and shows local details, which may or may not be meaningful. A large binwidth shows a smoother, large-scale picture, but we may lose interesting information. Here are some histograms of a random sample of size 200 generated from *N*(0, 1). We have a tradeoff between competing goals.

R code for generating the figure:

```r
library(ggpubr)
library(patchwork)

set.seed(42)
dat <- data.frame(x = rnorm(200, 0, 1))

p1 <- gghistogram(dat, x = "x", bins = 10, title = "10 Bins")
p2 <- gghistogram(dat, x = "x", bins = 20, title = "20 Bins")
p3 <- gghistogram(dat, x = "x", bins = 50, title = "50 Bins")
p4 <- gghistogram(dat, x = "x", bins = 100, title = "100 Bins")
(p1 + p2) / (p3 + p4)
```

The histogram has some drawbacks:

1. Histograms are not smooth even if the underlying distribution is continuous: binning discretizes the result.

2. Histograms are sensitive to the choice of the class interval (the binwidth).

3. (*related*) Histograms depend on the choice of the endpoints of the intervals.

Drawbacks 2 and 3 are both about the visual appearance.

### Kernel density estimation

A smoother approach that gets around some of the drawbacks of the histogram is called **kernel density estimation**. It gives a **continuous** estimate of the distribution. It also removes the dependence on endpoints, but the choice of bandwidth plays a role analogous to the binwidth (drawback 2).

#### Procedure

For any point $x$ in a **local neighborhood** of each data value $x_i$, fitting is controlled in a way that depends on the **distance** of $x$ from $x_i$. Close-by points are weighted more; as the distance increases, the weight decreases.

The weights are determined by a function called the **kernel**, which has an associated **bandwidth**. The kernel, $K$, determines how the weights are calculated, and the bandwidth, $h$, determines the scaling, i.e. how near or far points can be and still be considered "close" enough to matter.

Let $\hat{f}_h$ denote the **kernel density estimator** of $f$, the PDF of the underlying distribution. $\hat{f}_h$ is defined as

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $n$ is the sample size, $K$ is the kernel, and $x_1, \ldots, x_n$ are the observed data.
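To make the definition concrete, here is a minimal Python sketch (not from the original notes) that evaluates $\hat{f}_h$ directly from the formula above with a Gaussian kernel; the names `gaussian_kernel` and `kde` are illustrative, not a reference implementation.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-u**2 / 2) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Evaluate f_hat_h(x) = (1 / (n * h)) * sum_i K((x - x_i) / h)."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (n * h)

# Tiny example: three observations, bandwidth 1.
data = [-1.0, 0.0, 1.0]
print(kde(0.0, data, 1.0))
```

Because $\hat{f}_h$ is an average of scaled densities, it is itself a density: integrating it over the real line gives 1 for any bandwidth $h > 0$.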

To be called a kernel, $K$ has to satisfy certain properties: $K$ is a smooth function such that

$$K(x) \ge 0, \qquad \int K(x)\,dx = 1, \qquad \int x\,K(x)\,dx = 0, \qquad 0 < \int x^2 K(x)\,dx < \infty.$$

The first two constraints make $K$ a density, and the second set of constraints ensures it has mean 0 and a finite, positive variance.

#### Commonly used kernels

Here we give four commonly used kernels. For more graphical representations, see this Wikipedia page.

| Kernel | Expression |
| --- | --- |
| Boxcar | $K(x) = \frac{1}{2} I(x)$ |
| Gaussian | $K(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ |
| Epanechnikov | $K(x) = \frac{3}{4}\left(1 - x^2\right) I(x)$ |
| Tricube | $K(x) = \frac{70}{81}\left(1 - \lvert x \rvert^3\right)^3 I(x)$ |

The boxcar kernel can be expressed as a uniform distribution. The tricube kernel is narrower than the previous ones. In all the formulae above, the indicator function is given by

$$I(x) = \begin{cases} 1 & \lvert x \rvert \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
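As a quick sanity check on the table (a Python sketch, not part of the notes), the four kernels can be coded directly and verified numerically against the properties above: each should integrate to 1 and have mean 0.

```python
import math

def indicator(x):
    """I(x): 1 if |x| <= 1, else 0."""
    return 1.0 if abs(x) <= 1 else 0.0

def boxcar(x):
    return 0.5 * indicator(x)

def gaussian(x):
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

def epanechnikov(x):
    return 0.75 * (1 - x**2) * indicator(x)

def tricube(x):
    return (70 / 81) * (1 - abs(x)**3)**3 * indicator(x)

# Each kernel should integrate to 1; check with a simple Riemann sum on [-5, 5].
dx = 0.001
grid = [i * dx for i in range(-5000, 5001)]
for kernel in (boxcar, gaussian, epanechnikov, tricube):
    total = sum(kernel(x) for x in grid) * dx
    print(kernel.__name__, round(total, 2))
```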

A visualization of the four kernels is given below:

```r
library(tidyr)
library(ggplot2)
library(ggpubr)  # needed for ggline()

indicator <- function(x) {
  ifelse(abs(x) <= 1, 1, 0)
}
gaussian.kernel <- function(x) {
  exp(-x^2 / 2) / sqrt(2 * pi)
}
epanechnikov.kernel <- function(x) {
  0.75 * (1 - x^2) * indicator(x)
}
tricube.kernel <- function(x) {
  70 * (1 - abs(x)^3)^3 * indicator(x) / 81
}

x <- seq(-1.5, 1.5, 0.01)
dat <- data.frame(
  x = x,
  Boxcar = 0.5 * indicator(x),
  Gaussian = gaussian.kernel(x),
  Epanechnikov = epanechnikov.kernel(x),
  Tricube = tricube.kernel(x)
) %>%
  gather(key = "Kernel", value = "Density", -x)

ggline(dat, x = "x", y = "Density", color = "Kernel",
       palette = "jco", plot_type = "l", numeric.x.axis = TRUE)
```

#### Bandwidth selection

In practice, the choice of the kernel isn't that important, but the choice of $h$ is **crucial**: it controls the **smoothness** of the kernel density estimator. When $h$ is small, $\hat{f}_h$ is more "wiggly" and shows local features. When $h$ is large, $\hat{f}_h$ is smoother, just as with the binwidth in histograms.

R code for the visual:

```r
library(ggplot2)

set.seed(42)
x <- seq(-3, 3, 0.01)
true.density <- 1 / sqrt(2 * pi) * exp(-x^2 / 2)
random.x <- rnorm(length(x), 0, 1)
dat <- data.frame(random.x = random.x)

ggplot(dat, aes(random.x)) +
  geom_density(bw = 0.02, color = "#CD534C") +
  geom_density(bw = 0.2, color = "#0073C2") +
  geom_density(bw = 2, color = "#EFC000") +
  geom_line(aes(x = x, y = true.density), color = "#404040", size = 1) +
  xlab("x") +
  ylab("Density") +
  theme_minimal()
```

We have a random sample generated from *N*(0, 1), whose true density is shown by the black line. The red, blue, and yellow lines are kernel density estimates with bandwidths 0.02, 0.2, and 2, respectively. A bandwidth of 0.2 gives a relatively close approximation of the true density, while the other two choices are undersmoothed and oversmoothed, respectively.

Many criteria and methods have been proposed to choose the value of $h$:

- Define a criterion for a "successful" smooth. Try a range of $h$ values and see which is best according to that criterion.

- Use **cross-validation** to pick $h$. Split the data set into pieces, fit on each piece, use the fits to get predictions, and pick the $h$ that does well across the cross-validation folds.

- A general rule of thumb takes

  $$h = 0.9 \min\!\left\{s, \; \frac{Q}{1.34}\right\} n^{-1/5},$$

  where $n$ is the sample size, $s$ is the sample standard deviation, and $Q$ is the interquartile range.
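The rule of thumb is easy to compute directly. Below is a small Python sketch (illustrative; the function name `rule_of_thumb_bandwidth` is ours) implementing the Silverman-style rule $h = 0.9 \min\{s, Q/1.34\}\, n^{-1/5}$, which is also the default bandwidth rule (`bw.nrd0`) used by R's `density()`.

```python
import statistics

def rule_of_thumb_bandwidth(data):
    """Rule of thumb: h = 0.9 * min(s, Q / 1.34) * n^(-1/5)."""
    n = len(data)
    s = statistics.stdev(data)
    q1, _, q3 = statistics.quantiles(data, n=4)  # sample quartiles
    iqr = q3 - q1
    return 0.9 * min(s, iqr / 1.34) * n ** (-1 / 5)

# Example on evenly spaced "data": 0.0, 0.1, ..., 9.9
data = [i / 10 for i in range(100)]
print(round(rule_of_thumb_bandwidth(data), 4))
```

Taking the minimum of $s$ and $Q/1.34$ makes the rule robust: for heavy-tailed or outlier-contaminated samples the IQR-based scale estimate is smaller than $s$, preventing an inflated bandwidth.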

According to Wasserman in his book *All of Nonparametric Statistics*, the main challenge in smoothing is picking the right amount. When we over-smooth ($h$ is too large), the result is **biased**, but the **variance is small**. When we under-smooth, the bias is small but the variance is large. This is called the **bias-variance tradeoff**.

We actually want to minimize the **squared error risk** (more commonly known as the mean squared error), which is bias² + variance, so that we strike a balance between the two.
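The tradeoff shows up clearly in a small simulation (a Python sketch, not from the notes): repeatedly estimate the *N*(0, 1) density at $x = 0$ with an undersmoothed and an oversmoothed bandwidth, then compare the empirical bias and variance of the two estimators.

```python
import math
import random

def kde_at(x, data, h):
    """Gaussian-kernel density estimate evaluated at a single point x."""
    n = len(data)
    return sum(
        math.exp(-((x - xi) / h) ** 2 / 2) / math.sqrt(2 * math.pi)
        for xi in data
    ) / (n * h)

random.seed(42)
true_f0 = 1 / math.sqrt(2 * math.pi)  # true N(0, 1) density at x = 0

# Record f_hat(0) over repeated samples for a small and a large bandwidth.
small, large = [], []
for _ in range(200):
    data = [random.gauss(0, 1) for _ in range(200)]
    small.append(kde_at(0.0, data, h=0.02))
    large.append(kde_at(0.0, data, h=2.0))

for name, est in [("h = 0.02", small), ("h = 2.00", large)]:
    mean = sum(est) / len(est)
    var = sum((e - mean) ** 2 for e in est) / len(est)
    print(f"{name}: bias ~ {mean - true_f0:+.4f}, variance ~ {var:.6f}")
```

The undersmoothed estimator (h = 0.02) is nearly unbiased but highly variable across samples; the oversmoothed one (h = 2) is very stable but systematically underestimates the peak, exactly the tradeoff described above.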