MultiKDE.jl: A Lazy Evaluation Multivariate Kernel Density Estimator


MultiKDE.jl

I am excited to announce the release of MultiKDE.jl, a Julia library for lazy-evaluated multivariate kernel density estimation.


Kernel Density Estimation

Kernel density estimation (KDE) is a classic technique in nonparametric statistics. The objective of KDE is to estimate a distribution from a set of observations without assuming a closed form. KDE requires a kernel function, which is essentially a probability density (or mass) function, used here to estimate the unknown distribution. In most cases, it is desirable for the kernel function to have two properties:

  1. Normalization: $\int_{-\infty}^{\infty} K(u) \, du = 1$, i.e. the kernel integrates to one (this matters for the normalization of the estimate, as discussed below).
  2. Symmetry: $K(u) = K(-u)$, which means that the influence of an observation is symmetric on either side of it.

Imagine we have $D=\{x_{1}, x_{2}, \dots, x_{n}\}$, a set of observations of a random variable $X$. We want to estimate the probability distribution of $X$ using a kernel function $K$. Using KDE, the density at a point $x$ is estimated as $\widehat{P}(X=x)=\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-x_{i}}{h}\right)$, where $h$ is a hyperparameter called the bandwidth that controls how strongly $D$ influences $\widehat{P}$, and $\frac{1}{nh}$ is a normalization term ensuring $\int_{-\infty}^{\infty} \widehat{P}(X=u)\,du = \int_{-\infty}^{\infty} K(u)\,du$. It is easy to verify that when $K$ is normalized, $\widehat{P}$ is also normalized.
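To make this formula concrete, here is a minimal from-scratch sketch in Julia. The function names (`kde_estimate`, `gaussian_kernel`) are chosen for illustration and are not MultiKDE.jl's API.

```julia
# Minimal univariate KDE sketch following the formula above.
# Illustrative names only; this is not MultiKDE.jl's API.

# A normalized, symmetric kernel: the standard Gaussian density.
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# P̂(X = x) = (1 / (n * h)) * Σᵢ K((x - xᵢ) / h)
function kde_estimate(observations, x, h; kernel=gaussian_kernel)
    n = length(observations)
    sum(kernel((x - xi) / h) for xi in observations) / (n * h)
end

# Example: density estimate at 0.3 from a handful of observations, bandwidth h = 0.5.
obs = [-0.5, 0.1, 0.2, 0.9, 1.4]
kde_estimate(obs, 0.3, 0.5)
```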

Implementation

Although there are already several nice KDE implementations in Julia, such as KernelDensity.jl and KernelDensityEstimate.jl, I felt it was necessary to develop another one while I was trying to implement BOHB. Remarkable as they are, they do not support several features I needed very well:

  1. Lazy Evaluation: We don't want to evaluate the whole density when initializing the estimator. Instead, at initialization we just want to keep the data on hand without any calculation, and evaluate $\widehat{P}(X=\hat{x})$ at a specific $\hat{x}$ only when we need it. This is because in BOHB the observations change frequently as the algorithm evolves, so performing a complete evaluation up front would be unnecessary and time-consuming.
  2. Multidimension: BOHB is a hyperparameter optimization algorithm that uses the Tree-structured Parzen Estimator. When sampling hyperparameters, it splits the already-evaluated hyperparameter configurations in half with respect to their performance and fits two KDEs over them: KDE-good, which uses the better-performing half, and KDE-bad, which uses the worse-performing half. Since every observation is a combination of hyperparameters, we need to treat the KDE as a multi-dimensional function. Here we use a product kernel, following the implementation in statsmodels; the original BOHB paper mentions using this library.
  3. Ordered and Unordered Categorical Random Variables: In hyperparameter optimization, part of the input space can be ordered categorical (like the number of threads to use, or the number of layers in a model) or unordered categorical (like the choice of optimization method). We need to support KDE over these kinds of variables. Specifically, statsmodels uses two kernel functions specialized for them, named after the last names of their authors: Wang-Ryzin and Aitchison-Aitken. The sketch after this list illustrates all three points.
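Below is a simplified sketch of what these three requirements look like together, written as plain Julia rather than MultiKDE.jl's actual API: the `LazyKDE` struct and the kernel helpers are invented names for illustration, and the Wang-Ryzin and Aitchison-Aitken forms follow the statsmodels documentation.

```julia
# Illustrative sketch only; these names are not MultiKDE.jl's API.

# Continuous kernel: Gaussian with bandwidth h.
gaussian_kernel(x, xi, h) = exp(-((x - xi) / h)^2 / 2) / (h * sqrt(2π))

# Wang-Ryzin kernel for ordered categorical variables (λ ∈ (0, 1)):
# weight 1 - λ on the observed level, geometrically decaying with distance otherwise.
wang_ryzin(x, xi, λ) = x == xi ? 1 - λ : 0.5 * (1 - λ) * λ^abs(x - xi)

# Aitchison-Aitken kernel for unordered categorical variables with c levels.
aitchison_aitken(x, xi, λ, c) = x == xi ? 1 - λ : λ / (c - 1)

# Lazy evaluation: construction only stores the observations and per-dimension kernels;
# nothing is computed until `estimate` is called at a concrete point.
struct LazyKDE
    observations::Vector{Vector{Float64}}   # each observation is one hyperparameter configuration
    kernels::Vector{Function}               # kernels[d](x, xi) for dimension d
end

# Product kernel: multiply the per-dimension kernel values, then average over observations.
function estimate(kde::LazyKDE, x::Vector{Float64})
    total = 0.0
    for obs in kde.observations
        total += prod(kde.kernels[d](x[d], obs[d]) for d in eachindex(x))
    end
    total / length(kde.observations)
end

# Example: one continuous dimension (bandwidth 0.5) and one unordered categorical with 3 levels.
kde = LazyKDE([[0.1, 1.0], [0.4, 2.0], [0.2, 1.0]],
              [(x, xi) -> gaussian_kernel(x, xi, 0.5),
               (x, xi) -> aitchison_aitken(x, xi, 0.2, 3)])
estimate(kde, [0.3, 1.0])
```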

DEMOs [Code]


Demo 1: KDE visualization of 50 random observations from a standard normal distribution $\mathcal{N}(0, 1)$ using a Gaussian kernel, with different bandwidths. A smaller $h$ makes the curve fluctuate more and become more sensitive to individual observations, and vice versa.
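Roughly, a demo like this can be reproduced in a few lines. The sketch below uses the from-scratch formula from earlier rather than the library's own API, and Distributions.jl only for sampling.

```julia
using Distributions, Random

# Gaussian-kernel KDE evaluated at a single point (same formula as in the section above).
kde(obs, x, h) = sum(exp(-((x - xi) / h)^2 / 2) / sqrt(2π) for xi in obs) / (length(obs) * h)

Random.seed!(42)
observations = rand(Normal(0, 1), 50)          # 50 draws from N(0, 1)
xs = range(-3, 3, length=200)

# One density curve per bandwidth; smaller h follows individual observations more closely.
curves = Dict(h => [kde(observations, x, h) for x in xs] for h in [0.1, 0.5, 1.0])
```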


Demo 2: The same setting as above, but for the 2-dimensional case $\mathcal{N}\left(\begin{bmatrix} 0\\0 \end{bmatrix}, \mathrm{diag}(2)\right)$.
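The 2-dimensional case only changes the sampling and turns the kernel into a product over dimensions. The sketch below reads $\mathrm{diag}(2)$ as an identity covariance of size 2, which is an assumption on my part.

```julia
using Distributions, Random

# Product-kernel KDE at a point x for d-dimensional observations, Gaussian kernel per dimension.
gauss(u) = exp(-u^2 / 2) / sqrt(2π)
kde2d(obs, x, h) = sum(prod(gauss((x[d] - o[d]) / h) / h for d in eachindex(x)) for o in obs) / length(obs)

Random.seed!(42)
observations = [rand(MvNormal([0.0, 0.0], [1.0 0.0; 0.0 1.0])) for _ in 1:50]  # covariance assumed to be identity

# Evaluate the estimate on a coarse 2-D grid for one bandwidth.
h = 0.5
grid = [[x, y] for x in -2:0.5:2, y in -2:0.5:2]
density = [kde2d(observations, p, h) for p in grid]
```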


References:

[1] Nonparametric statistics, Kernel, Hyperparameter optimization, Wikipedia.
[2] BOHB: Robust and Efficient Hyperparameter Optimization at Scale, Falkner et al. 2018.
[3] Algorithms for Hyper-Parameter Optimization, Bergstra et al. 2011.
[4] Kernel density estimation slides, NCCU.