22 June 2021

MultiKDE.jl: A Lazy Evaluation Multivariate Kernel Density Estimator

by Ping


A lazy-evaluated multivariate kernel density estimation library for Julia.


Kernel Density Estimation

Kernel density estimation, also known as KDE, is a classic algorithm from nonparametric statistics. The objective of KDE is to estimate a distribution from a set of observations without assuming a closed form. KDE requires a kernel function, which is essentially a probability density function, here used as a building block for estimating an unknown distribution. In most cases, it is desirable for the kernel function to have two properties:

  1. Normalization: \(\int_{-\infty}^{\infty} K(u) \, du = 1\), i.e. it integrates to one (we will use this below).
  2. Symmetry: \(K(u) = K(-u)\), which means the influence of an observation is symmetric around it.

Imagine we have \(D=\{x_{1}, x_{2}, \ldots, x_{n}\}\), a set of observations of a random variable \(X\). We want to estimate the probability distribution of \(X\) using a kernel function \(K\). With KDE, the density at a point \(x\) is estimated as \(\widehat{P}(X=x)=\frac{1}{nh}\sum_{i=1}^{n} K(\frac{x-x_{i}}{h})\), where \(h\) is a hyperparameter called the bandwidth that controls how strongly \(D\) influences \(\widehat{P}\). The factor \(\frac{1}{nh}\) is a normalization term: substituting \(u = \frac{x-x_{i}}{h}\), each summand integrates to \(h\int_{-\infty}^{\infty} K(u)\,du\), so the \(n\) terms together integrate to \(nh\) times that, and dividing by \(nh\) gives \(\int_{-\infty}^{\infty} \widehat{P}(X=u)\,du = \int_{-\infty}^{\infty} K(u)\,du\). Hence when \(K\) is normalized, \(\widehat{P}\) is normalized as well!
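To make the formula concrete, here is a minimal Julia sketch of a univariate Gaussian-kernel estimator. The names `gaussian_kernel` and `kde_pdf` are illustrative only, not MultiKDE.jl's API:

```julia
# Gaussian kernel: normalized (integrates to one) and symmetric.
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# KDE estimate of the density at x, given observations xs and bandwidth h:
# (1 / (n * h)) * sum_i K((x - x_i) / h)
kde_pdf(x, xs, h) = sum(gaussian_kernel((x - xi) / h) for xi in xs) / (length(xs) * h)

# Example: estimate the density at 0 from 50 draws of N(0, 1).
xs = randn(50)
kde_pdf(0.0, xs, 0.5)
```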


Implementation

There are already several nice KDE implementations in Julia, such as KernelDensity.jl and KernelDensityEstimate.jl. Still, while trying to implement BOHB I felt it was necessary to write another one, because, remarkable as they are, they don't support several features I needed:

  1. Lazy Evaluation: We don't want to evaluate the whole density when the estimator is initialized. Instead, initialization should just store the data without any computation, and \(\widehat{P}(X=\hat{x})\) should be evaluated at a specific \(\hat{x}\) only when we need it. In BOHB the observations change frequently as the algorithm evolves, so evaluating the entire density up front would be unnecessary and time-wasting.
  2. Multidimension: BOHB is a hyperparameter optimization algorithm that uses the Tree-structured Parzen Estimator. When sampling hyperparameters, it splits the already-evaluated configurations in half with respect to their performance, then fits two KDEs over them: KDE-good, over the better-performing half, and KDE-bad, over the worse-performing half. Since every observation is a combination of hyperparameters, the KDE has to be treated as a multi-dimensional function. Here we use a product kernel, following the implementation in statsmodels, which is the library the original BOHB paper mentions using (a sketch appears after this list).
  3. Ordered and Unordered Categorical Random Variables: In hyperparameter optimization, parts of the input space can be ordered categorical (like the number of threads to use, or the number of layers in a model) or unordered categorical (like which optimization method to choose), and we need to support KDE over these kinds of variables. Specifically, statsmodels uses two kernel functions specialized for them, named after their authors: Wang-Ryzin for the ordered case and Aitchison-Aitken for the unordered case.
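Here is a minimal sketch of how these three requirements could fit together: a struct that only stores the observations and per-dimension kernels at construction time, and evaluates the product kernel at a point on demand. The kernel formulas follow the statsmodels conventions for Wang-Ryzin and Aitchison-Aitken; the type and function names are illustrative, not MultiKDE.jl's actual API.

```julia
using Statistics

# Per-dimension kernels, following the statsmodels conventions.
# Continuous dimension: scaled Gaussian kernel with bandwidth h.
gaussian_dim(h) = (x, xi) -> exp(-((x - xi) / h)^2 / 2) / (h * sqrt(2π))

# Ordered categorical dimension: Wang-Ryzin kernel with bandwidth λ ∈ (0, 1).
wang_ryzin(λ) = (x, xi) -> x == xi ? 1 - λ : (1 - λ) / 2 * λ^abs(x - xi)

# Unordered categorical dimension: Aitchison-Aitken kernel with c categories.
aitchison_aitken(λ, c) = (x, xi) -> x == xi ? 1 - λ : λ / (c - 1)

# Lazy KDE: construction only stores the observations and kernels;
# nothing is evaluated until `pdf` is called at a concrete point.
struct LazyKDE
    observations::Vector{Vector{Float64}}  # each observation is one value per dimension
    kernels::Vector{Function}              # one kernel per dimension
end

# Product-kernel estimate at point x: multiply the per-dimension kernel values
# for every observation, then average over the observations.
pdf(kde::LazyKDE, x::Vector) =
    mean(prod(kde.kernels[d](x[d], o[d]) for d in eachindex(x)) for o in kde.observations)

# Example: one continuous dimension and one unordered categorical dimension
# with 3 categories.
kde = LazyKDE([[0.1, 1.0], [0.7, 2.0], [-0.3, 1.0]],
              [gaussian_dim(0.5), aitchison_aitken(0.2, 3)])
pdf(kde, [0.0, 1.0])
```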

DEMOs [Code]


Demo 1: KDE visualization over 50 random observations from \( \mathcal{N}(0, 1) \) using a Gaussian kernel, with different bandwidths. A smaller \(h\) makes the curve fluctuate more and be more sensitive to individual observations, and vice versa.
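The bandwidth effect can also be reproduced numerically with the `kde_pdf` sketch from earlier (again illustrative names, not the linked demo code): a smaller \(h\) concentrates mass around individual observations, which shows up as sharper peaks in the estimated curve.

```julia
xs = randn(50)                       # 50 observations from N(0, 1)
grid = range(-3, 3, length=100)      # points at which to evaluate the estimate
for h in (0.1, 0.5, 1.0)
    density = [kde_pdf(x, xs, h) for x in grid]
    println("h = $h, max density = $(maximum(density))")
end
```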


Demo 2: Same setting as above, but a 2-dimensional version with observations from \( \mathcal N(\begin{bmatrix} 0\\0 \end{bmatrix}, \mathrm{diag}(2)) \).


Reference(s):

[1] Nonparametric statistics, Kernel, Hyperparameter optimization, Wikipedia.
[2] BOHB: Robust and Efficient Hyperparameter Optimization at Scale, Falkner et al. 2018.
[3] Algorithms for Hyper-Parameter Optimization, Bergstra et al. 2011.
[4] Kernel density estimation slides, NCCU.