MultiKDE.jl [link]
A lazily evaluated multivariate kernel density estimation library for Julia.
Kernel Density Estimation
Kernel density estimation, also known as KDE, is a classic algorithm in nonparametric statistics. The objective of KDE is to estimate a distribution from observations without assuming a closed-form distribution. KDE requires a kernel function, which is essentially a probability density function, but in this setting is used as a building block for estimating an unknown distribution. In most cases, it is desirable for a kernel function to have two properties:
- Normalization: \(\int_{-\infty}^{\infty} K(u) du = 1\), i.e., it integrates to one (this will be useful later).
- Symmetry: \(K(u) = K(-u)\), which means an observation's influence is spread symmetrically around it.
Suppose we have \(D=\{x_{1}, x_{2}, \ldots, x_{n}\}\), a set of observations of a random variable \(X\), and we want to estimate the probability distribution of \(X\) using a kernel function \(K\). With KDE, the density at a point \(x\) is estimated as \(\widehat{P}(X=x)=\frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x-x_{i}}{h}\right)\), where \(h\) is a hyperparameter called the bandwidth that controls how strongly \(D\) influences \(\widehat{P}\). The factor \(\frac{1}{nh}\) normalizes the estimate so that \(\int_{-\infty}^{\infty} \widehat{P}(X=u)\,du = \int_{-\infty}^{\infty} K(u)\,du\). It is then easy to see that when \(K\) is normalized, \(\widehat{P}\) is normalized as well.
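To make this concrete, the estimator above is only a few lines of Julia. The sketch below is plain Julia, not MultiKDE.jl's API, and assumes a Gaussian kernel; `gaussian_kernel` and `kde_estimate` are just illustrative names.

```julia
# Minimal sketch of the estimator above, assuming a Gaussian kernel.
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# P̂(X = x) = (1 / nh) * Σᵢ K((x - xᵢ) / h)
function kde_estimate(x, observations, h)
    n = length(observations)
    sum(gaussian_kernel((x - xi) / h) for xi in observations) / (n * h)
end

observations = randn(50)              # 50 draws from N(0, 1)
kde_estimate(0.0, observations, 0.5)  # estimated density at x = 0
```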
Implementation
There are already several nice KDE implementations in Julia, such as KernelDensity.jl and KernelDensityEstimate.jl. Still, when I tried to implement BOHB, I found another implementation necessary: as remarkable as these libraries are, they do not support well several features I needed:
- Lazy Evaluation: We do not want to evaluate the density over its whole domain when initializing it. Instead, we just keep the data without any computation, and evaluate \(\widehat{P}(X=\hat{x})\) at a specific \(\hat{x}\) only when we need it. This matters because in BOHB the observations change frequently as the algorithm evolves, so a full evaluation would be unnecessary and wasteful.
- Multidimensionality: BOHB is a hyperparameter optimization algorithm based on the Tree Parzen Estimator. When sampling hyperparameters, it splits the previously evaluated hyperparameters into two halves with respect to their performance and fits a KDE over each: KDE-good over the better-performing half and KDE-bad over the worse-performing half. Since every observation is a combination of hyperparameters, the KDE must be treated as a multi-dimensional function. Here we use the product kernel, following the implementation in statsmodels, which the original BOHB paper mentions using.
- Ordered and Unordered Categorical Random Variables: In hyperparameter optimization, the input space can be ordered categorical (such as the number of threads to use or the number of layers in a model) or unordered categorical (such as the choice of optimization method). We need to support KDE for these kinds of variables too. Specifically, statsmodels uses two kernel functions specialized for them, named after their authors: Wang-Ryzin and Aitchison-Aitken (both sketched below, together with the product kernel).
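To make these three points concrete, here is a rough plain-Julia sketch. The names (`LazyKDE`, `product_kde`, and so on) are illustrative only, not MultiKDE.jl's actual API; the two categorical kernels follow the definitions used in statsmodels.

```julia
# Lazy evaluation: construction only stores the observations and the
# bandwidth; all computation happens when the estimate is queried.
struct LazyKDE
    observations::Vector{Float64}
    h::Float64
end

gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)

# Querying is O(n) per point; nothing is precomputed on a grid.
(kde::LazyKDE)(x) =
    sum(gaussian_kernel((x - xi) / kde.h) for xi in kde.observations) /
    (length(kde.observations) * kde.h)

kde = LazyKDE(randn(100), 0.3)  # construction stores the data only
kde(0.0)                        # density at 0.0, computed on demand

# Product kernel over d continuous dimensions:
# P̂(x) = (1/n) Σᵢ Πⱼ (1/hⱼ) K((x[j] - xᵢ[j]) / hⱼ)
function product_kde(x, observations, h)
    n = length(observations)
    sum(prod(gaussian_kernel((x[j] - xi[j]) / h[j]) / h[j] for j in eachindex(x))
        for xi in observations) / n
end

# Wang-Ryzin kernel for ordered categorical levels (bandwidth λ ∈ (0, 1);
# xi and x are integer-coded levels).
wang_ryzin(λ, xi, x) = xi == x ? 1 - λ : 0.5 * (1 - λ) * λ^abs(xi - x)

# Aitchison-Aitken kernel for unordered categorical levels, c levels total.
aitchison_aitken(λ, xi, x, c) = xi == x ? 1 - λ : λ / (c - 1)

# In a mixed product kernel, these categorical factors replace the
# continuous (1/h)K((x - xᵢ)/h) factors for their dimensions.
```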
Demo [Code]
Demo 1: KDE visualization over 50 random observations from \( \mathcal{N}(0, 1) \) using a Gaussian kernel, with different bandwidths. A smaller \(h\) makes the curve wigglier and more sensitive to individual observations; a larger \(h\) smooths it out.
Demo 2: The same setting as above, but the 2-dimensional version, with observations from \( \mathcal{N}(\begin{bmatrix} 0\\0 \end{bmatrix}, \mathrm{diag}(2)) \).
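For reference, Demo 1 can be reproduced roughly in plain Julia. This is a sketch of the setup, not the linked demo code; it assumes the Distributions and Plots packages and reuses the `kde_estimate` sketch from earlier.

```julia
using Distributions, Plots, Random

# Gaussian-kernel KDE from the sketch earlier in this post.
gaussian_kernel(u) = exp(-u^2 / 2) / sqrt(2π)
kde_estimate(x, obs, h) =
    sum(gaussian_kernel((x - xi) / h) for xi in obs) / (length(obs) * h)

Random.seed!(1)
observations = rand(Normal(0, 1), 50)   # 50 draws from N(0, 1)
xs = range(-4, 4; length=200)

plt = plot(xs, pdf.(Normal(0, 1), xs); label="true N(0,1)", lw=2)
for h in (0.1, 0.5, 1.0)                # small h → wiggly, large h → smooth
    plot!(plt, xs, [kde_estimate(x, observations, h) for x in xs]; label="h = $h")
end
display(plt)
```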
References
[1] Nonparametric statistics, Kernel, Hyperparameter optimization, Wikipedia.
[2] BOHB: Robust and efficient hyperparameter optimization at scale, Falkner et al. 2018.
[3] Algorithms for hyper-parameter optimization, Bergstra et al. 2011.
[4] Kernel density estimation slides, NCCU.