KL divergence
| Original link | http://yamlb.wordpress.com/?p=10 |
| Date | 1970-01-01 |
| Status | draft |
The Kullback-Leibler (KL) divergence is a common measure of the “distance” between two probability distributions. It is central to machine learning algorithms based on probabilities.
For instance, when trying to approximate a distribution p(x), we can minimise KL(p, q) over distributions q belonging to a particular class (e.g. the exponential family).
This is used in many variational methods and approximate message passing algorithms.
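For reference, the standard definition of KL(p, q) (for densities with q(x) > 0 wherever p(x) > 0; in the discrete case the integral becomes a sum) is:

```latex
\mathrm{KL}(p, q)
  = \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
  = \mathbb{E}_{x \sim p}\!\left[ \log \frac{p(x)}{q(x)} \right]
```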
But why use this divergence, which isn’t a true distance (it is not symmetric and does not satisfy the triangle inequality)? Why not the L^2 distance, or chi-square?
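As a concrete illustration (my own toy numbers, nothing more), here is a small Python sketch that computes the three quantities on a pair of discrete distributions, and also shows the asymmetry that keeps KL from being a true distance:

```python
import numpy as np

# Two toy discrete distributions over three outcomes (illustrative values only).
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl = np.sum(p * np.log(p / q))        # KL(p, q), in nats
l2 = np.sqrt(np.sum((p - q) ** 2))    # Euclidean (L^2) distance
chi2 = np.sum((p - q) ** 2 / q)       # Pearson chi-square divergence

print(f"KL(p, q)   = {kl:.4f}")
print(f"L2(p, q)   = {l2:.4f}")
print(f"chi2(p, q) = {chi2:.4f}")

# Asymmetry: KL(q, p) differs from KL(p, q), so KL is a divergence,
# not a distance.
print(f"KL(q, p)   = {np.sum(q * np.log(q / p)):.4f}")
```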
I found several leads:
- Information geometry: KL is a special case of alpha-divergences. These divergences have the great advantage of being invariant under reparametrisation.
- Information theory: KL(p, q) can be read as the expected number of extra bits needed to describe samples from p using a code built for q (relative entropy, i.e. cross-entropy minus entropy). It is the average extra "surprise" of an incoming message actually drawn from p when you expected it to come from q.
- Bayesian theory: KL minimisation can be derived from log-likelihood maximisation (see the sketch after this list).
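Here is a sketch of that last derivation as I understand it, assuming a parametric family q_theta and i.i.d. samples x_1, ..., x_n drawn from p:

```latex
\mathrm{KL}(p, q_\theta)
  = \underbrace{\mathbb{E}_{x \sim p}[\log p(x)]}_{\text{independent of } \theta}
  \;-\; \mathbb{E}_{x \sim p}[\log q_\theta(x)]

\arg\min_\theta \, \mathrm{KL}(p, q_\theta)
  = \arg\max_\theta \, \mathbb{E}_{x \sim p}[\log q_\theta(x)]
  \approx \arg\max_\theta \, \frac{1}{n} \sum_{i=1}^{n} \log q_\theta(x_i)
```

The second line follows because the first expectation does not depend on theta, and the only approximation is replacing the expectation under p by the empirical average over the samples (this step is the same in the discrete and continuous cases). The right-hand side is exactly maximum likelihood, and the first identity is the "cross-entropy minus entropy" decomposition behind the bit-counting reading above.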
Invariance seems to be the most general requirement because it leads to a whole family of divergences. Exchanging KL for another alpha-divergence in an algorithm can give a better understanding of what is really going on.
The information-theoretic justification seems weaker to me because it lives inside a theory of communication and requires a subjective receiver.
Finally, I’m still not sure of my derivation from log-likelihood maximisation, especially in the continuous case.