Original link http://yamlb.wordpress.com/?p=10
Date 1970-01-01
Status draft

The Kullback-Leibler (KL) divergence is a common measure of the “distance” between two probability distributions. It is central to machine learning algorithms based on probabilities.
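Throughout this post, KL(p, q) is taken with the usual convention and in that argument order:

```latex
\mathrm{KL}(p, q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x
\qquad \text{(a sum over } x \text{ in the discrete case).}
```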

For instance, when trying to approximate a distribution p(x), we can try to minimize KL(p, q) over q belonging to a particular class of distributions (e.g. the exponential family).

This idea is used in many variational methods and approximate message-passing algorithms.
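Here is a minimal sketch of that idea (assuming NumPy and SciPy; the two-component mixture target and the single-Gaussian family for q are illustrative choices, not a particular published algorithm):

```python
# Minimal sketch: fit a Gaussian q to samples from a target p by minimizing
# a Monte Carlo estimate of KL(p, q).
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)

# Samples standing in for the target distribution p (a two-component mixture).
x = np.concatenate([rng.normal(-2.0, 1.0, 5000), rng.normal(3.0, 0.5, 5000)])

# KL(p, q) = E_p[log p(x)] - E_p[log q(x)]; the first term does not depend on q,
# so minimizing KL(p, q) over q is the same as maximizing E_p[log q(x)].
def objective(params):
    mu, log_sigma = params
    return -np.mean(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(objective, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # for a Gaussian q this reduces to matching the mean and std of x
```

For an exponential-family q this objective is maximized by matching the expected sufficient statistics of p, which is why the fit above simply recovers the mean and standard deviation of the samples.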

But why use this divergence (which isn’t a true distance)? Why not the L^2 distance? Or the chi-square divergence?

I found several leads:

  • Information geometry: KL is a special case of the delta-divergences. These divergences have the great advantage of being invariant under reparametrisation.
  • Information theory: KL can be seen as the amount of information (in bits) that q is missing in order to specify p (relative entropy). It is the average extra "surprise" of an incoming message actually drawn from p when you expected it to come from q.
  • Bayesian theory: KL minimisation can be derived from log-likelihood maximisation.

Invariance seems to be the most general requirement because it leads to a whole family of divergences. Exchanging KL for a delta-divergence in our algorithm can give a better understanding of what’s really going on.
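To see the invariance concretely: under a smooth invertible reparametrisation y = f(x) (written here in one dimension), both densities pick up the same Jacobian factor, which cancels inside the logarithm:

```latex
\tilde p(y) = \frac{p(x)}{|f'(x)|},\quad
\tilde q(y) = \frac{q(x)}{|f'(x)|}
\;\Longrightarrow\;
\mathrm{KL}(\tilde p, \tilde q)
= \int \tilde p(y)\,\log\frac{\tilde p(y)}{\tilde q(y)}\,\mathrm{d}y
= \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x
= \mathrm{KL}(p, q).
```

The L^2 distance between densities does not behave this way: the Jacobian factors do not cancel, so its value depends on the parametrisation chosen for x.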

The information-theory justification seems weaker to me because it is framed within a theory of communication and requires a subjective receiver.
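For completeness, the identity behind that reading is just cross-entropy minus entropy (with base-2 logarithms for bits):

```latex
\mathrm{KL}(p, q)
= \underbrace{\mathbb{E}_{p}\!\left[-\log_2 q(x)\right]}_{\text{average surprise / code length under } q}
\;-\; \underbrace{\mathbb{E}_{p}\!\left[-\log_2 p(x)\right]}_{H(p)}
\;=\; H(p, q) - H(p).
```

That is, the expected number of extra bits needed to encode messages actually drawn from p with a code optimized for q.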

Finally, I’m still not sure of my derivation from log-likelihood maximisation, especially in the continuous case.
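For the record, the usual sketch goes like this: for i.i.d. samples x_1, …, x_N drawn from p, the average log-likelihood of a model q converges to its expectation under p,

```latex
\frac{1}{N}\sum_{i=1}^{N}\log q(x_i)
\;\xrightarrow[N\to\infty]{}\;
\mathbb{E}_{p}\!\left[\log q(x)\right]
\;=\; -H(p) \;-\; \mathrm{KL}(p, q).
```

Since H(p) does not depend on q, maximizing the average log-likelihood over q is asymptotically equivalent to minimizing KL(p, q); in the continuous case H(p) is the differential entropy, but it still drops out of the optimisation.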