Donsker-Varadhan Lower Bound
Recently I have been obsessed with the concept of mutual information; in layman's terms,
it signifies the degree of dependency
between two variables. I came across an interesting paper
relevant to this concept. More specifically, it shows that any distribution-free high-confidence lower
bound on mutual information cannot be larger than \(O(\ln N)\), where \(N\) is
the size of the data sample. This blog will focus only on the computation of the lower bound
that the paper builds on.
Terminology
Before we start to discuss the Donsker-Varadhan lower bound, let us define some terminology.
First, the Kullback–Leibler (KL) divergence measures how one probability
distribution differs from another. It is defined as
\[D_{KL}(P || Q) := \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\]
Some of the properties of KL divergence are discussed below
- \(D_{KL}(P||Q)\) equals zero if and only if \(p(x) = q(x)\) for all \(x\)
- The range of KL divergence is \([0, \infty]\); it cannot be negative
- \(D_{KL}(P||Q)\) cannot be considered a metric, as it is not symmetric:
\[D_{KL}(P||Q) \neq D_{KL}(Q||P)\]
- For KL divergence to be finite, the support of \(p(x)\) must be contained within the
support (domain) of \(q(x)\).
If \(q(x) = 0\) anywhere that \(p(x) > 0\), then \(D_{KL}(P||Q) = \infty\)
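To make these properties concrete, here is a minimal sketch (assuming NumPy; the three-state distributions `p` and `q` are made-up examples, not from the paper) that evaluates the definition above and reproduces the non-negativity, asymmetry, and infinite-divergence cases.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    if np.any(q[mask] == 0):          # p(x) > 0 where q(x) = 0
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

print(kl_divergence(p, q))  # ~0.516, strictly positive
print(kl_divergence(q, p))  # ~0.377, a different value: not symmetric
print(kl_divergence(p, p))  # 0.0 when the two distributions coincide
print(kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf: q = 0 where p > 0
```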
Mutual information can also be defined as the KL divergence between the two distributions \(p(x, y)\) and
\(p(x) p(y)\). If the two variables are independent of each other, then \(p(x, y) = p(x)p(y)\) and \(D_{KL} = 0\);
if they are not, the divergence between these distributions is a measure of their dependency.
\[I(X; Y) := \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\]
In the subsequent sections we will use the notation \(P_{XY} = p(x, y),\; P_X = p(x),\; P_Y = p(y)\).
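As a quick sanity check of this identity, the short sketch below (again assuming NumPy; the 2×2 joint table is a made-up example) computes \(I(X; Y)\) directly as the KL divergence between the joint and the product of its marginals.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])        # hypothetical joint distribution p(x, y)
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)

# I(X; Y) = D_KL(P_XY || P_X P_Y), summed over all (x, y) pairs
mi = float(np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y))))
print(f"I(X; Y) = {mi:.4f} nats")      # ~0.126; it would be 0 if p(x, y) = p(x)p(y)
```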
Proof
Without further ado, we jump right into the proof of the DV bound. We start by assuming
three distributions \(P, \; Q, \; G\) defined on the same support. The analysis also assumes
discrete distributions.
\[\begin{align*} D_{KL}(P || Q) &= \mathop{\mathbb{E}}_{z \sim P} \ln \frac{p(z)}{q(z)}
\\[1em] &= \mathop{\mathbb{E}}_{z \sim P} \ln \left(\frac{g(z)}{q(z)}\frac{p(z)}{g(z)}\right)
\\[1em] &= \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)} + D_{KL}(P||G)
\\[1em] &\geq \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)} \end{align*}\]
Equality in the last step is achieved only when \(p(z) = g(z)\), since only then does \(D_{KL}(P||G)\) vanish; therefore
\[D_{KL}(P||Q) = \sup_G \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)}\]
\(D_{KL}(P||Q)\) is the supremum over all \(G\), attained when the equality condition is satisfied. We can
parameterize \(G(z)\) so that the expectation can be computed. However, our ultimate aim is to compute
\(D_{KL}(P_{XY} || P_X P_Y)\), where we can only access the distributions via sampling.
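Before moving to the sampled setting, a toy numeric illustration of the supremum formula may help (assuming NumPy; the three-state distributions and the mixture family for \(G\) are made up for illustration): for any candidate \(G\), the inner expectation stays below \(D_{KL}(P||Q)\) and reaches it exactly when \(g = p\).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl_pq = np.sum(p * np.log(p / q))          # the true D_KL(P || Q)

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    g = (1 - alpha) * q + alpha * p        # a family of candidate G's from q to p
    bound = np.sum(p * np.log(g / q))      # E_{z ~ P}[ln g(z) / q(z)]
    print(f"alpha = {alpha:.2f}  bound = {bound:.4f}  D_KL = {kl_pq:.4f}")
# The bound increases toward D_KL(P || Q) and is attained at alpha = 1, i.e. g = p.
```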
For our mutual-information problem, \(P = p(x, y)\) and \(Q = p(x)p(y)\). To compute the lower
bound we follow these steps:
- Sampling \((x, y) \sim p(x, y)\) is equivalent to sampling from \(P\).
- Sampling \((x, y) \sim p(x, y)\) and keeping only \(x\) is equivalent to sampling from \(P_X\); repeating this
with an independent draw and keeping only \(y\) gives a sample from \(P_Y\). Pairing the two gives a sample \((x, y)\) from \(Q = P_X P_Y\).
- As we cannot evaluate \(Q(z)\) for every \((x, y) \sim p(x, y)\), we perform a change of variables so that
the expression only requires sampling from \(Q(z)\), by defining \(G(z)\) as
\[G(z) = \frac{1}{Z} Q(z)e^{F(z)},\;\; \mathrm{where}\;\; Z = \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)}\]
Substituting this formulation into the supremum above, and noting that \(\ln \frac{g(z)}{q(z)} = F(z) - \ln Z\), gives
\[D_{KL}(P||Q) = \sup_{F} \left[ \mathop{\mathbb{E}}_{z \sim P} F(z) - \ln \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)} \right]\]
The above equation is the Donsker-Varadhan lower bound, and it can be applied to mutual information.
\[\begin{align*} I(X; Y) &= D_{KL} (P_{XY} || P_X P_Y)
\\[1em] &= \sup_{F} \left[ \mathop{\mathbb{E}}_{x, y \sim p(x, y)} F(x, y) - \ln \mathop{\mathbb{E}}_{x \sim p(x),\, y \sim p(y)} e^{F(x, y)} \right] \end{align*}\]
\(F\) is an unconstrained function in the above equation. This DV lower bound can therefore be used as an objective
for maximizing mutual information with stochastic gradient descent, using only samples from the distributions.
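Below is a minimal sketch of that idea in the style of neural MI estimators such as MINE, under assumptions not stated in the post: PyTorch is available, the data are correlated Gaussians with a known ground-truth mutual information, and the critic \(F\) is a small hypothetical MLP. Samples from \(P_X P_Y\) are obtained by shuffling the \(y\)'s within each batch, following the steps above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

class Critic(nn.Module):
    """Unconstrained scalar function F(x, y)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def sample_joint(n, rho=0.8):
    # (x, y) ~ p(x, y): correlated standard Gaussians with correlation rho.
    x = torch.randn(n, 1)
    y = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(n, 1)
    return x, y

def dv_bound(critic, x, y):
    # E_{p(x,y)}[F] - ln E_{p(x)p(y)}[e^F]; p(x)p(y) is simulated by shuffling y.
    y_shuffled = y[torch.randperm(y.shape[0])]
    joint_term = critic(x, y).mean()
    marginal_term = torch.logsumexp(critic(x, y_shuffled), dim=0) - math.log(x.shape[0])
    return joint_term - marginal_term

critic = Critic()
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(2000):
    x, y = sample_joint(512)
    loss = -dv_bound(critic, x, y)     # maximize the bound = minimize its negative
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    x, y = sample_joint(4096)
    est = dv_bound(critic, x, y)
# For rho = 0.8 the true value is -0.5 * ln(1 - rho^2) ≈ 0.511 nats.
print(f"DV estimate of I(X; Y): {est.item():.3f} nats")
```

Note that the mini-batch estimate of the \(\ln \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)}\) term makes the gradient of this objective slightly biased; practical estimators correct for this (for example with a moving average of that expectation), but that is beyond the scope of this sketch.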
References