Donsker-Varadhan Lower Bound
Recently I have been obsessed with the concept of mutual information; in layman's terms,
it signifies the degree of dependency
between two variables. I came across an interesting paper
relevant to this concept. More specifically, it shows that any distribution-free high-confidence lower
bound on mutual information cannot be larger than \(O(\ln N)\), where \(N\) is
the size of the data sample. This blog will focus only on the computation of the lower bound
that the paper builds on.
Terminology
Before we start to discuss the Donsker-Varadhan lower bound, let us define some terminology.
First, the Kullback–Leibler (KL) divergence measures how one probability
distribution differs from another. It is defined as
\[D_{KL}(P || Q) := \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\]
Some of the properties of KL divergence are discussed below
- \(D_{KL}(P||Q)\) equals zero if and only if \(p(x) = q(x)\) for all \(x\)
- The range of KL divergence is \([0, \infty]\); it cannot be negative
- \(D_{KL}(P||Q)\) cannot be considered a metric, as it is not symmetric:
\[D_{KL}(P||Q) \neq D_{KL}(Q||P)\]
- For KL divergence to be finite, the support of \(p(x)\) must be contained within the
support (domain) of \(q(x)\).
If \(q(x) = 0\) anywhere that \(p(x) > 0\), then \(D_{KL}(P||Q) = \infty\)
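To make these properties concrete, here is a minimal sketch (assuming NumPy; the three-state distributions `p` and `q` are made-up examples, not from the paper) that evaluates the definition above and reproduces the non-negativity, asymmetry, and infinite-divergence cases.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    if np.any(q[mask] == 0):          # p(x) > 0 where q(x) = 0
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.6, 0.3]

print(kl_divergence(p, q))  # ~0.516, strictly positive
print(kl_divergence(q, p))  # ~0.377, a different value: not symmetric
print(kl_divergence(p, p))  # 0.0 when the two distributions coincide
print(kl_divergence([0.5, 0.5, 0.0], [1.0, 0.0, 0.0]))  # inf: q = 0 where p > 0
```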
Mutual information can also be defined as the KL divergence between the two distributions \(p(x, y)\) and
\(p(x) p(y)\). If the two variables are independent of each other, then \(p(x, y) = p(x)p(y)\) and \(D_{KL} = 0\);
if they are not, the divergence between these distributions is a measure of their dependency.
\[I(X; Y) := \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)p(y)}\]
In the subsequent sections we will use the notation \(P_{XY} = p(x, y),\; P_X = p(x),\; P_Y = p(y)\).
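As a quick sanity check of this identity, the short sketch below (again assuming NumPy; the 2×2 joint table is a made-up example) computes \(I(X; Y)\) directly as the KL divergence between the joint and the product of its marginals.

```python
import numpy as np

p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])        # hypothetical joint distribution p(x, y)
p_x = p_xy.sum(axis=1)                 # marginal p(x)
p_y = p_xy.sum(axis=0)                 # marginal p(y)

# I(X; Y) = D_KL(P_XY || P_X P_Y), summed over all (x, y) pairs
mi = float(np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y))))
print(f"I(X; Y) = {mi:.4f} nats")      # ~0.126; it would be 0 if p(x, y) = p(x)p(y)
```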
Proof
Without further ado, we jump right into the proof of the DV bound. We start by assuming
three distributions \(P, \; Q, \; G\) defined on the same support. The analysis also assumes
discrete distributions.
\[\begin{align*} D_{KL}(P || Q) &= \mathop{\mathbb{E}}_{z \sim P} \ln \frac{p(z)}{q(z)}
\\[1em] &= \mathop{\mathbb{E}}_{z \sim P} \ln \left(\frac{g(z)}{q(z)}\frac{p(z)}{g(z)}\right)
\\[1em] &= \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)} + D_{KL}(P||G)
\\[1em] &\geq \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)} \end{align*}\]
Equality in the last step is achieved only when \(p(z) = g(z)\), since only then does \(D_{KL}(P||G)\) vanish; therefore
\[D_{KL}(P||Q) = \sup_G \mathop{\mathbb{E}}_{z \sim P} \ln \frac{g(z)}{q(z)}\]
\(D_{KL}(P||Q)\) is the supremum over all \(G\), attained when the equality condition is satisfied. We can
parameterize \(G(z)\) so that the expectation can be computed. However, our ultimate aim is to compute
\(D_{KL}(P_{XY} || P_X P_Y)\), where we can only access the distributions via sampling.
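Before moving to the sampled setting, a toy numeric illustration of the supremum formula may help (assuming NumPy; the three-state distributions and the mixture family for \(G\) are made up for illustration): for any candidate \(G\), the inner expectation stays below \(D_{KL}(P||Q)\) and reaches it exactly when \(g = p\).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl_pq = np.sum(p * np.log(p / q))          # the true D_KL(P || Q)

for alpha in [0.0, 0.25, 0.5, 0.75, 1.0]:
    g = (1 - alpha) * q + alpha * p        # a family of candidate G's from q to p
    bound = np.sum(p * np.log(g / q))      # E_{z ~ P}[ln g(z) / q(z)]
    print(f"alpha = {alpha:.2f}  bound = {bound:.4f}  D_KL = {kl_pq:.4f}")
# The bound increases toward D_KL(P || Q) and is attained at alpha = 1, i.e. g = p.
```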
For our mutual-information problem, \(P = p(x, y)\) and \(Q = p(x)p(y)\). To compute the lower
bound we follow these steps:
- Sampling \((x, y) \sim p(x, y)\) is equivalent to sampling from \(P\).
- Sampling \((x, y) \sim p(x, y)\) and keeping only \(x\) is equivalent to sampling from \(P_X\); repeating this
with an independent draw and keeping only \(y\) gives a sample from \(P_Y\). Pairing the two gives a sample \((x, y)\) from \(Q = P_X P_Y\).
- As we cannot evaluate \(Q(z)\) for every \((x, y) \sim p(x, y)\), we perform a change of variables so that
the expression only requires sampling from \(Q(z)\), by defining \(G(z)\) as
\[G(z) = \frac{1}{Z} Q(z)e^{F(z)},\;\; \mathrm{where}\;\; Z = \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)}\]
Substituting this formulation into the supremum above, and noting that \(\ln \frac{g(z)}{q(z)} = F(z) - \ln Z\), gives
\[D_{KL}(P||Q) = \sup_{F} \left[ \mathop{\mathbb{E}}_{z \sim P} F(z) - \ln \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)} \right]\]
The above equation is the Donsker-Varadhan lower bound, and it can be applied to mutual information.
\[\begin{align*} I(X; Y) &= D_{KL} (P_{XY} || P_X P_Y)
\\[1em] &= \sup_{F} \left[ \mathop{\mathbb{E}}_{x, y \sim p(x, y)} F(x, y) - \ln \mathop{\mathbb{E}}_{x \sim p(x),\, y \sim p(y)} e^{F(x, y)} \right] \end{align*}\]
\(F\) is an unconstrained function in the above equation. This DV lower bound can therefore be used as an objective
for maximizing mutual information with stochastic gradient descent, using only samples from the distributions.
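Below is a minimal sketch of that idea in the style of neural MI estimators such as MINE, under assumptions not stated in the post: PyTorch is available, the data are correlated Gaussians with a known ground-truth mutual information, and the critic \(F\) is a small hypothetical MLP. Samples from \(P_X P_Y\) are obtained by shuffling the \(y\)'s within each batch, following the steps above.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

class Critic(nn.Module):
    """Unconstrained scalar function F(x, y)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def sample_joint(n, rho=0.8):
    # (x, y) ~ p(x, y): correlated standard Gaussians with correlation rho.
    x = torch.randn(n, 1)
    y = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(n, 1)
    return x, y

def dv_bound(critic, x, y):
    # E_{p(x,y)}[F] - ln E_{p(x)p(y)}[e^F]; p(x)p(y) is simulated by shuffling y.
    y_shuffled = y[torch.randperm(y.shape[0])]
    joint_term = critic(x, y).mean()
    marginal_term = torch.logsumexp(critic(x, y_shuffled), dim=0) - math.log(x.shape[0])
    return joint_term - marginal_term

critic = Critic()
opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for step in range(2000):
    x, y = sample_joint(512)
    loss = -dv_bound(critic, x, y)     # maximize the bound = minimize its negative
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    x, y = sample_joint(4096)
    est = dv_bound(critic, x, y)
# For rho = 0.8 the true value is -0.5 * ln(1 - rho^2) ≈ 0.511 nats.
print(f"DV estimate of I(X; Y): {est.item():.3f} nats")
```

Note that the mini-batch estimate of the \(\ln \mathop{\mathbb{E}}_{z \sim Q} e^{F(z)}\) term makes the gradient of this objective slightly biased; practical estimators correct for this (for example with a moving average of that expectation), but that is beyond the scope of this sketch.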
References