11  Entropy and the Kullback-Leibler divergence

We have defined probability distributions and their associated probability mass function and probability density functions. There are an infinitude of distributions we could use to model data generation processes, and we might wish to compare distributions. To do that, we need a definition of “closeness” of two probability distributions. To get this definition, we need to turn to notions about information.

11.1 Information and entropy

Formally, information is the reduction in ignorance derived from learning an outcome. Say event \(i\) happens with probability \(p_i\). If \(i\) is very probable and we observe it, we haven’t learned much. For example, if we observe that the current pope is Catholic, we haven’t learned much about popes; we are still pretty ignorant about them. But if \(i\) is very improbable and we observe it, we have learned a lot. If we instead observe that the current pope is American, we have learned something new and interesting about popes.

To codify this in mathematical terms, we might think that the information gained by observing event \(i\) should scale like \(1/p_i\), since more rare events give higher information.

Now, say we observe two independent events, \(i\) and \(j\). Since they are independent, the information garnered from observing both should be the sum of the information garnered from observing each. The probability of observing both is \(p_ip_j\), so under our candidate measure the information gained from observing both would be \(1/p_ip_j\). But

\[\begin{aligned} \frac{1}{p_i} + \frac{1}{p_j} \ne \frac{1}{p_ip_j}. \end{aligned} \tag{11.1}\]

So, our current metric of information as \(1/p_i\) does not satisfy this additivity requirement. However,

\[\begin{aligned} \log\frac{1}{p_i} + \log\frac{1}{p_j} = \log \frac{1}{p_ip_j}. \end{aligned} \tag{11.2}\]

So, we choose \(\log (1/p_i) = -\log p_i\) as a measure of information. We are free to choose the base of the logarithm, and it is traditional to choose base 2. The units of information are then called bits. We, however, will use natural logarithms for convenience.
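To see this additivity numerically, here is a minimal sketch in Python. The probabilities `p_i` and `p_j` are made up for illustration; the point is that the information of two independent events, measured as \(-\ln p\), adds.

```python
import numpy as np

# Made-up probabilities of two independent events
p_i = 0.1
p_j = 0.25

# Information (using natural logarithms) of each event and of observing both
info_i = -np.log(p_i)
info_j = -np.log(p_j)
info_both = -np.log(p_i * p_j)

# The information from independent events adds
print(np.isclose(info_i + info_j, info_both))  # True
```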

Now, say we have an ensemble of events. Then the average information we get from observing events (i.e., the level of surprise) is

\[\begin{aligned} H[p] = -\sum_i p_i\,\ln p_i. \end{aligned} \tag{11.3}\]

This is called the Shannon entropy or informational entropy. It has its name because of its relation to the same quantity in statistical thermodynamics. We will not delve into that, although it is a rich and beautiful subject.

Let’s look at the Shannon entropy another way. Say we know all of the \(p_i\)’s. How much do we know about which events we might observe? If the probability distribution is flat, not much. Conversely, if it is sharply peaked, we know a lot about which event we will observe. In the latter case, observing an event does not give us much information beyond what we already knew from the probabilities. So, \(H[p]\) is a measure of ignorance. It tells us how uncertain or unbiased we are ahead of an observation.
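As a numerical illustration (a sketch with made-up distributions and a hypothetical helper function `shannon_entropy`), we can compute the entropy of a flat distribution and of a sharply peaked one and confirm that the flat distribution, about which we are most ignorant, has the higher entropy.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy, -sum(p * ln p), ignoring zero-probability events."""
    p = np.asarray(p)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A flat distribution over four events vs. a sharply peaked one
p_flat = np.array([0.25, 0.25, 0.25, 0.25])
p_peaked = np.array([0.97, 0.01, 0.01, 0.01])

print(shannon_entropy(p_flat))    # ln 4 ≈ 1.386
print(shannon_entropy(p_peaked))  # ≈ 0.168
```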

I pause to note that we took a shortcut to this definition of entropy, using some intuitive reasoning and the requirement that information from independent events adds. A more careful derivation was done in 1948 by Claude Shannon. He showed that the function we wrote for the entropy is the only function that satisfies three desiderata about measurements of ignorance.

  • Entropy is continuous in \(p_i\).

  • If all \(p_i\) are equal, entropy is monotonically increasing in \(n\), the number of events we could observe.

  • Entropy satisfies a composition law; grouping of events does not change the value of entropy.

The derivation is beautiful, but we will not go into it here.

11.2 Cross entropy

We can extend this notion of entropy to define cross entropy, \(H[p, q]\). This is the amount of information (or loss of ignorance) needed to identify an event \(i\) described by probability \(p_i\) when we use some other probability \(q_i\). In other words, it tells us how much ignorance we have in using \(q\) to describe events governed by \(p\). The cross entropy is

\[\begin{aligned} H[p,q] = -\sum_i p_i\, \ln q_i. \end{aligned} \tag{11.4}\]
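Here is a minimal sketch of this computation, with made-up distributions \(p\) and \(q\) and a hypothetical helper function `cross_entropy`. Note that \(H[p,q]\) is at least as large as \(H[p] = H[p,p]\), a point we will make precise below.

```python
import numpy as np

def cross_entropy(p, q):
    """Cross entropy H[p, q] = -sum(p * ln q)."""
    p, q = np.asarray(p), np.asarray(q)
    return -np.sum(p * np.log(q))

# Made-up distributions over three events
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(cross_entropy(p, q))  # H[p, q] ≈ 1.054
print(cross_entropy(p, p))  # H[p] ≈ 1.030
```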

11.3 The Kullback-Leibler divergence

We may think about how close \(p\) and \(q\) are. The additional entropy induced by using \(q\) instead of \(p\) is \(H[p, q] - H[p]\). We can use this as a measure of closeness of \(q\) to \(p\). This is called the Kullback-Leibler divergence, also known as the KL divergence,

\[\begin{aligned} D_\mathrm{KL}(p \| q) = H[p,q] - H[p] = -\sum_i p_i\, \ln q_i + \sum_i p_i\,\ln p_i = \sum_i p_i\,\ln\frac{p_i}{q_i}. \end{aligned} \tag{11.5}\]

Note that the KL divergence is not a distance metric; it is not symmetric, with \(D_\mathrm{KL}(p \| q) \ne D_\mathrm{KL}(q \| p)\) in general (they are equal when \(p = q\)). So, the KL divergence \(D_\mathrm{KL}(p \| q)\) is a measure of the closeness of \(q\) to \(p\), and not a measure of the closeness of \(p\) to \(q\).
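Continuing the sketch with the same made-up \(p\) and \(q\) as above, we can compute the KL divergence directly from equation 11.5 and verify that it is not symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) = sum(p * ln(p / q))."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

# Made-up distributions over three events
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

print(kl_divergence(p, q))  # ≈ 0.0253, also equal to H[p, q] - H[p]
print(kl_divergence(q, p))  # ≈ 0.0258; not equal, so not symmetric
```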

The definition of the KL divergence generalizes to continuous distributions and their probability density functions as 1

\[\begin{aligned} D_\mathrm{KL}(f_0 \| f) = \int \mathrm{d}y\,\,f_0(y)\,\ln\frac{f_0(y)}{f(y)}. \end{aligned} \tag{11.6}\]
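As a numerical sketch of the continuous case (the choice of Normal distributions and their parameters is made up for illustration), we can approximate this integral by quadrature.

```python
import numpy as np
import scipy.integrate
import scipy.stats

# Two made-up Normal distributions: f0 = Norm(0, 1), f = Norm(1, 2)
f0 = scipy.stats.norm(0, 1)
f = scipy.stats.norm(1, 2)

# Integrand of the KL divergence, f0(y) * ln(f0(y) / f(y))
def integrand(y):
    return f0.pdf(y) * (f0.logpdf(y) - f.logpdf(y))

D_kl, _ = scipy.integrate.quad(integrand, -np.inf, np.inf)
print(D_kl)  # ≈ 0.443, matching the closed form for two Normals, ln 2 + 1/4 - 1/2
```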

11.4 Jensen’s inequality

In a moment, we will prove that the Kullback-Leibler divergence is always positive, except when \(q = p\), when it is zero. An important and useful result toward that end is Jensen’s inequality.

A function \(f(\mathbf{x})\) is convex if for any \(\mathbf{x}\) and \(\mathbf{y}\) on the domain of \(f\) and \(\theta\) with \(0 \le \theta \le 1\),

\[\begin{align} f(\theta\mathbf{x} + (1 - \theta)\mathbf{y}) \le \theta f(\mathbf{x}) + (1 - \theta)f(\mathbf{y}). \end{align} \tag{11.7}\]

The function \(f(\mathbf{x})\) is strictly convex if strict inequality holds in inequality 11.7 for any \(\mathbf{x} \ne \mathbf{y}\) and \(0 < \theta < 1\). This is a mathematical statement that the curve lies below any chord connecting two points on the function.

Now consider \(\{\theta_1, \theta_2, \ldots, \theta_n\}\), with all \(\theta_i\)’s satisfying \(0 \le \theta_i \le 1\) with \(\sum_{i=1}^n \theta_i = 1\). Then, for a convex function \(f\),

\[\begin{align} f\left(\sum_{i=1}^n\theta_i \mathbf{x}_i\right) \le \sum_{i=1}^n\theta_i \, f(\mathbf{x}_i). \end{align} \tag{11.8}\]

This is known as Jensen’s inequality. If \(f\) is strictly convex, then equality is achieved if and only if \(\mathbf{x}_i\) is the same value for all \(i\). This has obvious implications when considering \(\theta_i\) to be a probability mass. If \(\theta_i = P(\mathbf{x}_i)\), the probability of observing \(\mathbf{x}_i\), then Jensen’s inequality relates expectations,

\[\begin{align} f(\langle \mathbf{x} \rangle) = f\left(\sum_{i=1}^n \mathbf{x}_i\, P(\mathbf{x}_i)\right) \le \langle f(\mathbf{x}) \rangle = \sum_{i=1}^n f(\mathbf{x}_i)\,P(\mathbf{x}_i). \end{align} \tag{11.9}\]

This extends to continuous distributions with \(P(\mathbf{x})\) being a probability density such that

\[\begin{align} \int \mathrm{d}\mathbf{x}\,P(\mathbf{x}) = 1, \end{align} \tag{11.10}\]

so that

\[\begin{align} f(\langle \mathbf{x} \rangle) = f\left(\int\mathrm{d}\mathbf{x} \,\mathbf{x}\, P(\mathbf{x})\right) \le \langle f(\mathbf{x}) \rangle = \int\mathrm{d}\mathbf{x}\,f(\mathbf{x})\,P(\mathbf{x}). \end{align} \tag{11.11}\]

It is evident, then, that inequality 11.7 is a special case of Jensen’s inequality.
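Here is a small numerical check of Jensen’s inequality, a sketch using a made-up discrete distribution (random probabilities drawn from a Dirichlet distribution) and the strictly convex function \(f(x) = x^2\).

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

# A made-up discrete distribution over a few values of x
x = np.array([-1.0, 0.5, 2.0, 3.5])
p = rng.dirichlet(np.ones(len(x)))  # random probabilities summing to one

f = lambda x: x**2  # a strictly convex function

# f(<x>) <= <f(x)> by Jensen's inequality
lhs = f(np.sum(x * p))
rhs = np.sum(f(x) * p)
print(lhs <= rhs)  # True
```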

11.5 Gibbs’s inequality (positivity of the Kullback-Leibler divergence)

Recall that with \(\theta_i = P(\mathbf{x}_i)\), Jensen’s inequality relates expectations for a convex function \(f\): \(f(\langle \mathbf{x} \rangle) \le \langle f(\mathbf{x}) \rangle\). The direction of the inequality is flipped for concave \(f\),

\[\begin{align} f(\langle \mathbf{x} \rangle) \ge \langle f(\mathbf{x}) \rangle. \end{align} \tag{11.12}\]

We will now prove that \(-D_\mathrm{KL}(p\| q) \le 0\), which is equivalent to proving nonnegativity of the Kullback-Leibler divergence, \(D_\mathrm{KL}(p\| q) \ge 0\). The nonnegativity of the Kullback-Leibler divergence is known as Gibbs’s inequality.

We have

\[ \begin{align} -D_\mathrm{KL}(p\| q) = -\sum_{i=1}^N p_i\,\ln\left(\frac{p_i}{q_i}\right) = \sum_{i=1}^N p_i\,\ln\left(\frac{q_i}{p_i}\right) = \left\langle \ln\left(\frac{q_i}{p_i}\right)\right\rangle. \end{align} \tag{11.13}\]

Note that the logarithm function is strictly concave, since its second derivative, \(\mathrm{d}^2\ln x/\mathrm{d}x^2 = -1/x^2\), is always negative. Then, by Jensen’s inequality, we have

\[\begin{align} \left\langle \ln\left(\frac{q_i}{p_i}\right)\right\rangle \le \ln \left\langle\frac{q_i}{p_i}\right\rangle = \ln\left(\sum_{i=1}^N p_i\,\frac{q_i}{p_i}\right) = \ln\left(\sum_{i=1}^N q_i\right) = \ln 1 = 0. \end{align} \tag{11.14}\]

We have almost proved Gibbs’s inequality; we just need to prove that equality only holds when \(p_i = q_i\) for all \(i\). To do this, we first write a result from the above analysis. Let \(x_i = q_i/p_i\). Then,

\[\begin{align} \sum_{i=1}^N p_i\,\ln x_i \le \ln\left(\sum_{i=1}^N p_i x_i\right). \end{align} \tag{11.15}\]

Because the logarithm is strictly concave, by Jensen’s inequality, \(x_i\) must be a constant if the above holds with equality. Thus, \(q_i = c p_i\) for all \(i\), where \(c\) is a constant. But,

\[\begin{align} 1 = \sum_{i=1}^N q_i = \sum_{i=1}^N c p_i = c \sum_{i=1}^N p_i = c. \end{align} \tag{11.16}\]

Since \(c = 1\), equality is achieved in Gibbs’s inequality only when \(p_i = q_i\) for all \(i\).
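As a final sketch, we can check Gibbs’s inequality numerically by drawing many pairs of made-up distributions (again from a Dirichlet distribution) and confirming that the KL divergence is never negative, and that it is zero when \(q = p\).

```python
import numpy as np

rng = np.random.default_rng(seed=3252)

def kl_divergence(p, q):
    """Kullback-Leibler divergence D_KL(p || q) = sum(p * ln(p / q))."""
    return np.sum(p * np.log(p / q))

# Draw many random pairs of distributions over five events
divergences = []
for _ in range(10_000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    divergences.append(kl_divergence(p, q))

print(np.min(divergences) >= 0)            # True: D_KL(p || q) >= 0
print(np.isclose(kl_divergence(p, p), 0))  # True: zero when q = p
```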



  1. I am playing a little fast and loose here converting sums to integrals. There are some subtleties involved therein, but we will not delve into those here.↩︎