
ML - Week 0

Machine Learning


2.8 Information Theory

  Information theory is concerned with representing data in a compact fashion (a task known as data compression or source coding), as well as with transmitting and storing it in a way that is robust to errors (a task known as error correction or channel coding). At first, this seems far removed from the concerns of probability theory and machine learning, but in fact there is an intimate connection. To see this, note that compactly representing data requires allocating short codewords to highly probable bit strings, and reserving longer codewords for less probable bit strings. This is similar to the situation in natural language, where common words (such as “a”, “the”, “and”) are generally much shorter than rare words. Also, decoding messages sent over noisy channels requires having a good probability model of the kinds of messages that people tend to send. In both cases, we need a model that can predict which kinds of data are likely and which unlikely, which is also a central problem in machine learning (see (MacKay 2003) for more details on the connection between information theory and machine learning).

  Obviously we cannot go into the details of information theory here (see e.g., (Cover and Thomas 2006) if you are interested in learning more). However, we will introduce a few basic concepts that we will need later in the book.

2.8.1 Entropy

  The entropy of a random variable $X$ with distribution $p$, denoted by $\mathbb{H}(X)$ or sometimes $\mathbb{H}(p)$, is a measure of its uncertainty. In particular, for a discrete variable with $K$ states, it is defined by

$$\mathbb{H}(X) \triangleq -\sum_{k=1}^{K} p(X = k)\log_2 p(X = k)$$

Usually we use log base 2, in which case the units are called bits (short for binary digits). If we use log base $e$, the units are called nats. For example, if $X \in \{1, \dots, 5\}$ with histogram distribution $p = [0.25, 0.25, 0.2, 0.15, 0.15]$, we find $\mathbb{H}(X) = 2.2855$. The discrete distribution with maximum entropy is the uniform distribution (see Section 9.2.6 for a proof). Hence for a $K$-ary random variable, the entropy is maximized if $p(x = k) = 1/K$; in this case, $\mathbb{H}(X) = \log_2 K$.
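
The following short Python sketch (using NumPy; the `entropy` helper below is our own, not from any particular library) reproduces the numbers in this example:

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p; zero-probability states contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

p = [0.25, 0.25, 0.2, 0.15, 0.15]   # histogram distribution from the example
print(entropy(p))                   # ~2.2855 bits

K = 5                               # the uniform distribution maximizes entropy
print(entropy(np.ones(K) / K))      # log2(5) ~ 2.3219 bits
```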

  Conversely, the distribution with minimum entropy (which is zero) is any delta-function that puts all its mass on one state. Such a distribution has no uncertainty. In Figure 2.5(b), where we plotted a DNA sequence logo, the height of each bar is defined to be $2 - \mathbb{H}$, where $\mathbb{H}$ is the entropy of that distribution, and $2$ is the maximum possible entropy. Thus a bar of height $0$ corresponds to a uniform distribution, whereas a bar of height $2$ corresponds to a deterministic distribution.
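
As a small illustration of the bar-height formula (the position distribution below is made up for illustration, not taken from Figure 2.5):

```python
import numpy as np

# Hypothetical column of a DNA sequence logo: a distribution over {A, C, G, T}.
pos = np.array([0.7, 0.1, 0.1, 0.1])
H = -np.sum(pos * np.log2(pos))
print(2 - H)   # bar height ~0.64; height 0 means uniform, height 2 means deterministic
```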

  For the special case of binary random variables, $X \in \{0, 1\}$, we can write $p(X = 1) = \theta$ and $p(X = 0) = 1 - \theta$. Hence the entropy becomes

$$\mathbb{H}(X) = -[\theta \log_2 \theta + (1 - \theta)\log_2(1 - \theta)]$$

This is called the binary entropy function, and is also written $\mathbb{H}(\theta)$. We plot this in Figure 2.21. We see that the maximum value of $1$ occurs when the distribution is uniform, $\theta = 0.5$.
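
A minimal sketch of the binary entropy function in Python/NumPy (the `binary_entropy` name is our own):

```python
import numpy as np

def binary_entropy(theta):
    """Binary entropy H(theta) in bits, using the convention 0 * log(0) = 0."""
    theta = np.asarray(theta, dtype=float)
    out = np.zeros_like(theta)
    mask = (theta > 0) & (theta < 1)
    t = theta[mask]
    out[mask] = -(t * np.log2(t) + (1 - t) * np.log2(1 - t))
    return out

thetas = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
print(binary_entropy(thetas))   # peaks at 1 bit for theta = 0.5, drops to 0 at 0 and 1
```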

2.8.2 KL divergence

  One way to measure the dissimilarity of two probability distributions, $p$ and $q$, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy. This is defined as follows:

$$\mathbb{KL}(p \| q) \triangleq \sum_{k=1}^{K} p_k \log \frac{p_k}{q_k}$$

where the sum gets replaced by an integral for pdfs. We can rewrite this as

$$\mathbb{KL}(p \| q) = \sum_k p_k \log p_k - \sum_k p_k \log q_k = -\mathbb{H}(p) + \mathbb{H}(p, q)$$

where $\mathbb{H}(p, q)$ is called the cross entropy,

$$\mathbb{H}(p, q) \triangleq -\sum_k p_k \log q_k$$

  One can show (Cover and Thomas 2006) that the cross entropy is the average number of bits needed to encode data coming from a source with distribution $p$ when we use model $q$ to define our codebook. Hence the “regular” entropy $\mathbb{H}(p) = \mathbb{H}(p, p)$, defined in Section 2.8.1, is the expected number of bits if we use the true model, so the KL divergence is the difference between these. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we used distribution $q$ to encode the data instead of the true distribution $p$.
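
A small numerical check of this decomposition, sketched in Python/NumPy (the helper names are our own): the KL divergence equals the cross entropy minus the entropy, is non-negative, and vanishes when $q = p$.

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log2(p[p > 0]))

def cross_entropy_bits(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return -np.sum(p[p > 0] * np.log2(q[p > 0]))

def kl_bits(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p[p > 0] * np.log2(p[p > 0] / q[p > 0]))

p = np.array([0.25, 0.25, 0.2, 0.15, 0.15])   # "true" source distribution
q = np.ones(5) / 5                            # model used to define the codebook

print(kl_bits(p, q))                              # extra bits per symbol (~0.036)
print(cross_entropy_bits(p, q) - entropy_bits(p)) # same value: H(p, q) - H(p)
print(kl_bits(p, p))                              # 0: no penalty when q = p
```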

  The “extra number of bits” interpretation should make it clear that $\mathbb{KL}(p \| q) \geq 0$, and that the KL is only equal to zero iff $q = p$. We now give a proof of this important result.

2.8.3.1 Mutual information for continuous random variables *

  The above formula for MI is defined for discrete random variables. For continuous random variables, it is common to first discretize or quantize them, by dividing the ranges of each variable into bins, and computing how many values fall in each histogram bin (Scott 1979). We can then easily compute the MI using the formula above (see mutualInfoAllPairsMixed for some code, and miMixedDemo for a demo).
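
A sketch of this discretization approach in Python/NumPy (mutualInfoAllPairsMixed and miMixedDemo are the functions referenced above, presumably from the book's accompanying MATLAB code; the `mi_by_binning` helper below is our own illustration, not a port of them):

```python
import numpy as np

def mi_by_binning(x, y, bins=10):
    """Estimate I(X; Y) in bits by quantizing x and y into histogram bins."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                # empirical joint distribution
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x (column vector)
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y (row vector)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = x + 0.5 * rng.normal(size=10_000)             # strongly dependent on x
print(mi_by_binning(x, y))                        # clearly positive
print(mi_by_binning(x, rng.normal(size=10_000)))  # near zero for independent samples
# Changing `bins` noticeably shifts these estimates, which is exactly the
# sensitivity discussed in the next paragraph.
```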

  Unfortunately, the number of bins used, and the location of the bin boundaries, can have a significant effect on the results. One way around this is to try to estimate the MI directly, without first performing density estimation.
