Supervised Learning

1 Performance measures for statistical output


1.1 Bhattacharyya Coefficient (BC)

Measures the overlap between P(x) and Q(x).

\begin{equation} BC(P,Q) = \int_{x} \sqrt{P(x)Q(x)}\,dx \end{equation}
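A minimal numerical sketch of the discrete form \(BC(P,Q) = \sum_{i}\sqrt{P_{i}Q_{i}}\) (the example distributions below are assumed for illustration, not taken from the source): identical distributions overlap completely and give \(BC = 1\), disjoint ones give \(BC = 0\).

\begin{verbatim}
import numpy as np

def bhattacharyya_coefficient(p, q):
    # Discrete Bhattacharyya coefficient: sum_i sqrt(p_i * q_i).
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum(np.sqrt(p * q))

print(bhattacharyya_coefficient([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))  # 1.0 (full overlap)
print(bhattacharyya_coefficient([1.0, 0.0], [0.0, 1.0]))            # 0.0 (no overlap)
\end{verbatim}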

1.2 KL Divergence

Measures how much information is lost when distribution Q is used to approximate distribution P.

\begin{equation} KL(P|Q) = \int_{x} P(x)\log{\frac{P(x)}{Q(x)}}dx \end{equation}
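A small sketch of the discrete case (the example distributions are assumed for illustration): the divergence is zero when Q equals P and grows as Q becomes a worse approximation of P.

\begin{verbatim}
import numpy as np

def kl_divergence(p, q):
    # Discrete KL(P|Q): sum_i p_i * log(p_i / q_i); terms with p_i = 0 contribute nothing.
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0   (Q matches P exactly)
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.37 (Q is a poor approximation of P)
\end{verbatim}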

2 Cross Entropy


2.1 Entropy

Measures the average level of surprise in the outcome of a random variable. A probability distribution with sharp peaks has low entropy; a uniform one has high entropy. Entropy can also be read as expected information content: it measures the average number of bits needed to transmit a randomly selected event from a probability distribution. An event carries more information the less likely it is.

\begin{equation} H(X) = -\sum_{x \in X}{p(x)\log{p(x)}} = \mathbb{E}[-\log{p(x)}] \end{equation}
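A short sketch illustrating the claim above (the example distributions are assumed): a peaked distribution has low entropy, while a uniform one over four outcomes reaches the maximum \(\log_{2}{4} = 2\) bits.

\begin{verbatim}
import numpy as np

def entropy(p):
    # H(X) = -sum_x p(x) log2 p(x), in bits; 0 * log 0 is taken to be 0.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits (peaked -> low entropy)
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits   (uniform -> high entropy)
\end{verbatim}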

2.2 Cross Entropy

Intro

Cross entropy is the average number of bits needed to encode data coming from a source distributed according to P when the code is built from model Q. Formally:

\begin{align} H(P,Q) &= H(P) + KL(P|Q) \\ &= -\int_{x} P(x)\log{P(x)}dx + \int_{x} P(x)\log{\frac{P(x)}{Q(x)}}dx \\ &= -\int_{x} P(x)\log{P(x)}dx + \int_{x} P(x)\bigl(\log{P(x)} - \log{Q(x)}\bigr)dx \\ &= -\int_{x} P(x)\log{P(x)}dx + \int_{x} P(x)\log{P(x)}dx - \int_{x} P(x)\log{Q(x)}dx \\ &= -\int_{x} P(x)\log{Q(x)}dx \end{align}
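The identity \(H(P,Q) = H(P) + KL(P|Q)\) can also be checked numerically; the two distributions in this sketch are arbitrary illustrative choices.

\begin{verbatim}
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # source distribution P
q = np.array([0.5, 0.3, 0.2])  # model distribution Q

cross_entropy = -np.sum(p * np.log(q))
entropy_p     = -np.sum(p * np.log(p))
kl_pq         =  np.sum(p * np.log(p / q))

print(cross_entropy)      # ~0.887
print(entropy_p + kl_pq)  # same value, as the derivation predicts
\end{verbatim}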
  • Expected value of the cross entropy measurement in the discrete case (a numerical sketch follows this list):
\begin{equation} \mathbb{E}_{D}H(P,Q) = \frac{1}{|D|}\sum_{x \in D}\Bigl(-\sum_{i \in K}P_{i}(x)\log{Q_{i}(x)}\Bigr) \end{equation}

where \(D\) is the dataset and \(K\) is the set of classes of the classification problem.

  • Formula for the binary case
\begin{align} \mathbb{E}_{D}H(P,Q) &= \frac{1}{|D|}\sum_{x \in D}\Bigl(-\sum_{i \in K}P_{i}(x)\log{Q_{i}(x)}\Bigr)\\ &= \frac{1}{|D|}\sum_{x \in D}\Bigl(-\sum_{i \in \{0,1\}}P_{i}(x)\log{Q_{i}(x)}\Bigr)\\ &= \frac{1}{|D|}\sum_{x \in D}\Bigl(- (P_{0}(x)\log{Q_{0}(x)} + P_{1}(x)\log{Q_{1}(x)})\Bigr) \\ &= \frac{1}{|D|}\sum_{x \in D}\Bigl(- (P_{0}(x)\log{Q_{0}(x)} + (1-P_{0}(x))\log{(1-Q_{0}(x))})\Bigr) \end{align}
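The numerical sketch referenced above (labels and predicted probabilities are made-up illustrative values): the multi-class formula with one-hot \(P\), and the binary shortcut that only needs \(Q_{0}(x)\) since \(Q_{1}(x) = 1 - Q_{0}(x)\).

\begin{verbatim}
import numpy as np

# Multi-class case: P holds one-hot true labels, Q the predicted class probabilities.
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
Q = np.array([[0.8, 0.1, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.2, 0.7]])
multiclass_ce = np.mean(-np.sum(P * np.log(Q), axis=1))
print(multiclass_ce)  # ~0.31

# Binary case: only the probability of class 0 is needed.
p0 = np.array([1.0, 0.0, 1.0])  # true probability of class 0 per example
q0 = np.array([0.9, 0.2, 0.6])  # predicted probability of class 0
binary_ce = np.mean(-(p0 * np.log(q0) + (1 - p0) * np.log(1 - q0)))
print(binary_ce)  # ~0.28
\end{verbatim}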

Proof

  • Hypothesis: by the Kraft-McMillan theorem, a value \(x_{i}\) encoded with \(l_{i}\) bits corresponds to an implied model probability \(Q(x_{i}) = 2^{-l_{i}}\).
  • Thesis: \(H(P,Q) = -\sum_{x}P(x)\log{Q(x)}\)
  • Proof: the true probability of drawing \(x_{i}\), and therefore of spending \(l_{i}\) bits on it, is \(P(x_{i})\), so the expected code length is \begin{align} \mathbb{E}_{P}[l] &= -\mathbb{E}_{P}[\log_{2}{Q(x)}] \\ &= -\sum_{x_{i}}\underbrace{P(x_{i})}_{P(l_{i})}\underbrace{\log_{2}{Q(x_{i})}}_{-l_{i}} = H(P,Q) \end{align}
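A quick numerical check of the proof (the code lengths and the source distribution are assumed for illustration): choosing lengths \(l_{i}\) fixes \(Q(x_{i}) = 2^{-l_{i}}\), and the expected code length under \(P\) equals the cross entropy \(-\sum_{x_{i}}P(x_{i})\log_{2}{Q(x_{i})}\).

\begin{verbatim}
import numpy as np

lengths = np.array([1, 2, 3, 3])    # l_i, a valid prefix code (Kraft sum = 1)
Q = 2.0 ** (-lengths)               # implied model distribution Q(x_i) = 2^(-l_i)
P = np.array([0.4, 0.3, 0.2, 0.1])  # true source distribution

expected_length = np.sum(P * lengths)
cross_entropy   = -np.sum(P * np.log2(Q))
print(expected_length, cross_entropy)  # both 1.9 bits
\end{verbatim}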