Supervised Learning

1 Probabilistic Classification

  • \(P(Y=c) = \pi_{c}\): Prior probability, the probability of seeing a data case of class \(c\). For visualization purposes, it is a function with \(c\) on the x axis and \(P(Y=c)\) on the y axis.
  • \(P(X=x|Y=c) = \phi_{c}(x)\): Likelihood function, the probability density of seeing a data vector \(x \in \mathbb{R}^{D}\) given that it belongs to class \(c\). For visualization purposes, it is a function with \(x\) (samples) on the x axis and \(P(X=x|Y=c)\) on the y axis. The function refers to a specific class \(c\) and describes how likely it is to observe a sample \(x_{i}\) under class \(c\);
  • Using Bayes' rule we can compute the a-posteriori probability that any observed feature vector belongs to class \(c\). This distribution also takes the frequency of each class into account (a numeric sketch follows below). \begin{align} P(Y=c|X=x) = \frac{P(X=x|Y=c)P(Y=c)}{\sum_{c'}P(X=x|Y=c')P(Y=c')} = \frac{\phi_{c}(x)\pi_{c}}{\sum_{c'}\phi_{c'}(x)\pi_{c'}} \end{align}
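
As a minimal numeric sketch of this computation (the priors, the 1-D Gaussian likelihoods, and names such as `posterior` are illustrative assumptions, not part of the notes):

```python
# Hedged sketch: Bayes' rule posterior for a 2-class problem.
# The priors, the 1-D Gaussian likelihoods and all names/values are illustrative.
from scipy.stats import norm

priors = {"c0": 0.7, "c1": 0.3}                      # pi_c
likelihoods = {                                       # phi_c(x)
    "c0": lambda x: norm.pdf(x, loc=0.0, scale=1.0),
    "c1": lambda x: norm.pdf(x, loc=2.0, scale=1.0),
}

def posterior(x):
    """P(Y=c | X=x) for every class c, via Bayes' rule."""
    joint = {c: likelihoods[c](x) * priors[c] for c in priors}   # phi_c(x) * pi_c
    evidence = sum(joint.values())                               # sum over c'
    return {c: joint[c] / evidence for c in joint}

print(posterior(1.0))   # the two values sum to 1
```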

1.1 Maximum likelihood classification

  1. Learn from the dataset \(D\) a likelihood function for every class \(c_{i}\); it can be written as \(P(x|c_{i})\).
  2. Given a sample \(x'\), evaluate each one of the \(k\) likelihood functions.
  3. Pick the class \(c'\) with the highest likelihood, \(\arg\max_{c}P(x'|c)\).

The problem with this method is that it ignores how frequent each class is. The likelihood \(P(x'|c')\) of an example \(x'\) might come out high, e.g. \(50\%\), but it may have been estimated from a dataset with very few examples of class \(c'\). This means that class \(c'\) might be picked instead of class \(c''\), whose likelihood function is estimated on many examples and which is much more frequent overall.

1.2 Maximum A-posteriori classification

The a-posteriori classification solves the previous problem.

  1. Learn from the dataset \(D\) a likelihood function for every class \(c_{i}\); it can be written as \(P(x|c_{i})\);
  2. Given \(x'\), apply Bayes' rule and obtain \(P(c_{i}|x')\);
  3. Pick the class \(c'\) that has the highest a-posteriori probability, \(\arg\max_{c}P(c|x')\); a small sketch contrasting this rule with the maximum likelihood one follows below.
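
A small sketch contrasting the two decision rules under an imbalanced prior (the Gaussian likelihoods and all numbers are made-up assumptions):

```python
# Hedged sketch: maximum likelihood vs maximum a-posteriori decision.
# Gaussian likelihoods and the specific numbers are illustrative assumptions.
from scipy.stats import norm

priors = {"rare": 0.05, "common": 0.95}
likelihoods = {
    "rare":   lambda x: norm.pdf(x, loc=1.0, scale=1.0),
    "common": lambda x: norm.pdf(x, loc=0.0, scale=1.0),
}

x_new = 0.8
ml_choice  = max(likelihoods, key=lambda c: likelihoods[c](x_new))          # argmax_c P(x'|c)
map_choice = max(priors, key=lambda c: likelihoods[c](x_new) * priors[c])   # argmax_c P(x'|c)P(c)

print(ml_choice, map_choice)   # here ML picks "rare", MAP picks "common"
```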

1.3 Naive Bayes Classification

The point of this technique is to approximate the likelihood function of each class, for each feature, in a way that prevents overfitting. If we have a small dataset and we just count elements, the risk of overfitting is high. The idea is to assume the shape of the distribution of the likelihood function for each class and for each feature (usually a Gaussian distribution). Let's imagine that our \(x\) vectors have just 1 feature. Now we have to find the parametric maximum likelihood estimate, i.e. the parameters \(\theta\) that, plugged into this formula, give the highest value:

\begin{equation} \theta' = \arg\max_{\theta}\prod_{x_{i} \in D}p(x_{i}|\theta) = \arg\max_{\theta}\sum_{x_{i} \in D}\log p(x_{i}|\theta) \end{equation}

To make it a little bit more visual: if many \(x\) values lie near \(\bar{x}\), it is better if the maximum of your distribution is at \(x=\bar{x}\); for a Gaussian, this means the mean has to be \(\bar{x}\). If we assume the distribution is a Gaussian, then we want to maximize this function:

\begin{equation} L(\mu, \sigma) = \sum_{i=1}^{N}\ln\Bigl(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x_{i}-\mu)^2}{2\sigma^{2}}}\Bigr) \end{equation}

To do so, we can set \(\frac{\partial L}{\partial\mu} = 0\) to find \(\mu\) and \(\frac{\partial L}{\partial\sigma} = 0\) to find \(\sigma\).

\begin{align} 0 &= \frac{\partial}{\partial\mu}\sum_{i=1}^{N} \ln\Bigl(\frac{1}{\sqrt{2\pi\sigma^{2}}}e^{-\frac{(x_{i}-\mu)^2}{2\sigma^{2}}}\Bigr) \\ &= \frac{\partial}{\partial\mu}\sum_{i=1}^{N} \Bigl(\ln\frac{1}{\sqrt{2\pi\sigma^{2}}} - \frac{(x_{i}-\mu)^2}{2\sigma^{2}}\Bigr) \\ &= \sum_{i=1}^{N} \frac{x_{i}-\mu}{\sigma^{2}} \\ 0 &= \sum_{i=1}^{N} \bigl(x_{i} - \mu\bigr) \\ 0 &= \sum_{i=1}^{N} x_{i} - N\mu \\ N\mu &= \sum_{i=1}^{N} x_{i} \\ \mu &= \frac{1}{N}\sum_{i=1}^{N} x_{i} \end{align}
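
To sanity-check this result, one can maximize the Gaussian log-likelihood numerically and compare against the closed-form estimates; the sketch below uses synthetic data and assumed names, so it only illustrates the derivation above:

```python
# Hedged sketch: numerically maximize L(mu, sigma) and compare with the closed forms.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=500)          # synthetic 1-D samples

def neg_log_likelihood(params):
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

res = minimize(neg_log_likelihood, x0=[0.0, 1.0], method="Nelder-Mead")
mu_hat, sigma_hat = res.x

print(mu_hat, x.mean())            # both close to the sample mean
print(sigma_hat**2, x.var())       # both close to the (1/N) sample variance
```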

If we do the same for \(\sigma\), we find the formula of the variance. The problem is that a single \(x\) is itself a vector of features \((x_{1}, \dots, x_{k})\), so we need a probability distribution for each feature. In the naive version, we assume that each marginal probability distribution \(P(X_{d} = x_{d}|Y = c)\) is independent of the others, so the overall likelihood is computed as a product:

\begin{equation} \phi_{c}(x) = p(X=x|Y=c) = \prod_{d=1}^{k}p(X_{d} = x_{d} | Y = c) = \prod_{d=1}^{k}\phi_{cd}(x_{d}) \end{equation}

Now we can write the formula of the classification function which, given a \(k\)-dimensional sample \(x\), returns the predicted class it belongs to:

\begin{equation} f_{NB}(x) = \arg\max_{c}\pi_{c}\prod_{d=1}^{k}\phi_{cd}(x_{d}) \end{equation}

The marginal distributions are usually modeled as a normal density:

\begin{equation} \phi_{cd}(x_{d}) = P(X_{d} = x_{d} | Y = c) = N(x_{d}; \mu_{dc}, \sigma^{2}_{dc}) \end{equation}

If we use the normal distribution, the whole point of the learning process is to compute \(\mu_{dc}\) and \(\sigma^{2}_{dc}\) for every feature \(d\) and class \(c\) from the dataset:

\begin{align} \mu_{dc} &= \frac{\sum_{i=1}^{n}[y_{i}=c]\,x_{di}}{\sum_{i=1}^{n}[y_{i}=c]} \\ \sigma^{2}_{dc} &= \frac{\sum_{i=1}^{n}[y_{i}=c]\,(x_{di}-\mu_{dc})^{2}}{\sum_{i=1}^{n}[y_{i}=c]} \end{align}
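
Putting the pieces of this subsection together, a minimal Gaussian Naive Bayes sketch could look like the following; the array shapes, variable names and the toy data are assumptions made for illustration, not a reference implementation:

```python
# Hedged sketch: Gaussian Naive Bayes, fit (pi_c, mu_dc, sigma^2_dc) and predict f_NB(x).
import numpy as np
from scipy.stats import norm

def fit(X, y):
    """X: (n, k) feature matrix, y: (n,) class labels."""
    params = {}
    n = len(y)
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "pi": len(Xc) / n,         # pi_c
            "mu": Xc.mean(axis=0),     # mu_dc for every feature d
            "var": Xc.var(axis=0),     # sigma^2_dc (add a small epsilon in practice to avoid zeros)
        }
    return params

def predict(params, x):
    """argmax_c of log pi_c + sum_d log N(x_d; mu_dc, sigma^2_dc)."""
    def score(c):
        p = params[c]
        return np.log(p["pi"]) + np.sum(norm.logpdf(x, loc=p["mu"], scale=np.sqrt(p["var"])))
    return max(params, key=score)

# Toy usage with made-up data:
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]])
y = np.array([0, 0, 1, 1])
model = fit(X, y)
print(predict(model, np.array([1.1, 1.9])))   # expected: 0
```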

1.4 Geometric Interpretation

Let's suppose we are in a binary case where we want to assign either label 0 or 1 to each point. The decision boundary is the set of points where the two class scores are equal:

\begin{align} \pi_{0} \prod_{d=1}^{D}\phi_{0d}(x_{d}) &= \pi_{1} \prod_{d=1}^{D}\phi_{1d}(x_{d}) \\ \log\Bigl(\pi_{0} \prod_{d=1}^{D}\phi_{0d}(x_{d})\Bigr) &= \log\Bigl(\pi_{1} \prod_{d=1}^{D}\phi_{1d}(x_{d})\Bigr) \\ \log\pi_{0} + \sum_{d=1}^{D}\log\phi_{0d}(x_{d}) &= \log\pi_{1} + \sum_{d=1}^{D}\log\phi_{1d}(x_{d}) \\ \vdots \\ \log\pi_{0} + \sum_{d=1}^{D}\Bigl(-\frac{1}{2}\log(2\pi\sigma_{0d}^{2})-\frac{(x_{d}-\mu_{0d})^2}{2\sigma^{2}_{0d}}\Bigr) &= \log\pi_{1} + \sum_{d=1}^{D}\Bigl(-\frac{1}{2}\log(2\pi\sigma_{1d}^{2})-\frac{(x_{d}-\mu_{1d})^2}{2\sigma^{2}_{1d}}\Bigr) \end{align}

Which is a quadratic function of \(x\).
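
As a rough 1-D illustration (all parameters are made-up assumptions), one can evaluate the two log-scores on a grid and look for the points where their difference changes sign, i.e. the decision boundary:

```python
# Hedged sketch: locate the 1-D decision boundary numerically (illustrative parameters).
import numpy as np
from scipy.stats import norm

pi0, mu0, var0 = 0.5, 0.0, 1.0
pi1, mu1, var1 = 0.5, 2.0, 4.0

def log_score(x, pi, mu, var):
    # log pi_c + log N(x; mu, var), i.e. the log of the unnormalized posterior
    return np.log(pi) + norm.logpdf(x, loc=mu, scale=np.sqrt(var))

xs = np.linspace(-5.0, 5.0, 1001)
diff = log_score(xs, pi0, mu0, var0) - log_score(xs, pi1, mu1, var1)
crossings = xs[np.where(np.diff(np.sign(diff)) != 0)[0]]
print(crossings)   # with unequal variances there can be two crossing points
```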