Machine Learning: A Probabilistic Perspective

Kevin P. Murphy

A comprehensive introduction to machine learning that uses probabilistic models and inference as a unifying approach.


Given a posterior p(Θ|D) over some parameters Θ, one can define the following:

Highest Posterior Density Region:

The Highest Posterior Density Region is the set of most probable values of Θ that, in total, constitute 100(1-α) % of the posterior mass.

In other words, for a given α, we look for a p* that satisfies:

$$1 - \alpha = \int_{\{\Theta \,:\, p(\Theta \mid D) > p^*\}} p(\Theta \mid D)\, d\Theta$$

and then obtain the Highest Posterior Density Region as the set:

$$C_{\alpha}(D) = \{\Theta : p(\Theta \mid D) \ge p^*\}$$

Central Credible Region:

Using the same notation as above, a Credible Region (or interval) is defined as:

$$C_{\alpha}(D) = (\ell, u) \quad \text{such that} \quad P(\ell \le \Theta \le u \mid D) = 1 - \alpha$$

Depending on the distribution, there could be many such intervals. The central credible interval is defined as the credible interval that leaves α/2 of the posterior mass in each tail.

Computation:

  • For general distributions, given samples from the distribution, are there any built-ins in Python or PyMC to obtain the two quantities above?

  • For common parametric distributions (e.g. Beta, Gaussian, etc.) are there any built-ins or libraries to compute this using SciPy or statsmodels?

From my understanding, the "central credible region" is not calculated any differently from a confidence interval: all you need is the inverse of the CDF at alpha/2 and 1 - alpha/2. In SciPy this is called ppf (percent point function). So, for a Gaussian posterior distribution:

>>> from scipy.stats import norm
>>> alpha = .05
>>> l, u = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)

To verify that [l, u] covers (1 - alpha) of the posterior mass:

>>> norm.cdf(u) - norm.cdf(l)
0.94999999999999996

Similarly, for a Beta posterior with, say, a = 1 and b = 3:

>>> from scipy.stats import beta
>>> l, u = beta.ppf(alpha / 2, a=1, b=3), beta.ppf(1 - alpha / 2, a=1, b=3)

and again:

>>> beta.cdf(u, a=1, b=3) - beta.cdf(l, a=1, b=3)
0.94999999999999996

Here you can see the parametric distributions that are included in SciPy, and I believe all of them have a ppf function.

As for the highest posterior density region, it is trickier, since the pdf is not necessarily invertible, and in general such a region may not even be connected; for example, in the case of a Beta with a = b = .5 (as can be seen here).

But in the case of the Gaussian distribution, it is easy to see that the "Highest Posterior Density Region" coincides with the "Central Credible Region", and I think this is the case for all symmetric unimodal distributions (i.e. whenever the pdf is symmetric around the mode of the distribution).
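As a quick numerical sanity check of that claim for the standard normal (just a sketch; the alpha value is only for illustration): the central interval computed above has equal density at both endpoints, which is exactly the defining property of the HPD interval.

>>> from scipy.stats import norm
>>> alpha = .05
>>> l, u = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)
>>> abs(norm.pdf(l) - norm.pdf(u)) < 1e-12   # equal density at both endpoints
True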

A possible numerical approach for the general case would be a binary search over the value of p*, using numerical integration of the pdf and exploiting the fact that the integral is a monotone function of p*.
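For instance, here is a minimal sketch of that bisection idea for a Beta(2, 5) posterior (the distribution, grid and alpha below are purely illustrative choices of mine, and scipy.optimize.brentq plays the role of the binary search):

import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import brentq
from scipy.stats import beta

alpha = 0.05
dist = beta(2, 5)                       # illustrative posterior

# evaluate the pdf on a dense grid covering the support
xs = np.linspace(0, 1, 20001)
pdf = dist.pdf(xs)

def mass_above(p_star):
    # probability mass of the region where the pdf exceeds p_star
    return trapezoid(np.where(pdf > p_star, pdf, 0.0), xs)

# mass_above decreases monotonically from 1 (at p_star = 0) to 0 (at the
# maximum of the pdf), so a bracketing root finder can solve
# mass_above(p_star) = 1 - alpha
p_star = brentq(lambda p: mass_above(p) - (1 - alpha), 0.0, pdf.max())

# for this unimodal example the HPD region is a single interval
hpd = xs[pdf >= p_star]
print(p_star, hpd.min(), hpd.max())

For a multi-modal posterior you would instead collect every stretch of the grid where the pdf exceeds p*, since the region can consist of several disjoint intervals.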


Here is an example for a Gaussian mixture:

[ 1 ] The first thing you need is an analytical pdf function; for a Gaussian mixture that is easy:

import numpy as np

def mix_norm_pdf(x, loc, scale, weight):
    from scipy.stats import norm
    # weighted sum of the component Gaussian densities at x
    return np.dot(weight, norm.pdf(x, loc, scale))

So, for example, with location, scale and weight values as in

loc    = np.array([-1, 3])   # mean values
scale  = np.array([.5, .8])  # standard deviations
weight = np.array([.4, .6])  # mixture probabilities

you will get two nice Gaussian distributions holding hands:

[figure: plot of the resulting bimodal mixture density]


[ 2 ] Now you need an error function which, given a test value for p*, integrates the pdf over the region where it exceeds p* and returns the squared error from the desired value 1 - alpha:

def errfn(p, alpha, *args):
    from scipy import integrate

    def fn(x):
        # keep only the part of the pdf that lies above the threshold p
        pdf = mix_norm_pdf(x, *args)
        return pdf if pdf > p else 0

    # ideally the integration limits should not
    # be hard coded but inferred from the data
    lb, ub = -3, 6
    prob = integrate.quad(fn, lb, ub)[0]
    return (prob + alpha - 1.0)**2

[ 3 ] Now, for a given value of alpha, we can minimize the error function to obtain p*:

alpha = .05

from scipy.optimize import fmin
p = fmin(errfn, x0=0, args=(alpha, loc, scale, weight))[0]

This results in p* = 0.0450 and the HPD region shown below; the red area represents 1 - alpha of the distribution's mass, and the horizontal dashed line marks p*.

[figure: the mixture density with the HPD region shaded in red and p* marked by a horizontal dashed line]
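If you also want the region itself rather than just p*, one way (a sketch that reuses mix_norm_pdf, loc, scale, weight and the p found above, with the same hard-coded bounds as in errfn) is to scan a grid for the points where the density crosses the threshold:

import numpy as np

xs = np.linspace(-3, 6, 10001)   # same bounds as in errfn
density = np.array([mix_norm_pdf(x, loc, scale, weight) for x in xs])
inside = density >= p
# transitions into/out of the region give the interval endpoints
changes = np.flatnonzero(np.diff(inside.astype(int)))
print(xs[changes].reshape(-1, 2))  # one row per disjoint interval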

I am new to the domain of machine learning and I have noticed that there are a lot of algorithms/sets of algorithms that can be used: SVM, decision trees, naive Bayes, perceptron, etc. That is why I wonder which algorithm one should use for which issue. In other words, which algorithm solves which class of problems?

So my question is whether you know a good website or book that focuses on this algorithm-selection problem.

Any help would be appreciated. Thanks in advance.

Horace

It is very hard to answer the question "which algorithm for which issue?"

That ability comes with a lot of experience and knowledge, so I suggest you read a few good books about machine learning. The following book would probably be a good starting point.

Machine Learning: A Probabilistic Perspective

Once you have some knowledge about machine learning, you can work on a couple of simple machine learning problems. The Iris flower dataset is a good starting point: it consists of several features describing three Iris species. Initially, develop a simple machine learning model (such as logistic regression) to classify the Iris species, and gradually move on to more advanced models such as neural networks.
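As an illustration, here is a minimal sketch of such a baseline, assuming scikit-learn is available (the train/test split and solver settings are just illustrative choices):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 150 samples, 4 features, 3 Iris species
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000)   # simple baseline classifier
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))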

So in Matlab I perform PCA on handwritten digits. Essentially, I have, say, 30*30-pixel pictures, i.e. 900 pixels per image, and after PCA I consider the components which capture most of the variance, say the first 80 principal components (PCs) based on some threshold. Now these 80 PCs are also 900-dimensional, and when I plot them using imshow, I get images that look something like a 0, 6, 3, 5, etc. What is the interpretation of the first few of these PCs (out of the 80 I extracted)?

PCA extracts the most important information from the data set and compresses the data set by keeping only that information: the principal components.

The first principal component is constructed in such a way that it has the largest possible variance. The second component is computed under the constraint of being orthogonal to the first component while having the largest possible variance.

In your case the data is a set of images. Let's say you have 1000 images and you compute the first five principal components (5 images, constructed by the PCA algorithm). You may then represent any image either as 900 data points (30x30 pixels) or as a combination of those 5 images with the corresponding multiplication coefficients.

The goal of the PCA algorithm is to construct these 5 images (principal components) in such a way that the images in your data set are represented as accurately as possible by a combination of the given number of principal components.
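As a small illustration of that representation, here is a sketch in NumPy (the random array below merely stands in for 1000 flattened 30x30 images; the shapes and the use of SVD are my own assumptions, not your actual data or Matlab code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 900))      # stand-in for 1000 images of 30*30 pixels

# centre the data and take the top 5 principal components via SVD
mean = X.mean(axis=0)
U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
components = Vt[:5]                   # 5 "images", each with 900 pixels

# any image is then approximated by the mean plus a weighted sum of
# the 5 component images
coeffs = (X[0] - mean) @ components.T          # 5 multiplication coefficients
reconstruction = mean + coeffs @ components    # best 5-component approximation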

UPDATE:

Consider the image below (from the amazing book by Kevin Murphy). The image shows how points in 2 dimensions (red dots) are represented in 1 dimension (green crosses) by projecting them onto a vector (the purple line). That vector is the first principal component. The purpose of PCA is to build these vectors so as to minimize the reconstruction error. In your case these vectors can be visualized as images.

[figure from Murphy's book: 2-D points (red dots) projected onto the first principal component (purple line), with their 1-D projections shown as green crosses]

You may refer to this article for more details on using PCA for handwritten digit recognition.