Christopher M. Bishop

The field of pattern recognition has undergone substantial development over the years. This book reflects these developments while providing a grounding in the basic concepts of pattern recognition and machine learning. It is aimed at advanced undergraduates or first year PhD students, as well as researchers and practitioners.

Mentioned in questions and answers.

- What is **machine learning**?
- What does machine learning **code** do?
- When we say that a machine learns, does it modify its own **code**, or does it modify a history (database) containing the experience of the code for a given set of inputs?

Machine learning is a field of computer science, probability theory, and optimization theory which allows complex tasks to be solved for which a logical/procedural approach would not be possible or feasible.

There are several different categories of machine learning, including (but not limited to):

- Supervised learning
- Reinforcement learning

**Supervised Learning**

In supervised learning, you have some really complex function (mapping) from inputs to outputs, you have lots of examples of input/output pairs, but you don't know what that complicated function is. A supervised learning algorithm makes it possible, given a large data set of input/output pairs, to predict the output value for some new input value that you may not have seen before. The basic method is to break the data set down into a training set and a test set. You have some model with an associated error function which you try to minimize over the training set, and then you verify that your solution works on the test set. Once you have repeated this with different machine learning algorithms and/or parameters until the model performs reasonably well on the test set, you can attempt to use the result on new inputs. Note that in this case, the program does not change; only the model (data) changes. One could, in theory, output a different program, but that is not done in practice, as far as I am aware. An example of supervised learning is the digit recognition system used by the post office, which maps the pixels to labels in the set 0...9 using a large set of pictures of digits that were labeled by hand as being in 0...9.
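
As a rough sketch of that loop (the toy data set and the 1-nearest-neighbor "model" below are my own illustrative choices, not part of the original example):

```python
# Minimal sketch of the supervised-learning loop: split labeled data
# into train/test sets, fit a model on the training set, and measure
# its error on the held-out test set.

def nearest_neighbor_predict(train, x):
    """Predict the label of x as the label of its closest training point."""
    closest = min(train, key=lambda pair: abs(pair[0] - x))
    return closest[1]

# Toy labeled data: inputs below 5 are labeled 0, the rest 1.
data = [(0.5, 0), (1.2, 0), (2.8, 0), (4.1, 0),
        (5.9, 1), (7.3, 1), (8.8, 1), (9.4, 1)]

train, test = data[::2], data[1::2]   # a simple train/test split

errors = sum(nearest_neighbor_predict(train, x) != y for x, y in test)
test_error = errors / len(test)
print(test_error)  # 0.0 on this easily separable toy set
```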

**Reinforcement Learning**

In reinforcement learning, the program is responsible for making decisions, and it periodically receives some sort of reward/utility for its actions. However, unlike in the supervised learning case, the results are not immediate; the algorithm could prescribe a large sequence of actions and only receive feedback at the very end. In reinforcement learning, the goal is to build up a good model such that the algorithm will generate the sequence of decisions that leads to the highest long-term utility/reward. A good example of reinforcement learning is teaching a robot how to navigate by giving a negative penalty whenever its bump sensor detects that it has bumped into an object. If coded correctly, it is possible for the robot to eventually correlate its range finder sensor data with its bumper sensor data and the directions it sends to the wheels, and ultimately choose a form of navigation that results in it not bumping into objects.
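
As a toy illustration of the bump-penalty idea (everything here, from the 1-D corridor to the reward numbers and learning rate, is an illustrative assumption rather than a real robot), tabular Q-learning learns to avoid the wall:

```python
import random

# Agent on a 1-D corridor of cells 0..4: bumping into the left wall
# costs -1, reaching the goal at cell 4 earns +1. Q-learning builds a
# table mapping (state, action) to expected long-term reward.

random.seed(0)
N, GOAL = 5, 4
Q = {(s, a): 0.0 for s in range(N) for a in (-1, +1)}
alpha, gamma, eps = 0.5, 0.9, 0.2

def step(s, a):
    s2 = s + a
    if s2 < 0:                 # bump sensor fires: penalty, stay put
        return s, -1.0, False
    if s2 == GOAL:             # reached the goal
        return s2, +1.0, True
    return s2, 0.0, False

for _ in range(200):           # training episodes
    s, done = 0, False
    while not done:
        a = random.choice((-1, +1)) if random.random() < eps else \
            max((-1, +1), key=lambda act: Q[(s, act)])
        s2, r, done = step(s, a)
        best_next = 0.0 if done else max(Q[(s2, -1)], Q[(s2, +1)])
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s2

# The learned greedy policy avoids the wall: "right" beats "left".
policy = {s: max((-1, +1), key=lambda act: Q[(s, act)]) for s in range(GOAL)}
print(policy)  # every cell should prefer +1 (move right)
```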

**More Info**

If you are interested in learning more, I strongly recommend that you read Pattern Recognition and Machine Learning by Christopher M. Bishop or take a machine learning course. You may also be interested in reading, for free, the lecture notes from CIS 520: Machine Learning at Penn.

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What introductory subjects would you recommend?

**Example**: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific languages: Lisp was created "as an afterthought" for doing AI research, while Prolog (with its roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell), as they are very suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, **Stanford CS 229: Machine Learning** is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with the basic probability theory. Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

- Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
- Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
- Richard Sutton and Andrew Barto, Reinforcement Learning: An introduction. MIT Press, 1998

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course **Stanford CS 224: Natural Language Processing** includes all the lectures online and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

- Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
- James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2ed.
- Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley. (this is available online for **free**)
- Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same textbooks. The required artificial intelligence course is also available online, along with all the lecture notes, and uses:

- S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Second Edition

This is the standard Artificial Intelligence text and is also worth reading.

I use **R** for machine learning myself and really recommend it. For this, I would suggest looking at **The Elements of Statistical Learning**, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online that was also published (by O'Reilly, I think).

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though:

1) It produces several false positives

2) If he finds a crack, he only finds small sections of it and denotes those sections as being separate cracks.

My job is to write software that will read this data, analyze it, and tell the difference between false-positives and actual cracks. I also need to determine how to group together all the small sections of a crack as one.

I have tried various ways of filtering the data to eliminate false-positives, and have been using neural networks to a limited degree of success to group cracks together. I understand there will be error, but as of now, there is just too much error. Does anyone have any insight for a non-AI expert as to the best way to accomplish my task or learn more about it? What kinds of books should I read, or what kind of classes should I take?

**EDIT** My question is more about how to notice patterns in my coworker's data and identify those patterns as actual cracks. It's the higher-level logic that I'm concerned with, not so much the low-level logic.

**EDIT** In all actuality, it would take AT LEAST 20 sample images to give an accurate representation of the data I'm working with. It varies a lot. But I do have a sample here, here, and here. These images have already been processed by my coworker's process. The red, blue, and green data is what I have to classify (red stands for dark crack, blue stands for light crack, and green stands for a wide/sealed crack).

You should read about data mining, especially pattern mining.

Data mining is the process of extracting patterns from data. As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

A good book on the subject is Data Mining: Practical Machine Learning Tools and Techniques

Basically, what you have to do is apply statistical tools and methodologies to your data sets. The most commonly used comparison methodologies are Student's *t-test* and the *chi-squared test*, which check whether two variables are related with some level of confidence.
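
As a small illustration of the chi-squared test (the 2x2 contingency table below, relating reported and confirmed cracks, is made up for the example):

```python
# Chi-squared independence test on a made-up 2x2 contingency table,
# e.g. "detector reports a crack" vs. "inspector confirms a crack".

table = [[30, 10],   # rows: detector says crack / no crack
         [20, 40]]   # cols: inspector confirms / rejects

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# chi2 = sum over cells of (observed - expected)^2 / expected,
# where expected = row_total * col_total / grand_total.
chi2 = sum((table[i][j] - row_totals[i] * col_totals[j] / grand) ** 2
           / (row_totals[i] * col_totals[j] / grand)
           for i in range(2) for j in range(2))
print(round(chi2, 3))  # 16.667, well above 3.84 (the p=0.05 cutoff
                       # at one degree of freedom), so the two
                       # variables are related with high confidence
```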

What’s the best approach to recognize patterns in data, and what’s the best way to learn more on the topic?

The best approach is to study pattern recognition and machine learning. I would start with Duda's Pattern Classification and use Bishop's Pattern Recognition and Machine Learning as a reference. It will take a good while for the material to sink in, but getting a basic sense of pattern recognition and the major approaches to the classification problem should give you direction. I can sit here and make some assumptions about your data, but honestly you probably have the best idea about the data set, since you've been dealing with it more than anyone. Some useful techniques, for instance, are support vector machines and boosting.

**Edit**: An interesting application of boosting is real-time face detection. See Viola/Jones's Rapid Object Detection using a Boosted Cascade of Simple
Features (pdf). Also, looking at the sample images, I'd say you should try improving the edge detection a bit. Maybe smoothing the image with Gaussian and running more aggressive edge detection can increase detection of smaller cracks.
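
As a rough sketch of that "smooth, then edge-detect" suggestion (in plain Python rather than a real image library; the tiny synthetic image, kernel sizes, and threshold are all my own assumptions):

```python
# Smooth a tiny grayscale image (a dark 1-pixel-wide "crack" on a
# bright background) with a 3x3 Gaussian kernel, then run a Sobel
# horizontal-gradient filter and threshold the response.

def convolve(img, kernel):
    """Valid-mode 2-D cross-correlation (what image libraries usually
    call convolution; identical here since the kernels are symmetric
    up to sign and we take absolute values of the response)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(w)] for y in range(h)]

gauss = [[k / 16 for k in row] for row in [[1, 2, 1], [2, 4, 2], [1, 2, 1]]]
sobel_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]

# 8x8 image: background intensity 200, dark crack (50) in column 4.
img = [[50 if x == 4 else 200 for x in range(8)] for y in range(8)]

smoothed = convolve(img, gauss)       # the blur spreads the thin crack
edges = convolve(smoothed, sobel_x)   # horizontal gradient response
crack_cols = {x for row in edges for x, v in enumerate(row) if abs(v) > 100}
print(sorted(crack_cols))  # [0, 1, 3]: edge responses flanking the crack
```

In practice you would use OpenCV's `GaussianBlur` and `Canny` instead of hand-rolled kernels; the point is only that smoothing first makes the gradient response of thin cracks wider and easier to pick up.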

I would like some expert guidance here on the best approach for me to solve a problem. I have investigated some machine learning, neural networks, and things like that. I've investigated Weka, some sort of Bayesian solution, R, several different things. I'm not sure how to really proceed, though. Here's my problem.

I have, or will have, a large collection of events.. eventually around 100,000 or so. Each event consists of several (30-50) independent variables, and 1 dependent variable that I care about. Some independent variables are more important than others in determining the dependent variable's value. And, these events are time relevant. Things that occur today are more important than events that occurred 10 years ago.

I'd like to be able to feed some sort of learning engine an event, and have it predict the dependent variable. Then, knowing the real answer for the dependent variable for this event (and all the events that have come along before), I'd like for that to train subsequent guesses.

Once I have an idea of what programming direction to go, I can do the research and figure out how to turn my idea into code. But my background is in parallel programming and not stuff like this, so I'd love to have some suggestions and guidance on this.

Thanks!

Edit: Here's a bit more detail about the problem that I'm trying to solve: It's a pricing problem. Let's say that I'm wanting to predict prices for a random comic book. Price is the only thing I care about. But there are lots of independent variables one could come up with. Is it a Superman comic, or a Hello Kitty comic. How old is it? What's the condition? etc etc. After training for a while, I want to be able to give it information about a comic book I might be considering, and have it give me a reasonable expected value for the comic book. OK. So comic books might be a bogus example. But you get the general idea. So far, from the answers, I'm doing some research on Support vector machines and Naive Bayes. Thanks for all of your help so far.

Sounds like you're a candidate for Support Vector Machines.

Go get libsvm. Read "A practical guide to SVM classification", which they distribute, and which is short.

Basically, you're going to take your events, and format them like:

```
dv1 1:iv1_1 2:iv1_2 3:iv1_3 4:iv1_4 ...
dv2 1:iv2_1 2:iv2_2 3:iv2_3 4:iv2_4 ...
```

run it through their svm-scale utility, and then use their grid.py script to search for appropriate kernel parameters. The learning algorithm should be able to figure out differing importance of variables, though you might be able to weight things as well. If you think time will be useful, just add time as another independent variable (feature) for the training algorithm to use.
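
If it helps, writing events into that format is only a few lines of code; here is one way to do it (the comic-book feature values are made-up examples, not part of the libsvm distribution):

```python
# Produce one libsvm-format line per event: the dependent variable
# first, then 1-indexed index:value pairs for each independent variable.

def to_libsvm_line(dv, ivs):
    """Format one event as 'dv 1:iv_1 2:iv_2 ...'."""
    return " ".join([str(dv)] + [f"{i}:{v}" for i, v in enumerate(ivs, start=1)])

# e.g. a comic book priced at 12.5 with three numeric features
# (age in years, condition score, popularity score -- all assumed):
line = to_libsvm_line(12.5, [35, 0.8, 0.6])
print(line)  # 12.5 1:35 2:0.8 3:0.6
```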

If libsvm can't quite get the accuracy you'd like, consider stepping up to SVMlight. Only ever so slightly harder to deal with, and a lot more options.

Bishop's Pattern Recognition and Machine Learning is probably the first textbook to look to for details on what libsvm and SVMlight are actually doing with your data.

I wanted to ask Stack Overflow users for a nice idea for a project that could entertain a fellow student programmer during a semester. Computer vision might look interesting, although I couldn't say if a project on that field is something that could be achievable in 4 months. What do you think?

Can't tell without knowing more about you, your friend, and the project. My guess is "no".

I'd point you towards two sources. The first is Peter Norvig's "Artificial Intelligence"; the second is "Programming Collective Intelligence". Maybe they'll inspire you.

There is a story that, during the early days of AI research when significant progress was being made on "hard" logic problems via mechanical theorem provers, a professor assigned one of his graduate students the "easy" problem of solving how vision provided meaningful input to the brain. Obviously, things turned out to be far more difficult than the professor anticipated. So, no, not *vision* in the general sense.

If you are just starting in AI, there are a couple of directions. The classic AI problems - logic puzzles - are solved with a mechanical theorem prover (usually written in Lisp - see here for the classic text on solving logical puzzles). If you don't want to create your own, you can pick up a copy of Prolog (it is essentially the same thing).

You can also go with pattern recognition problems although you'll want to keep the initial problems pretty simple to avoid getting swamped in detail. My dissertation involved the use of stochastic processes for letter recognition in free-floating space so I'm kind of partial to this approach (don't start with stochastic processes though, unless you really like math). Right next door is the subfield of neural networks. This is popular because you almost can't learn NN without building some interesting projects. Across this entire domain (pattern processing), the cool thing is that you can solve real problems rather than toy puzzles.

A lot of people like Natural Language Processing as it is easy to get started but almost infinite in complexity. One very definable problem is to build an NLP program for processing language in a specific domain (discussing a chess game, for example). This makes it easy to see progress while still being complex enough to fill a semester.

Hope that gives you some ideas!

I've recently come across Intelligent Agents by reading this book: link text

I'm interested in finding a good book for beginners, so I can start to implement such a system. I've also tried reading "Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence" (can't find it on amazon), but it's not what I'm looking for. Thanks for the help :).

There are numerous classical books:

- David MacKay's classic (for free here)
- Norvig's AIMA, of which a new version came out recently
- Bishop's Neural Networks for Pattern Recognition
- Bishop's Pattern Recognition and Machine Learning

The first two are the easiest, and the second one covers more than machine learning. However, there is little "pragmatic" or "engineering" material in them, and the math is quite demanding, but so is the whole field. I guess you will do best with O'Reilly's Programming Collective Intelligence, because it has its focus on programming.

I am working on a project of recognizing TV channels. I am taking photos of the channels such that I try to avoid the background and take the sample from the center of the logo. I recognize 4 different logos; here are the templates:

**How does my template matching algorithm work:**

Given 4 templates of size 100x100, each representing a different TV channel and each having a different (probability) threshold.
The user captures the logo from the TV set, and then the algorithm is:
- Run 4 independent template matchings, one against each template, to receive the probability that each template matches the captured image.
- For every channel, if the probability of the channel is lower than the threshold of the channel, then the probability is set to 0.
- Announce the recognized logo to be the one with the highest probability. If all probabilities are 0, announce "no recognition".

For example, if I got one channel with a probability of 0.85 and a threshold of 0.9, and a second channel with a probability of 0.8 and a threshold of 0.75, then the second channel "wins".
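
For what it's worth, the decision rule described above is easy to prototype; a minimal sketch using the numbers from the example itself:

```python
# Per-channel threshold rule: zero out sub-threshold match scores,
# then announce the channel with the highest surviving score.

def recognize(probs, thresholds):
    """Return the winning channel index, or None for "no recognition"."""
    scores = [p if p >= t else 0.0 for p, t in zip(probs, thresholds)]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return None if scores[best] == 0.0 else best

# Channel 0: prob 0.85 < threshold 0.9 (zeroed); channel 1: 0.8 >= 0.75.
print(recognize([0.85, 0.80], [0.90, 0.75]))  # 1 -- the second channel wins
print(recognize([0.50, 0.60], [0.90, 0.75]))  # None -- nothing passes
```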

When I take a photo of one of the logos, 95% of the time it recognizes the photo.

**Current results:**

- When trying to detect the first ("smiling face") logo, out of 10 detections I got 10 correct detections. For the template matching between the correct template and the image I get probabilities between 0.91 and 0.94. For the other logos I get probabilities between 0.77 and 0.91.
- When trying to detect the second ("green") logo, out of 10 detections I got 10 correct detections. For the template matching between the correct template and the image I get probabilities between 0.78 and 0.91. For the other logos I get probabilities between 0.71 and 0.83 (but because of their high thresholds, the detection succeeds).
- When trying to detect the third ("round") logo, out of 10 detections I got 9 correct detections. For the template matching between the correct template and the image I get probabilities between 0.83 and 0.92. For the other logos I get probabilities between 0.73 and 0.91.
- When trying to detect the fourth ("black and white") logo, out of 10 detections I got 10 correct detections. For the template matching between the correct template and the image I get probabilities between 0.91 and 0.94. For the other logos I get probabilities between 0.78 and 0.92.
- When trying to detect a "negative" image, I often get a logo detection (which is bad). If I take, for example, an image of a completely white sheet, it detects the first, third, and fourth logos with probabilities of over 0.9.

How can I improve my algorithm, or change it, to get better results on "negative" images?

Thanks for helping,

Eyal

It all depends on how you are calculating the channel probabilities from the templates. Are you using histogram of color or histogram of gradient, and then looking at the histogram difference between your templates and the test images?

Another approach would be to compute feature vectors from test images, such as the concatenation of a histogram of gradients and a histogram of color. Then, manually create a training database, in which you know the label (1, 2, 3, or 4 depending on which label is visible in the image) and you can feed the hard-coded labels, along with the histogram features, into a classifier routine. I recommend LIBSVM for this, and the scikits.learn implementation is easy to use for this in Python.

This will yield a support vector machine classifier that will compare the feature vector of new images with the support vectors from the training set, and determine the right label that is most likely present in the image. You can then fit a logistic model over top of this SVM if you want something that yields probabilities rather than just predicted labels.
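
As a tiny, hedged illustration of building such a concatenated feature vector (using a 1-D row of intensities in place of a real image; the data and bin counts are made up), before handing it to an SVM:

```python
# Concatenate a histogram of intensities ("color") with a histogram
# of gradient magnitudes to form one feature vector per image.

def histogram(values, bins, lo, hi):
    """Count values into equal-width bins over [lo, hi)."""
    counts = [0] * bins
    for v in values:
        k = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[k] += 1
    return counts

row = [10, 12, 11, 200, 205, 198, 15, 13]           # toy intensities
grads = [abs(b - a) for a, b in zip(row, row[1:])]  # gradient magnitudes

feature_vector = histogram(row, 4, 0, 256) + histogram(grads, 4, 0, 256)
print(feature_vector)  # [5, 0, 0, 3, 5, 0, 2, 0]
```

This vector, paired with its hand-assigned label (1-4), is what you would feed into the classifier routine (e.g. LIBSVM).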

Two good books to read to get started in this type of machine learning are Pattern Classification, by Duda, Hart, and Stork, and Pattern Recognition and Machine Learning by Bishop.

Some messy Python code that I wrote for implementing Poselets and Histogram of Oriented Gradients in Python can be found linked here; maybe you can grab some sections of code in there and it will be suitable for your task.

I am looking for some entry level posts about machine learning. Can anyone suggest anything for somebody new to this subject?

I think the best that I know of are:

Stanford's Lectures on Machine Learning

Books: (In decreasing order of ease of understanding - IMHO)

Machine Learning: An Algorithmic Perspective by Stephen Marsland

Pattern Recognition and Machine Learning by Christopher Bishop

An autoencoder actually reconstructs the original input, and it also helps with dimensionality reduction, since the number of hidden neurons is smaller than the number of input neurons. My question is: how are the output values generated from the hidden neuron values? **What is the mathematical formula used to calculate the final output values (starting from input to hidden, and from hidden to output)?** Please, can anyone help me with this? I have tried it mathematically, but I am not getting outputs that are the same as the input values.

There isn't a set, single way to feed-forward neural nets; it's a general technique. One popular choice is `logistic(W*In)`, where `W*In` is the dot product of the node's weights and the input nodes' activations, and `logistic(x) = 1/(1+e^-x)`. There are many, many subtleties to applying this method, and the "meat" of the technique is how you determine / train the weights `W` for each node. I recommend getting a good text on machine learning / neural networks (even if it's not specifically talking about autoencoders, the general techniques used for multi-layer nets will be similar): http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738/ref=zg_bs_3894_3
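
To make that concrete, here is a minimal sketch of the forward pass for a tiny 3-2-3 autoencoder with made-up, untrained weights (which is also why an untrained net will not reproduce its input: the weights `W` first have to be trained, e.g. by backpropagation):

```python
import math

# Forward pass: hidden = logistic(W1*x + b1), output = logistic(W2*hidden + b2).

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(W, b, inputs):
    """One layer: logistic of each node's weight-input dot product plus bias."""
    return [logistic(sum(w * v for w, v in zip(row, inputs)) + bi)
            for row, bi in zip(W, b)]

x = [0.9, 0.1, 0.4]                                              # 3 inputs
W1, b1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]        # 2 hidden units
W2, b2 = [[1.0, -0.3], [0.2, 0.6], [-0.4, 0.9]], [0.0, 0.0, 0.0] # 3 outputs

hidden = layer(W1, b1, x)   # 3 -> 2: the compressed code
output = layer(W2, b2, hidden)  # 2 -> 3: the attempted reconstruction
print([round(v, 3) for v in output])
```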

I am working with some points which are packed very close together, and therefore forming clusters among them is proving very difficult. Since I am new to this concept, I read in a paper about the concept of Gaussian-weighting the points randomly, or rather resampling using Gaussian weights.

My question here is: how are Gaussian weights applied to the data points? Is it the actual normal distribution, where I have to compute the mean, the variance, and the standard deviation and then randomly sample, or are there other ways to do it? I am confused about this concept.

Can I get some hints on the concept, please?

I think you should look at this book: http://www.amazon.com/Pattern-Recognition-Learning-Information-Statistics/dp/0387310738 There are good chapters on modeling point distributions.
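
For what it's worth, one plausible reading of "resampling using Gaussian weights" (an assumption on my part, not something confirmed by the paper you mention) is: weight each point by a normal density centred on the sample mean, normalise the weights, and resample points in proportion to them:

```python
import math
import random

# Gaussian weighting of 1-D toy data: points near the sample mean get
# high weights, an outlier gets a tiny weight, and resampling with
# these weights mostly redraws points from the dense cluster.

random.seed(1)
points = [1.0, 1.2, 1.1, 5.0, 1.3, 0.9]          # toy data with one outlier
mean = sum(points) / len(points)
var = sum((p - mean) ** 2 for p in points) / len(points)

weights = [math.exp(-(p - mean) ** 2 / (2 * var)) for p in points]
total = sum(weights)
weights = [w / total for w in weights]            # normalise to sum to 1

resampled = random.choices(points, weights=weights, k=len(points))
print(resampled)  # the outlier 5.0 is very unlikely to be drawn
```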

I'm estimating the parameters of a GMM using EM:

When I use my Matlab script and run the EM code, I get a **single value of "log-likelihood"**.

However, in OpenCV the output of **EM.train** gives a matrix which contains the **log-likelihood value of every sample**. How do I get a single log-likelihood value? (Do I need to take the minimum of all the log-likelihood values of all samples, or the sum of all the log-likelihood values?)

Hope someone can help me out!

You need the sum of the log probabilities of the data points you use to estimate the probability density function. That gives you the log-likelihood of your estimate.

You can find a good explanation in the book "Pattern Recognition and Machine Learning".
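
In code, that is one line; the per-sample values below are made up for illustration:

```python
import math

# The single figure Matlab reports is the SUM of the per-sample
# log-likelihoods that OpenCV's EM.train returns (not their minimum).

per_sample_loglik = [-2.31, -1.87, -3.05, -2.44]   # one entry per data point
total_loglik = sum(per_sample_loglik)
print(round(total_loglik, 2))  # -9.67

# Equivalently, it is the log of the product of the sample likelihoods:
assert math.isclose(total_loglik,
                    math.log(math.prod(math.exp(v) for v in per_sample_loglik)))
```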

I'm looking for a really good tutorial on machine learning for text classification perhaps using Support vector machine (SVM) or other appropriate technology for large-scale supervised text classification. If there isn't a great tutorial, can anyone give me pointers to how a beginner should get started and do a good job with things like feature detection for English language Text Classification.

Books, articles, anything that can help beginners get started would be super helpful!

In its classical flavour the Support Vector Machine (SVM) is a binary classifier (i.e., it solves classification problems involving two classes). However, it can also be used to solve multi-class classification problems by applying techniques like One versus One, One versus All, or Error Correcting Output Codes [Alwein et al.]. More recently, a modification of the classical SVM, the multi-class SVM, allows multi-class classification problems to be solved directly [Crammer et al.].

Now, as far as document classification is concerned, your main problem is feature extraction (i.e., how to acquire certain classification features from your documents). This is not a trivial task, and there is a large body of literature on the topic (e.g., [Rehman et al.], [Lewis]).

Once you've overcome the obstacle of feature extraction, and have labeled your document samples and placed them in a feature space, you can apply any classification algorithm, like SVMs, AdaBoost, etc.
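
To make the feature-extraction step concrete, here is a minimal bag-of-words sketch (the two tiny "documents" and their labels are made-up examples): each document becomes a count vector over a shared vocabulary, which is exactly the kind of feature vector an SVM or AdaBoost consumes.

```python
# Bag-of-words feature extraction: build a vocabulary from the corpus,
# then map each document to a vector of word counts over it.

docs = ["cheap pills buy now", "meeting agenda for monday"]
labels = ["spam", "ham"]

vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc):
    vec = [0] * len(vocab)
    for w in doc.split():
        if w in index:          # words outside the vocabulary are ignored
            vec[index[w]] += 1
    return vec

features = [bag_of_words(d) for d in docs]
print(vocab)
print(features)  # each row pairs with the matching entry in `labels`
```

In practice you would weight the counts (e.g. tf-idf) rather than use raw frequencies, but the shape of the pipeline is the same.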

Introductory books on machine learning: [Flach], [Mohri], [Alpaydin], [Bishop], [Hastie]

Books specific for SVMs: [Schölkopf], [Cristianini]

Some specific bibliography on document classification and SVMs: [Miner et al.], [Srivastava et al.], [Weiss et al.], [Pilászy], [Joachims], [Joachims01], [Joachims97], [Sassano]