Richard O. Duda, Peter E. Hart, David G. Stork

Mentioned 8

This edition has been completely revised, enlarged and formatted in two colour. It is a systematic account of the major topics in pattern recognition, based on the fundamental principles. It includes extensive examples, exercises and a solutions manual.

Mentioned in questions and answers.

I'm a programmer with a decent background in math and computer science. I've studied computability, graph theory, linear algebra, abstract algebra, algorithms, and a little probability and statistics (through a few CS classes) at an undergraduate level.

I feel, however, that I don't know enough about statistics. Statistics are increasingly useful in computing, with statistical natural language processing helping fuel some of Google's algorithms for search and machine translation, with performance analysis of hardware, software, and networks needing proper statistical grounding to be at all believable, and with fields like bioinformatics becoming more prevalent every day.

I've read about how "Google uses Bayesian filtering the way Microsoft uses the if statement", and I know the power of even fairly naïve, simple statistical approaches to problems from Paul Graham's A Plan for Spam and Better Bayesian Filtering, but I'd like to go beyond that.

I've tried to look into learning more statistics, but I've gotten a bit lost. The Wikipedia article has a long list of related topics, but I'm not sure which I should look into. I feel like from what I've seen, a lot of statistics makes the assumption that everything is a combination of factors that linearly combine, plus some random noise in a Gaussian distribution; I'm wondering what I should learn beyond linear regression, or if I should spend the time to really understand that before I move on to other techniques. I've found a few long lists of books to look at; where should I start?

So I'm wondering where to go from here; what to learn, and where to learn it. In particular, I'd like to know:

- What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?
- What kind of statistical methods should I spend my time learning?
- What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

To clarify what I am looking for, I am interested in what problems that *programmers* typically need to deal with can benefit from a statistical approach, and what kind of statistical tools can be useful. For instance:

- Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?
- More generally, artificial intelligence has been moving away from discrete, symbolic approaches and towards statistical techniques. What statistical AI approaches have the most to offer now, to the working programmer (as opposed to ongoing research that may or may not provide concrete results)?
- Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?
- Simulation of physical systems, such as in computer graphics, frequently involves a stochastic approach.
- Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

I'm surprised no one has mentioned a keen understanding of graphics as essential to good statistical practice. Machine learning and Bayesian analysis are great (try Gelman's book if you want a formal but approachable and applied introduction to Bayes), but you can get amazingly far at understanding a problem with really good visualizations. Tufte's classic is a good place to start, and the classic semiology and grammar of graphics books are worth a read. Finally, take a look at the R ggplot2 package for a simple way to begin implementing complex graphical ideas.

Interesting question. As a statistician whose interest is more and more aligned with computer science perhaps I could provide a few thoughts...

**Don't learn frequentist hypothesis testing.**While the bulk of my work is done in this paradigm, it doesn't match the needs of business or data mining. Scientists generally have specific hypotheses in mind, and might wish to gauge the probability that, given their hypothesis isn't true, the data would be as extreme as it is. This is rarely the type of answer a computer scientist wants.**Bayesian is useful**, even if you don't know why you are assuming the priors that you are using. A baysian analysis can give you a precise probability estimate for various contingencies, but it is important to realize that the only reason you have this precise estimate is because you made a fuzzy decision regarding the prior probability. (For those not in the know, with baysian inference, you can specify an arbitrary prior probability, and update this based on the data collected to get a better estimate).

**Machine learning and classification** might be a good place to get started. The machine learning literature is more focused on computer science problems, though it's mission is almost identical to that of statistics ( see: http://anyall.org/blog/2008/12/statistics-vs-machine-learning-fight/ ).

Since you spoke of large databases with large numbers of variables, here are a few algorithms that come in handy in this domain.

- adaboost: If you have a large number of crappy classifiers, and want to make one good classifier. (see also logit boost)
- Support Vector Machines: A powerful and flexible classifier. Can learn non-linear patterns (okay linear in the non-linear kernel space if you want to be picky about it).
- k-nearest neighbor: A simple but powerful algorithm. It does not scale well, but there are approximate nearest neighbor alternatives that are not quite so pathological.
- CART: This algorithm partitions the data based on a number of predictor variables. It is particularly good if there are variable interactions, or there exists a very good predictor that only works on a subset of the data.
- Least angle regression: if the value that you are trying to predict is continuous and you have a lot of data and a lot of predictors.

This is by no means complete, but should give you a good jumping off point. A very good and accessible book on the subject is Duda, Hart, Stork: Pattern Classification

Also, a big part of statistics is descriptive visualizations and analysis. These are of particular interest to the programmer because they allow him/her to convey information back to the user. In R, ggplot2 is my package of choice for creating visualizations. On the descriptive analysis side (and useful in text analysis) is multi-dimensional scaling, which can give a spacial interpretation of non-spacial data (for example the ideologies of senators http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1223908041).

Boy, some of these answers are good. I came from much the same background and have had to get into biostatistics largely by books and by osmosis from colleagues. Here are my recommendations:

Start with a solid grounding in probability, including conditional probability, Bayes' theorem, Markov models, and some of the basic statistical distributions.

If you don't have it, get some linear algebra, so you don't get scared off by matrices. If you are faced with tricky algebra and calculus, knuckle down and work through it. It's worth it.

Statistics theory falls into two camps, frequentist and Bayesian. Frequentist is older and solid. Bayesian is newer, more flexible, and more exciting. In particular, there are the exciting things that can be done with Markov Chain Monte Carlo and related techniques.

In my area, pharmacometrics, there is high payoff in being able to extract meaningful results from sparse and expensive data, so an ability in statistics is very important.

Added: Here are some favorite books (not a complete list):

"Statistical Distributions", 2nd Edition, by Evans, Hastings, and Peacock. It gives a very complete list of distributions, tells how they are related, how to sample and estimate them. I use it daily.

"Matrices and Linear Algebra", Schneider, and Barker. Very good and easy to follow.

"Bayesian Statistics: An Introduction", 2nd Edition, by Lee. Very concise and easy to read.

"Markov Chain Monte Carlo in Practice", by Gilks, Richardson, and Spiegelhalter. Loved this book, and these are the guys who make WinBugs, a terrific free package.

More probability than statistics, but Bayesian Probabilty can be very useful (it underpins spam filters) and IMO more software should use it to infer a user's habits.

Head First Statistics is an excellent book to learn statistics (a mathematician/statistician informs me that it has not so much a few errors but a few simplications of the theoretical stuff).

I almost forgot to mention: How to Lie with Statistics

Just as a point, not as a critic, but your question should be formulated in a different way: "what statistics should *any* person know?".

Fact is, unfortunately we all deal with statistics. It's a fact of life. Polls, weather forecast, drug effectiveness, insurances, and of course some parts of computer science. Being able to critically analyze the presented data gives the line between picking the right understanding or being scammed, whatever that means.

Said that, I think the following points are important to understand

- mean, median, standard deviation of a sample, and the difference between
*sample*and*population*(this is very important) - the distributions, and why the gaussian distribution is so important (the central limit theorem)
- What it is meant with Null Hypothesis testing.
- What is variable transformation, correlation, regression, multivariate analysis.
- What is bayesian statistics.
- Plotting methods.

All these points are critical not only to you as a computer scientist, but also as a human being. I will give you some examples.

The evaluation of the null hypothesis is critical for testing of the effectiveness of a method. For example, if a drug works, or if a fix to your hardware had a concrete result or it's just a matter of chance. Say you want to improve the speed of a machine, and change the hard drive. Does this change matters? you could do sampling of performance with the old and new hard disk, and check for differences. Even if you find that the average with the new disk is lower, that does not mean the hard disk has an effect at all. Here enters Null hypothesis testing, and it will give you a confidence interval, not a definitive answer, like : there's a 90 % probability that changing the hard drive has a concrete effect on the performance of your machine.

Correlation is important to find out if two entities "change alike". As the internet mantra "correlation is not causation" teaches, it should be taken with care. The fact that two random variables show correlation does not mean that one causes the other, nor that they are related by a third variable (which you are not measuring). They could just behave in the same way. Look for pirates and global warming to understand the point. A correlation reports a

*possible*signal, it does not report a finding.Bayesian. We all know the spam filter. but there's more. Suppose you go to a medical checkup and the result tells you have cancer (I seriously hope not, but it's to illustrate a point). Fact is: most of the people at this point would think "I have cancer". That's not true. A positive testing for cancer moves your probability of having cancer from the baseline for the population (say, 8 per thousands people have cancer, picked out of thin air number) to a higher value, which is not 100 %. How high is this number depends on the accuracy of the test. If the test is lousy, you could just be a false positive. The more accurate the method, the higher is the skew, but still not 100 %. Of course, if multiple independent tests all confirm that you have cancer, then it's very probable you actually have it, but still it's not 100 %. maybe it's 99.999 %. This is a point many people don't understand about bayesian statistics.

Plotting methods. That's another thing that is always left unattended. Analysis of data does not mean anything if you cannot convey effectively what they mean via a simple plot. Depending on what information you want to put into focus, or the kind of data you have, you will prefer a xy plot, a histogram, a violin plot, or a pie chart.

Now, let's go to your questions. I think I overindulged in just a quick note, but since my answer was voted up quite a lot, I feel it's better if I answer properly to your questions as much as my knowledge allows (and here is vacation, so I can indulge as much as I want over it)

What kind of problems in programming, software engineering, and computer science are statistical methods well suited for? Where am I going to get the biggest payoffs?

Normally, everything that has to do with data comparison which involves numerical (or reduced to numerical) input from unreliable sources. A signal from an instrument, a bunch of pages and the number of words they contain. When you get these data, and have to find a distilled answer out of the bunch, then you need statistics. Think for example to the algorithm to perform click detection on the iphone. You are using a trembling, fat stylus to refer to an icon which is much smaller than the stylus itself. Clearly, the hardware (capacitive screen) will send you a bunch of data about the finger, plus a bunch of data about random noise (air? don't know how it works). The driver must make sense out of this mess and give you a x,y coordinate on the screen. That needs (a lot of) statistics.

What kind of statistical methods should I spend my time learning?

The ones I told you are more than enough, also because to understand them, you have to walk through other stuff.

What resources should I use to learn this? Books, papers, web sites. I'd appreciate a discussion of what each book (or other resource) is about, and why it's relevant.

I learned statistics mostly from standard university courses. My first book was the "train wreck book", and it's very good. I also tried this one, which focuses on R but it did not satisfy me particularly. You have to know things and R to get through it.

Programmers frequently need to deal with large databases of text in natural languages, and help to categorize, classify, search, and otherwise process it. What statistical techniques are useful here?

That depends on the question you need to answer using your dataset.

Programmers are frequently asked to produce high-performance systems, that scale well under load. But you can't really talk about performance unless you can measure it. What kind of experimental design and statistical tools do you need to use to be able to say with confidence that the results are meaningful?

There are a lot of issues with measuring. Measuring is a fine and delicate art. Proper measuring is almost beyond human. The fact is that sampling introduces bias, either from the sampler, or from the method, or from the nature of the sample, or from the nature of nature. A good sampler knows these things and tries to reduce unwanted bias as much into a random distribution.

The examples from the blog you posted are relevant. Say you have a startup time for a database. If you take performance measures within that time, all your measures will be biased. There's no statistical method that can tell you this. Only your knowledge of the system can.

Are there other problems commonly encountered by programmers that would benefit from a statistical approach?

Every time you have an ensemble of data producers, you have statistics, so scientific computing and data analysis is obviously one place. Folksonomy and social networking is pretty much all statistics. Even stackoverflow is, in some sense, statistical. The fact that an answer is highly voted does not mean that it's the right one. It means that there's a high probability that is right, according to the evaluation of a statistical ensemble of independent evaluators. How these evaluators behave make the difference between stackoverflow, reddit and digg.

What a great thread. There's plenty of good information in the question itself and in the answers, but I am really surprised nobody has mentioned the book Programming Collective Intelligence yet.

It's the best book I know if you are a novice in this subject (like me) and want to put machine learning and statistics theory into practice.

This book explains:

- Collaborative filtering techniques that enable online retailers to recommend products or media
- Methods of clustering to detect groups of similar items in a large dataset
- Search engine features--crawlers, indexers, query engines, and the PageRank algorithm
- Optimization algorithms that search millions of possible solutions to a problem and choose the best one
Bayesian filtering, used in spam filters for classifying documents based on word types and other features

Using decision trees not only to make predictions, but to model the way decisions are made

- Predicting numerical values rather than classifications to build price models
- Support vector machines to match people in online dating sites
- Non-negative matrix factorization to find the independent features in adataset
- Evolving intelligence for problem solving--how a computer develops its skill by improving its own code the more it plays a game

Apart from that, there's a great talk on TED on why everybody should learn Statistics.

Assume you know a student who wants to study Machine Learning and Natural Language Processing.

What introductory subjects would you recommend?

**Example**: I'm guessing that knowing Prolog and Matlab might help him. He also might want to study Discrete Structures*, Calculus, and Statistics.

*Graphs and trees. Functions: properties, recursive definitions, solving recurrences. Relations: properties, equivalence, partial order. Proof techniques, inductive proof. Counting techniques and discrete probability. Logic: propositional calculus, first-order predicate calculus. Formal reasoning: natural deduction, resolution. Applications to program correctness and automatic reasoning. Introduction to algebraic structures in computing.

This related stackoverflow question has some nice answers: What are good starting points for someone interested in natural language processing?

This is a very big field. The prerequisites mostly consist of probability/statistics, linear algebra, and basic computer science, although Natural Language Processing requires a more intensive computer science background to start with (frequently covering some basic AI). Regarding specific langauges: Lisp was created "as an afterthought" for doing AI research, while Prolog (with it's roots in formal logic) is especially aimed at Natural Language Processing, and many courses will use Prolog, Scheme, Matlab, R, or another functional language (e.g. OCaml is used for this course at Cornell) as they are very suited to this kind of analysis.

Here are some more specific pointers:

For Machine Learning, **Stanford CS 229: Machine Learning** is great: it includes everything, including full videos of the lectures (also up on iTunes), course notes, problem sets, etc., and it was very well taught by Andrew Ng.

Note the prerequisites:

Students are expected to have the following background: Knowledge of basic computer science principles and skills, at a level sufficient to write a reasonably non-trivial computer program. Familiarity with the basic probability theory. Familiarity with the basic linear algebra.

The course uses Matlab and/or Octave. It also recommends the following readings (although the course notes themselves are very complete):

- Christopher Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
- Richard Duda, Peter Hart and David Stork, Pattern Classification, 2nd ed. John Wiley & Sons, 2001.
- Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
- Richard Sutton and Andrew Barto, Reinforcement Learning: An introduction. MIT Press, 1998

For Natural Language Processing, the NLP group at Stanford provides many good resources. The introductory course **Stanford CS 224: Natural Language Processing** includes all the lectures online and has the following prerequisites:

Adequate experience with programming and formal structures. Programming projects will be written in Java 1.5, so knowledge of Java (or a willingness to learn on your own) is required. Knowledge of standard concepts in artificial intelligence and/or computational linguistics. Basic familiarity with logic, vector spaces, and probability.

Some recommended texts are:

- Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Second Edition. Prentice Hall.
- Christopher D. Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
- James Allen. 1995. Natural Language Understanding. Benjamin/Cummings, 2ed.
- Gerald Gazdar and Chris Mellish. 1989. Natural Language Processing in Prolog. Addison-Wesley. (this is available online for
**free**) - Frederick Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.

The prerequisite computational linguistics course requires basic computer programming and data structures knowledge, and uses the same text books. The required articificial intelligence course is also available online along with all the lecture notes and uses:

- S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Second Edition

This is the standard Artificial Intelligence text and is also worth reading.

I use **R** for machine learning myself and really recommend it. For this, I would suggest looking at **The Elements of Statistical Learning**, for which the full text is available online for free. You may want to refer to the Machine Learning and Natural Language Processing views on CRAN for specific functionality.

Jurafsky and Martin's Speech and Language Processing http://www.amazon.com/Speech-Language-Processing-Daniel-Jurafsky/dp/0131873210/ is very good. Unfortunately the draft second edition chapters are no longer free online now that it's been published :(

Also, if you're a decent programmer it's never too early to toy around with NLP programs. NLTK comes to mind (Python). It has a book you can read free online that was published (by OReilly I think).

A developer I am working with is developing a program that analyzes images of pavement to find cracks in the pavement. For every crack his program finds, it produces an entry in a file that tells me which pixels make up that particular crack. There are two problems with his software though:

1) It produces several false positives

2) If he finds a crack, he only finds small sections of it and denotes those sections as being separate cracks.

My job is to write software that will read this data, analyze it, and tell the difference between false-positives and actual cracks. I also need to determine how to group together all the small sections of a crack as one.

I have tried various ways of filtering the data to eliminate false-positives, and have been using neural networks to a limited degree of success to group cracks together. I understand there will be error, but as of now, there is just too much error. Does anyone have any insight for a non-AI expert as to the best way to accomplish my task or learn more about it? What kinds of books should I read, or what kind of classes should I take?

**EDIT** My question is more about how to notice patterns in my coworker's data and identify those patterns as actual cracks. It's the higher-level logic that I'm concerned with, not so much the low-level logic.

**EDIT** In all actuality, it would take AT LEAST 20 sample images to give an accurate representation of the data I'm working with. It varies a lot. But I do have a sample here, here, and here. These images have already been processed by my coworker's process. The red, blue, and green data is what I have to classify (red stands for dark crack, blue stands for light crack, and green stands for a wide/sealed crack).

You should read about data mining, specially pattern mining.

Data mining is the process of extracting patterns from data.As more data are gathered, with the amount of data doubling every three years, data mining is becoming an increasingly important tool to transform these data into information. It is commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud detection and scientific discovery.

A good book on the subject is Data Mining: Practical Machine Learning Tools and Techniques

Basically what you have to do is apply statistical tools and methodologies to your datasets. The most used comparison methodologies are Student's *t-test* and the *Chi squared test*, to see if two unrelated variables are related with some confidence.

What’s the best approach to recognize patterns in data, and what’s the best way to learn more on the topic?

The best approach is to study pattern recognition and machine learning. I would start with Duda's Pattern Classification and use Bishop's Pattern Recognition and Machine Learning as reference. It would take a good while for the material to sink in, but getting basic sense of pattern recognition and major approaches of classification problem should give you the direction. I can sit here and make some assumptions about your data, but honestly you probably have the best idea about the data set since you've been dealing with it more than anyone. Some of the useful technique for instance could be support vector machine and boosting.

**Edit**: An interesting application of boosting is real-time face detection. See Viola/Jones's Rapid Object Detection using a Boosted Cascade of Simple
Features (pdf). Also, looking at the sample images, I'd say you should try improving the edge detection a bit. Maybe smoothing the image with Gaussian and running more aggressive edge detection can increase detection of smaller cracks.

I'm looking for computationally heavy tasks to implement with CUDA and wonder if neural networks or bayesian networks might apply. This is not my question, though, but rather what the relation between the two network types is. They seem very related, especially if you look at bayesian networks with a learning capability (which the article on wikipedia mentions). At a glance, bayesian networks look at bit like a specific type of neural networks. Can anyone sum up their relationship, and if there is any connection beyond the apparent similarity?

Indeed they are. I see a bayesian network as a neural network applying the Baye's Theorem on large scale, but I don't remember details. I know where you can find them and I recommend this book for that.

I am working on a project of recognizing TV Channels. I am taking photos of the channels suck that i try to avoid the background and to take the sample from the center of the logo. I recognize 4 different logos, here are the templates:

**How does my template matching algorithm work:**

Given 4 templates of size 100x100, each representing a different TV Channel, each having a different threshold (of probability).
The user is capturing the logo from the TV set, and then the algorithm is:
- Run 4 independent template matching on each template to receive the probability for each template to match the captured image.
- for every channel probability, if the probability of a channel is lower then the threshold of the channel, the the probability is changed into 0;
- announce the recognized logo to be the one with highest probability. If all probabilities are 0, announce "no recognition".

For example, if i got one channel with probability of 0.85 and threshold of 0.9, and the second channel with probability of 0.8 and threshold of 0.75, then the second channel "wins".

When i take a photo of one of the logos, 95% of the times it recognizes the photos.

**Current results:**

- When trying to detect the first ("smiling face" logo), out of 10 detections i got 10 correct detections. For the template matching between the correct template and the image i get probabilities between 0.91 to 0.94. For the other logos i get probabilities between 0.77 to 0.91.
- When trying to detect the second ("green" logo), out of 10 detections i got 10 correct detections. For the template matching between the correct template and the image i get probabilities between 0.78 to 0.91. For the other logos i get probabilities between 0.71 to 0.83 (but because of high threshold, the detection succeeds).
- When trying to detect the third ("round" logo), out of 10 detections i got 9 correct detections. For the template matching between the correct template and the image i get probabilities between 0.83 to 0.92. For the other logos i get probabilities between 0.73 to 0.91.
- When trying to detect the fourth ("black and white" logo), out of 10 detections i got 10 correct detections. For the template matching between the correct template and the image i get probabilities between 0.91 to 0.94. For the other logos i get probabilities between 0.78 to 0.92.
- When trying to detect a "negative" image, many times i get a logo detection (which is bad). If i take, for example, an image of a complete white sheet, it detects the first, third and fourth logos with probability of over 0.9

How can i improve my algorithm, or change it, to get better results on "Negative" images?

Thanks for helping,

Eyal

It all depends on how you are calculating the channel probabilities from the templates. Are you using histogram of color or histogram of gradient, and then looking at the histogram difference between your templates and the test images?

Another approach would be to compute feature vectors from test images, such as the concatenation of a histogram of gradients and a histogram of color. Then, manually create a training database, in which you know the label (1, 2, 3, or 4 depending on which label is visible in the image) and you can feed the hard-coded labels, along with the histogram features, into a classifier routine. I recommend LIBSVM for this, and the scikits.learn implementation is easy to use for this in Python.

This will yield a support vector machine classifier that will compare the feature vector of new images with the support vectors from the training set, and determine the right label that is most likely present in the image. You can then fit a logistic model over top of this SVM if you want something that yields probabilities rather than just predicted labels.

Two good books to read to get started in this type of machine learning are Pattern Classification, by Duda, Hart, and Stork, and Pattern Recognition and Machine Learning by Bishop.

Some messy Python code that I wrote for implementing Poselets and Histogram of Oriented Gradients in Python can be found linked here; maybe you can grab some sections of code in there and it will be suitable for your task.

I was reading stuffs about pattern recognition. Recently I want to make a survey of methods to evaluate similarities of vectors. As far as I know, there are Euclidean distances, Mahalanobis distances and Cosine Distance. Can anyone present some more names or keywords to search?

Also mutual neighbor distance (MND), Minkowski metric, Hausdorff distance, conceptual similarity, normalized Google distance, KL divergence, Spearman’s rank correlation, and Lin similarity. (Not all of these are vector based.)

I highly recommend Pattern Classification by Duda, Hart, and Stork for further reading. It is extensively cited.

I'm trying to build an app to detect images which are advertisements from the webpages. Once I detect those I`ll not be allowing those to be displayed on the client side.

Basically I'm using Back-propagation algorithm to train the neural network using the dataset given here: http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements.

But in that dataset no. of attributes are very high. In fact one of the mentors of the project told me that If you train the Neural Network with that many attributes, it'll take lots of time to get trained. So is there a way to optimize the input dataset? Or I just have to use that many attributes?

If you're actually using a backpropagation network with 1558 input nodes and only 3279 samples, then the training time is the least of your problems: Even if you have a very small network with only one hidden layer containing 10 neurons, you have 1558*10 weights between the input layer and the hidden layer. How can you expect to get a good estimate for 15580 degrees of freedom from only 3279 samples? (And that simple calculation doesn't even take the "curse of dimensionality" into account)

You have to analyze your data to find out how to optimize it. Try to understand your input data: Which (tuples of) features are (jointly) statistically significant? (use standard statistical methods for this) Are some features redundant? (Principal component analysis is a good stating point for this.) Don't expect the artificial neural network to do that work for you.

Also: remeber Duda&Hart's famous "no-free-lunch-theorem": No classification algorithm works for every problem. And for any classification algorithm X, there is a problem where flipping a coin leads to better results than X. If you take this into account, deciding what algorithm to use before analyzing your data might not be a smart idea. You might well have picked the algorithm that actually performs worse than blind guessing on your specific problem! (By the way: Duda&Hart&Storks's book about pattern classification is a great starting point to learn about this, if you haven't read it yet.)

I am doing my graduation project in the field of computer vision, and i have only taken one course in statistics that discussed very basic concepts, and now i am facing more difficulty in rather advanced topics, so i need help (book, tutorial, course, ..etc) to grasp and review the basic ideas and concepts in statistics and then dive into the details (**statistical details**) used in computer vision.