Foundations of Statistical Natural Language Processing

Christopher D. Manning, Hinrich Schütze

Mentioned 9

An introduction to statistical natural language processing (NLP). The text contains the theory and algorithms needed for building NLP tools. Topics covered include: mathematical and linguistic foundations; statistical methods; collocation finding; word sense disambiguation; and probalistic parsing.

More on

Mentioned in questions and answers.

Possible Duplicate:
How does the Google “Did you mean?” Algorithm work?

Suppose you have a search system already in your website. How can you implement the "Did you mean:<spell_checked_word>" like Google does in some search queries?

Actually what Google does is very much non-trivial and also at first counter-intuitive. They don't do anything like check against a dictionary, but rather they make use of statistics to identify "similar" queries that returned more results than your query, the exact algorithm is of course not known.

There are different sub-problems to solve here, as a fundamental basis for all Natural Language Processing statistics related there is one must have book: Foundation of Statistical Natural Language Processing.

Concretely to solve the problem of word/query similarity I have had good results with using Edit Distance, a mathematical measure of string similarity that works surprisingly well. I used to use Levenshtein but the others may be worth looking into.

Soundex - in my experience - is crap.

Actually efficiently storing and searching a large dictionary of misspelled words and having sub second retrieval is again non-trivial, your best bet is to make use of existing full text indexing and retrieval engines (i.e. not your database's one), of which Lucene is currently one of the best and coincidentally ported to many many platforms.

Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text. The goal is to get a "general feel" of what people are saying over a set of textual comments. Along the lines of Wordle.

What I'd like:

  • ignore articles, pronouns, etc ('a', 'an', 'the', 'him', 'them' etc)
  • preserve proper nouns
  • ignore hyphenation, except for soft kind

Reaching for the stars, these would be peachy:

  • handling stemming & plurals (e.g. like, likes, liked, liking match the same result)
  • grouping of adjectives (adverbs, etc) with their subjects ("great service" as opposed to "great", "service")

I've attempted some basic stuff using Wordnet but I'm just tweaking things blindly and hoping it works for my specific data. Something more generic would be great.

Welcome to the world of NLP ^_^

All you need is a little basic knowledge and some tools.

There are already tools that will tell you if a word in a sentence is a noun, adjective or verb. They are called part-of-speech taggers. Typically, they take plaintext English as input, and output the word, its base form, and the part-of-speech. Here is the output of a popular UNIX part-of-speech tagger on the first sentence of your post:

$ echo "Without getting a degree in information retrieval, I'd like to know if there exists any algorithms for counting the frequency that words occur in a given body of text." | tree-tagger-english 
# Word  POS     surface form
Without IN  without
getting VVG get
a   DT  a
degree  NN  degree
in  IN  in
information NN  information
retrieval   NN  retrieval
,   ,   ,
I   PP  I
'd  MD  will
like    VV  like
to  TO  to
know    VV  know
if  IN  if
there   EX  there
exists  VVZ exist
any DT  any
algorithms  NNS algorithm
for IN  for
counting    VVG count
the DT  the
frequency   NN  frequency
that    IN/that that
words   NNS word
occur   VVP occur
in  IN  in
a   DT  a
given   VVN give
body    NN  body
of  IN  of
text    NN  text
.   SENT    .

As you can see, it identified "algorithms" as being the plural form (NNS) of "algorithm" and "exists" as being a conjugation (VBZ) of "exist." It also identified "a" and "the" as "determiners (DT)" -- another word for article. As you can see, the POS tagger also tokenized the punctuation.

To do everything but the last point on your list, you just need to run the text through a POS tagger, filter out the categories that don't interest you (determiners, pronouns, etc.) and count the frequencies of the base forms of the words.

Here are some popular POS taggers:

TreeTagger (binary only: Linux, Solaris, OS-X)
GENIA Tagger (C++: compile your self)
Stanford POS Tagger (Java)

To do the last thing on your list, you need more than just word-level information. An easy way to start is by counting sequences of words rather than just words themselves. These are called n-grams. A good place to start is UNIX for Poets. If you are willing to invest in a book on NLP, I would recommend Foundations of Statistical Natural Language Processing.

I am planning to learn natural language processing this year.

But when I start reading introductory books on this topic, I found that I miss a lot of points relating mainly to mathematics.

So I'm here searching for what I should learn before I can learn nlp, well, more smoothly?

Thanks in advance.

There are two main approaches to NLP right now - one is the language-based approach detailed by Jurafsky and Martin (Speech and Language Processing) and the other is a probability and statistics-based approach (Foundations of Statistical Natural Language Processing).

Most people that I've talked to tend to prefer the latter as far as ease of ramping up and useful results. So I would recommend going over probability theory first and then tackling an NLP book (like the second one I linked to, which I am actually using on a project right now with pretty good results).

While I agree with laura that formal language theory is highly useful, I actually think that currently if you just want to get into the actual NL parts of NLP, you can leave formal languages for later as there are enough tools that will do your lexical analysis / parsing / tokenizing / text transformations that you can use those rather than roll your own.

Here is a book describing three such tools - I own it and recommend it as a good introduction to all three. Building Search Applications: Lucene, LingPipe, and Gate

Edit: in response to your question, I would say that the first step would be to get a thorough grounding in the basics of probability (the first 3-5 chapters of any undergrad prob/stats book should be fine), and then from there look up new topics as they come up in the NLP book. For instance, yesterday I had to learn about t-values or something (I'm bad with names) because they happened to be relevant to determining incidence of collocation.

I'm interested in learning more about Natural Language Processing (NLP) and am curious if there are currently any strategies for recognizing proper nouns in a text that aren't based on dictionary recognition? Also, could anyone explain or link to resources that explain the current dictionary-based methods? Who are the authoritative experts on NLP or what are the definitive resources on the subject?

The task of determining the proper part of speech for a word in a text is called Part of Speech Tagging. The Brill tagger, for example, uses a mixture of dictionary(vocabulary) words and contextual rules. I believe that some of the important initial dictionary words for this task are the stop words. Once you have (mostly correct) parts of speech for your words, you can start building larger structures. This industry-oriented book differentiates between recognizing noun phrases (NPs) and recognizing named entities. About textbooks: Allen's Natural Language Understanding is a good, but a bit dated, book. Foundations of Statistical Natural Language Processing is a nice introduction to statistical NLP. Speech and Language Processing is a bit more rigorous and maybe more authoritative. The Association for Computational Linguistics is a leading scientific community on computational linguistics.

I know this is not a straight up question, so if you need me to provide more information about the scope of it, let me know. There are a bunch of questions that address almost the same issue (they are linked here), but never the exact same one with the same kind of scope and objective - at least as far as I know.


  • I have a MP3 file with ID3 tags for artist name and song title.
  • I have two tables Artists and Songs
  • The ID3 tags might be slightly off (e.g. Mikaell Jacksonne)
  • I'm using ASP.NET + C# and a MSSQL database

I need to synchronize the MP3s with the database. Meaning:

  1. The user launches a script
  2. The script browses through all the MP3s
  3. The script says "Is 'Mikaell Jacksonne' 'Michael Jackson' YES/NO"
  4. The user pick and we start over

Examples of what the system could find:

In the database...

SONGS = {"This is a great song title", "This is a song title"}
ARTISTS = {"Michael Jackson"}


"This is a grt song title" did you mean "This is a great song title" ?
"This is song title" did you mean "This is a song title" ?
"This si a song title"  did you mean "This is a song title" ?
"This si song a title"  did you mean "This is a song title" ?
"Jackson, Michael" did you mean "Michael Jackson" ?
"JacksonMichael" did you mean "Michael Jackson" ?
"Michael Jacksno" did you mean "Michael Jackson" ?


I read some documentation from this /how-do-you-implement-a-did-you-mean and this is not exactly what I need since I don't want to check an entire dictionary. I also can't really use a web service since it's depending a lot on what I already have in my database. If possible I'd also like to avoid dealing with distances and other complicated things.

I could use the google api (or something similar) to do this, meaning that the script will try spell checking and test it with the database, but I feel there could be a better solution since my database might end up being really specific with weird songs and artists, making spell checking useless.

I could also try something like what has been explained on this post, using Soundex for c#.

Using a regular spell checker won't work because I won't be using words but names and 'titles'.

So my question is: is there a relatively simple way of doing this, and if so, what is it?

Any kind of help would be appreciated.


What you want is a similarity factor. Essentially, you want to compare your inputs ("Micheal Jackson", for example) to your expected values ("Michael Jackson"); if you score a very high similarity value to one of your expected values, you can ask the user.

One way of doing this is to hash the expected values into a fully packed hashtable. If you get your hashing algorithm right (and yes, this is the tricky bit), each input will hash to the closest expected value; once you've found the closest expected value, you can run a similarity evaluation against the input and that expected value; if you're above a certain threshold, ask the user.

I run a website that allows users to write blog-post, I would really like to summarize the written content and use it to fill the <meta name="description".../>-tag for example.

What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?

(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)

At least 3 types of n-grams can be considered for representing text documents:

  • byte-level n-grams
  • character-level n-grams
  • word-level n-grams

It's unclear to me which one should be used for a given task (clustering, classification, etc). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".

Are there other criteria to consider for choosing the "right" representation?

I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.

Of the 2 remaining levels, the character-level n-grams will need much less storage space and will , subsequently, hold much less information. They are usually utilized in such tasks as language identification, writer identification (i.e. fingerprinting), anomaly detection.

As for word-level n-grams, they may serve the same purposes, and much more, but they need much more storage. For instance, you'll need up to several gigabytes to represent in memory a useful subset of English word 3-grams (for general-purpose tasks). Yet, if you have a limited set of texts you need to work with, word-level n-grams may not require so much storage.

As for the issue of errors, a sufficiently large word n-grams corpus will also include and represent them. Besides, there are various smoothing methods to deal with sparsity.

There other issue with n-grams is that they will almost never be able to capture the whole needed context, so will only approximate it.

You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.

I am interested in doing a project on document classification and have been looking for books that could be useful for the theoretical parts in text mining related to this or examples of articles describing the process of going from training data with documents classified (with subcategories) to a system which predicts the class of a document. There seem to be some (rather expensive!) titles available but these are conference proceedings with articles on smaller very specific topics. Can someone suggest books from the data mining literature that provides a good theoretical basis for a project on text mining, specifically document classification or articles with an overview of this process ?

Christopher D. Manning, Prabhakar Raghavan & Hinrich Schütze have a free information retrieval book. Try chapter 13 - Text classification & Naive Bayes.

See also the companion site for Manning and Schütze's nlp book, specifically links for the text categorization chapter.

Fabrizio Sebastiani wrote a useful tutorial about text categorization(PDF) and review paper of machine learning for text categorization (PDF).

Natural language processing (NLP) is a subfield of artificial intelligence that involves transforming or extracting useful information from natural language data. Methods include machine-learning and rule-based approaches. It is often regarded as the engineering arm of Computational Linguistics.

NLP tasks

Beginner books on Natural Language Processing

Popular software packages