I realize this is a broad topic, but I'm looking for a good primer on parsing meaning from text, ideally in Python. As an example of what I'm looking to do, if a user makes a blog post like:
"Manny Ramirez makes his return for the Dodgers today against the Houston Astros",
what's a lightweight, easy way of getting the nouns out of a sentence? To start, I think I'd limit it to proper nouns, but I wouldn't want to be limited to just that (and I don't want to rely on a simple regex that assumes anything Title Capped is a proper noun).
To make this question even worse, what are the things I'm not asking that I should be? Do I need a corpus of existing words to get started? What lexical analysis stuff do I need to know to make this work? I did come across one other question on the topic and I'm digging through those resources now.
Here is the book I stumbled upon recently: Natural Language Processing with Python
I run a website that allows users to write blog posts. I would really like to summarize the written content and use it to fill the
<meta name="description".../> tag, for example.
What methods can I employ to automatically summarize/describe the contents of user generated content?
Are there any (preferably free) methods out there that have solved this problem?
(I've seen other websites just copy the first 100 or so words but this strikes me as a sub-optimal solution.)
Not a trivial task. You should look for articles or books on "extractive summarization".
A few starters could be:
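To make "extractive summarization" concrete, here is a minimal sketch of one common baseline: score each sentence by the average frequency of its words and keep the top-scoring sentences in their original order. The function name `summarize`, the tiny stop-word list, and the regex-based sentence splitter are all assumptions made for illustration; a real system would use a proper tokenizer and a much larger stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "with", "that", "this", "was", "are"}


def content_words(text):
    """Lowercase the text and return its words, minus stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        if not tokens:
            return 0.0
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in tokens) / len(tokens)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore document order
    return " ".join(sentences[i] for i in chosen)


if __name__ == "__main__":
    post = "Python is great. Python parsing with Python tools is fun. I like tea."
    print(summarize(post, max_sentences=1))
```

The result could then be truncated and HTML-escaped before being placed in a description tag. Frequency scoring is only a starting point; it tends to favor sentences that repeat the post's most common words.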
I want to detect words in text, i.e. I need to know which characters in a given text are letters (that is, they can be part of a spoken word) and which are punctuation and the like.
For example, in the above sentence, "I", "want", "i" and "e" are words in this regard, while spaces, "." and the comma are not.
The difficulty is that I want to be able to read any kind of script that's based on Unicode. E.g., the German word "schön" is one word. But what about Greek, Arabic or Japanese?
So, what I need is a table or list specifying all ranges of characters that can form words. Optionally, I would also like to know which characters are digits that can form numbers (assuming other scripts have numbering schemes similar to the Arabic numerals).
I need this for Mac OS X, Windows and Linux. I'll write a C app, so it needs to be either an OS library or a complete code/data solution that I could translate into C.
I know that Mac OS (Cocoa) offers functions for this purpose, but I am not sure if there are similar solutions for Windows and Linux (GTK-based, probably?).
Alternatively, I could write my own code if I had the complete tables.
I have found the Unicode charts (http://unicode.org/charts/index.html#scripts), but they do not come in one convenient form that I could use in programming.
So, can someone tell me if there are functions for Windows and Linux for this purpose, or where I can find a complete table/list of word characters in Unicode?
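One way to see what such a table looks like is through Unicode general categories, which the Unicode Character Database publishes per code point (for example in UnicodeData.txt) and which Python exposes through its standard `unicodedata` module. The sketch below treats letters (categories starting with `L`) and combining marks (`M`) as word characters and `Nd` as decimal digits. The helper names are hypothetical, and this is only an approximation: scripts written without spaces between words, such as Japanese, come out as one long run and need real word segmentation.

```python
import unicodedata


def is_word_char(ch):
    """True for letters (L*) and combining marks (M*), in any script."""
    return unicodedata.category(ch)[0] in ("L", "M")


def is_digit_char(ch):
    """True for decimal digits (Nd), including non-ASCII ones."""
    return unicodedata.category(ch) == "Nd"


def split_words(text):
    """Collect maximal runs of word characters; everything else separates."""
    words, current = [], []
    for ch in text:
        if is_word_char(ch):
            current.append(ch)
        elif current:
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(split_words("Das ist schön, oder?"))
    print(is_digit_char("3"), is_digit_char("\u0663"))  # ASCII 3, Arabic-Indic 3
```

The same category data could be turned into a lookup table for a C program, since it is plain text in the Unicode Character Database.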
If you are familiar with Python at all, the Natural Language Toolkit (NLTK) provides chunkers and lexical tools that will do this across languages. I'd pretend to be smart here and tell you more, but everything I know is out of this book, which I highly recommend. I realize you could code up a technical solution with a regex that would get you 80% of the way to where you want to be, but why reinvent the wheel?
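As a rough sketch of how noun extraction with NLTK might look: `nltk.word_tokenize` and `nltk.pos_tag` produce `(word, tag)` pairs with Penn Treebank tags, where noun tags start with `NN` and proper-noun tags with `NNP`. The helper names `tag_sentence` and `extract_nouns` are hypothetical, NLTK is a third-party package whose tokenizer and tagger models must be downloaded first (the resource names vary by NLTK version), and the tagged sample below is hand-written for illustration rather than actual tagger output.

```python
def tag_sentence(sentence):
    """Tokenize and POS-tag a sentence with NLTK (imported lazily).

    Assumes NLTK is installed and its tokenizer and tagger models have
    been fetched with nltk.download(...).
    """
    import nltk  # third-party dependency
    return nltk.pos_tag(nltk.word_tokenize(sentence))


def extract_nouns(tagged, proper_only=False):
    """Keep words whose Penn Treebank tag marks a noun.

    NN/NNS are common nouns; NNP/NNPS are proper nouns.
    """
    prefix = "NNP" if proper_only else "NN"
    return [word for word, tag in tagged if tag.startswith(prefix)]


# Hand-tagged sample (illustrative only, not real tagger output).
SAMPLE = [("Manny", "NNP"), ("Ramirez", "NNP"), ("makes", "VBZ"),
          ("his", "PRP$"), ("return", "NN"), ("for", "IN"), ("the", "DT"),
          ("Dodgers", "NNPS"), ("today", "NN"), ("against", "IN"),
          ("the", "DT"), ("Houston", "NNP"), ("Astros", "NNP")]

if __name__ == "__main__":
    print(extract_nouns(SAMPLE, proper_only=True))
    print(extract_nouns(SAMPLE))
```

Because the tagger looks at context rather than capitalization alone, this avoids the "anything Title Capped is a proper noun" assumption, though a statistical tagger will still make mistakes.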
It would be used to analyze documents on the Internet!
Could you please provide more information on why NLTK is insufficient, or on what features a framework needs for you to consider it the "best"?
Nevertheless, there is the built-in shlex module, a lexical analyzer for simple shell-like syntaxes (not for natural language).
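For reference, a short sketch of what `shlex` does: it splits a string the way a POSIX shell would, keeping quoted phrases together. It knows nothing about parts of speech or sentence structure, so it helps mainly with query-like or command-like input.

```python
import shlex

# Quoted phrases stay together as single tokens; quotes are removed.
tokens = shlex.split('find "Manny Ramirez" Dodgers')
print(tokens)  # ['find', 'Manny Ramirez', 'Dodgers']
```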
There is also a recent book on the subject, Natural Language Processing with Python. It looks like at least part of it covers NLTK.