Algorithms of the Intelligent Web

Haralambos Marmanis, Dmitry Babenko


Provides information on creating applications that collect, analyze, and act on the data that is left by users on the Web.


Mentioned in questions and answers.

I've asked a similar question on Meta Stack Overflow, but that one deals specifically with whether or not Lucene.NET is used on Stack Overflow.

The purpose of the question here is more hypothetical: what approaches would one take if one were to use Lucene.NET as a basis for in-site search and other factors on a site like Stack Overflow [SO]?

As per the entry on the Stack Overflow blog titled "SQL 2008 Full-Text Search Problems" there was a strong indication that Lucene.NET was being considered at some point, but it appears that is definitely not the case, as per the comment by Geoff Dalgas on February 19th 2010:

Lucene.NET is not being used for Stack Overflow - we are using SQL Server Full Text indexing. Search is an area where we continue to make minor tweaks.

So my question is, how would one utilize Lucene.NET in a site which has the same semantics as Stack Overflow?

Here is some background and what I've done/thought about so far (yes, I've been implementing most of this and search is the last aspect I have to complete):


And of course, the star of the show, Lucene.NET.

The intention is also to move to .NET/C# 4.0 ASAP. While I don't think it's a game-changer, it should be noted.

Before getting into aspects of Lucene.NET, it's important to point out the SQL Server 2008 aspects of it, as well as the models involved.


Unlike Stack Overflow, this system has more than one primary model type. Some examples of these models are:

  • Questions: These are questions that people can ask. People can reply to questions, just like on Stack Overflow.
  • Notes: These are one-way projections, so as opposed to a question, you are making a statement about content. People can't post replies to this.
  • Events: This is data about a real-time event. It has location information, date/time information.

The important thing to note about these models:

  • They all have a Name/Title (text) property and a Body (HTML) property (the formats are irrelevant, as the content will be parsed appropriately for analysis).
  • Every instance of a model has a unique URL on the site

Then there are the things that Stack Overflow provides which, IMO, are decorators on the models. These decorators can have different cardinalities, either one-to-one or one-to-many:

  • Votes: Keyed on the user
  • Replies: Optional, as an example, see the Notes case above
  • Favorited: Is the model listed as a favorite of a user?
  • Comments: (optional)
  • Tag Associations: Tags are in a separate table, so as not to replicate the tag for each model. There is a link between the model and the tag associations table, and then from the tag associations table to the tags table.

And there are supporting tallies which in themselves are one-to-one decorators to the models that are keyed to them in the same way (usually by a model id type and the model id):

  • Vote tallies: Total positive and negative votes, plus the Wilson score interval (this is important; it's going to determine the confidence level based on votes for an entry. For the most part, assume the lower bound of the Wilson interval).
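As a sketch, the lower bound mentioned above might be computed like this (function name and the ~95% confidence level are illustrative assumptions):

```python
import math

def wilson_lower_bound(up_votes, down_votes, z=1.96):
    """Lower bound of the Wilson score interval; z = 1.96 gives ~95%
    confidence. Returns 0.0 when there are no votes at all."""
    n = up_votes + down_votes
    if n == 0:
        return 0.0
    p = up_votes / n
    denominator = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    return (centre - margin) / denominator
```

Note that any all-negative tally yields a lower bound of (essentially) zero, which is exactly the degenerate case the question worries about further down.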

Replies (answers) are models that have most of the decorators that other models have; they just don't have a title or URL, and whether a model can have replies is optional. If replies are allowed, it is of course a one-to-many relationship.

SQL Server 2008

The tables pretty much follow the layout of the models above, with separate tables for the decorators, as well as some supporting tables and views, stored procedures, etc.

It should be noted that the decision to not use full-text search is based primarily on the fact that it doesn't normalize scores like Lucene.NET. I'm open to suggestions on how to utilize text-based search, but I will have to perform searches across multiple model types, so keep in mind I'm going to need to normalize the score somehow.


This is where the big question mark is. Here are my thoughts so far on Stack Overflow functionality as well as how and what I've already done.



I believe each model should have an index of its own containing a unique id to quickly look it up based on a Term instance of that id (indexed, not analyzed).

In this area, I've considered having Lucene.NET analyze each question/model and each reply individually. So if there was one question and five answers, the question and each of the answers would be indexed as one unit separately.

The idea here is that the relevance score that Lucene.NET returns would be easier to compare between models that project in different ways (say, something without replies).

As an example, a question sets the subject, and then the answer elaborates on the subject.

For a note, which doesn't have replies, it handles the matter of presenting the subject and then elaborating on it.

I believe that this will help with making the relevance scores more relevant to each other.


Initially, I thought that these should be kept in a separate index with multiple fields which hold the ids of the documents in the appropriate model index. Or, if that's too large, there could be an index with just the tags and another index which maintains the relationship between the tags index and the questions they are applied to. This way, when you click on a tag (or use the URL structure), each step is a progressive lookup that you only have to "buy into" if the previous step succeeds:

  • If the tag exists
  • Which questions the tags are associated with
  • The questions themselves

However, in practice, doing a query of all items based on tags (like clicking on a tag in Stack Overflow) is extremely easy with SQL Server 2008. Based on the model above, it simply requires a query such as:

    select
        m.Name, m.Body
    from
        Models as m
            left outer join TagAssociations as ta on
                ta.ModelTypeId = <fixed model type id> and
                ta.ModelId = m.Id
            left outer join Tags as t on t.Id = ta.TagId
    where
        t.Name = <tag>

And since certain properties are shared across all models, it's easy enough to do a UNION between different model types/tables and produce a consistent set of results.

This would be analogous to a TermQuery in Lucene.NET (I'm referencing the Java documentation since it's comprehensive, and Lucene.NET is meant to be a line-by-line translation of Lucene, so all the documentation is the same).

The issue that comes up with using Lucene.NET here is that of sort order. The relevance score for a TermQuery when it comes to tags is irrelevant. It's either 1 or 0 (it either has it or it doesn't).

At this point, the confidence score (Wilson score interval) comes into play for ordering the results.

This score could be stored in Lucene.NET, but in order to sort the results on this field, it would rely on the values being stored in the field cache, which is something I really, really want to avoid. For a large number of documents, the field cache can grow very large (the Wilson score is a double, and you would need one double for every document, that can be one large array).

Given that I can change the SQL statement to order based on the Wilson score interval like this:

    select
        m.Name, m.Body
    from
        Models as m
            left outer join TagAssociations as ta on
                ta.ModelTypeId = <fixed model type id> and
                ta.ModelId = m.Id
            left outer join Tags as t on t.Id = ta.TagId
            left outer join VoteTallyStatistics as s on
                s.ModelTypeId = ta.ModelTypeId and
                s.ModelId = ta.ModelId
    where
        t.Name = <tag>
    order by
        -- use Id to break ties
        s.WilsonIntervalLowerBound desc, m.Id

It seems like an easy choice to use this to handle the piece of Stack Overflow functionality "get all items tagged with <tag>".


Originally, I thought this is in a separate index of its own, with a key back into the Questions index.

I think that there should be a combination of each model and each reply (if there is one) so that relevance scores across different models are more "equal" when compared to each other.

This would of course bloat the index. I'm somewhat comfortable with that right now.

Or, is there a way to store say, the models and replies as individual documents in Lucene.NET and then take both and be able to get the relevance score for a query treating both documents as one? If so, then this would be ideal.

There is of course the question of which fields would be stored, indexed, or analyzed (these can be separate operations, or mixed and matched). Just how much would one index?

What about using special stemmers and phonetic algorithms for spelling mistakes (using Metaphone), as well as synonyms (the community I will serve has its own slang/terminology, with multiple representations for certain things)?


This is related to indexing, of course, but I think it merits its own section.

Are you boosting fields and/or documents? If so, how do you boost them? Is the boost constant for certain fields? Or is it recalculated for fields where vote/view/favorite/external data is applicable?

For example, in the document, does the title get a boost over the body? If so, what boost factors do you think work well? What about tags?

The thinking here is along the same lines as Stack Overflow. Terms in the document have relevance, but if a document is tagged with the term, or the term appears in the title, then it should be boosted.

Shashikant Kore suggests a document structure like this:

  • Title
  • Question
  • Accepted Answer (Or highly voted answer if there is no accepted answer)
  • All answers combined

And then using boost but not based on the raw vote value. I believe I have that covered with the Wilson Score interval.

The question is, should the boost be applied to the entire document? I'm leaning towards no on this one, because it would mean I'd have to reindex the document each time a user voted on the model.

Search for Items Tagged

I originally thought that when querying for a tag (by specifically clicking on one or using the URL structure for looking up tagged content), it would be a simple TermQuery against the tag index for the tag, then into the associations index (if necessary), then back to questions; Lucene.NET handles this really quickly.

However, given the notes above regarding how easy it is to do this in SQL Server, I've opted for that route when it comes to searching tagged items.

General Search

So now, the most outstanding question is: when doing a general phrase or term search against content, how do you integrate other information (such as votes) to determine the proper order of the results? For example, when performing this search on ASP.NET MVC on Stack Overflow, these are the tallies for the top five results (when using the relevance tab):

    q votes answers accepted answer votes highlights mvc highlights
    ------- ------- --------------------- ------------------ --------------
         21      26                    51                  2              2
         58      23                    70                  2              5
         29      24                    40                  3              4
         37      15                    25                  1              2
         59      23                    47                  2              2

Note that the highlights are only in the title and abstract on the results page and are only minor indicators as to what the true term frequency is in the document, title, tag, reply (however they are applied, which is another good question).

How is all of this brought together?

At this point, I know that Lucene.NET will return a normalized relevance score, and the vote data will give me a Wilson score interval which I can use to determine the confidence score.

How should I look at combining these two scores to determine the sort order of the result set based on relevance and confidence?

It is obvious to me that there should be some relationship between the two, but what that relationship should be evades me at this point. I know I have to refine it as time goes on, but I'm really lost on this part.

My initial thought is that if the relevance score is between 0 and 1 and the confidence score is between 0 and 1, then I could do something like this:

1 / ((e ^ cs) * (e ^ rs))

This way, one gets a normalized value that approaches 0 the more relevant and confident the result is, and it can be sorted on that.
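As a sketch, that formula reduces to e^-(cs + rs), and results would be sorted ascending on it (names here are illustrative):

```python
import math

def rank_key(relevance_score, confidence_score):
    """1 / (e^cs * e^rs) == e^-(cs + rs): approaches 0 as relevance
    and confidence grow, so results are sorted ascending on this key."""
    return 1.0 / (math.exp(confidence_score) * math.exp(relevance_score))
```

For example, `results.sort(key=lambda r: rank_key(r.relevance, r.confidence))` would put the most relevant, most confident items first.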

The main issue with that is that if boosting is performed on the tag and/or title field, then the relevance score falls outside the bounds of 0 to 1 (the upper end becomes unbounded, and I don't know how to deal with that).

Also, I believe I will have to adjust the confidence score to account for vote tallies that are completely negative. Since vote tallies that are completely negative result in a Wilson score interval with a lower bound of 0, something with -500 votes has the same confidence score as something with -1 vote, or 0 votes.

Fortunately, the upper bound decreases from 1 to 0 as negative vote tallies go up. I could change the confidence score to be a range from -1 to 1, like so:

confidence score = votetally < 0 ? 
    -(1 - wilson score interval upper bound) :
    wilson score interval lower bound

The problem with this is that plugging in 0 into the equation will rank all of the items with zero votes below those with negative vote tallies.

To that end, I'm thinking if the confidence score is going to be used in a reciprocal equation like above (I'm concerned about overflow obviously), then it needs to be reworked to always be positive. One way of achieving this is:

confidence score = 0.5 + 
    (votetally < 0 ? 
        -(1 - wilson score interval upper bound) :
        wilson score interval lower bound) / 2
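A direct translation of that adjustment into code (a sketch; argument names are illustrative):

```python
def adjusted_confidence(vote_tally, wilson_lower, wilson_upper):
    """Shift the signed confidence from [-1, 1] into [0, 1]: items with
    zero votes sit at the 0.5 baseline, negative tallies fall below it,
    and positive tallies rise above it."""
    if vote_tally < 0:
        signed = -(1.0 - wilson_upper)
    else:
        signed = wilson_lower
    return 0.5 + signed / 2.0
```

With this shift, a zero-vote item (lower bound 0) scores exactly 0.5, which keeps it above any negatively voted item.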

My other concerns are how to actually perform the calculation given Lucene.NET and SQL Server. I'm hesitant to put the confidence score in the Lucene index because it requires use of the field cache, which can have a huge impact on memory consumption (as mentioned before).

An idea I had was to get the relevance score from Lucene.NET and then use a table-valued parameter to stream the scores to SQL Server (along with the ids of the items to select), at which point I'd perform the calculation with the confidence score and return the data properly ordered.

As stated before, I have a lot of other questions about this; the answers have started to frame things, and will continue to expand upon them as the questions and answers evolve.

The answers you are looking for really cannot be found using Lucene alone. You need ranking and grouping algorithms to filter and understand the data and how it relates. Lucene can help you get normalized data, but you need the right algorithm after that.

I would recommend you check out one or all of the following books, they will help you with the math and get you pointed in the right direction:

Algorithms of the Intelligent Web

Collective Intelligence in Action

Programming Collective Intelligence

I am developing an e-shop where I will sell food. I want to have a suggestion box where I would suggest what else my users could buy based on what they already have in their cart. If they have beer, I want to suggest chips and other items, in descending order of the probability that they'll buy them too. And I want my algorithm to learn to suggest groceries based on all users' previous purchases. Where should I start? I have a groceries table with user_id, item_id, date and similar columns. How can I build a suggestion box without brute-forcing, which is impossible?

Humm... you are looking for a product recommendation engine then... Well, they come, basically, in three flavours:

  • Collaborative filtering
  • Content-based filtering
  • Hybrid recommender systems

The first one gathers and stores data on your users' activities, preferences, behavior, etc. This data is then fed into an engine that separates it into user channels. Each channel has certain characteristic likes and dislikes. So, when you have a new visitor, they will be classified and assigned a specific user profile, and items will be displayed based on that profile's likes/dislikes.

Now, content-based filtering uses a different approach - a less social one - by taking into account ONLY your user's previous browsing history, preferences, and activities. Essentially, this will create recommendations based on what this user has previously liked/purchased.

But why choose just one of them, right? Hybrid recommender systems use a bit of both to provide personalized yet social recommendations. These are usually more accurate when it comes to providing recommendations.

I think that collaborative filtering is a great option when you have a big influx of users - it's kinda hard to build good channels with only 42 users/month accessing your website. The second option, based on content, is better for a small site with plenty of products - however, IMHO, the third one is the one for you: build something that will get users going from the start, and gather all the data they generate so that, in the future, you can offer an Amazon-like recommendation experience!

Building one of these is no easy task, as I'm sure you already know, but I strongly recommend this book (a recommendation from my own personal-history filtering!), which has really come through for me in the past:
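For the grocery question above, the simplest collaborative-filtering building block is an item-to-item co-occurrence count over past orders. A minimal sketch (data shapes and names are assumptions):

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_counts(baskets):
    """Count how often each pair of items was purchased together.
    `baskets` is an iterable of sets of item ids, one set per order."""
    counts = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def suggest(counts, cart, top_n=3):
    """Rank items not already in the cart by how often they were
    bought together with the items that are in it."""
    scores = defaultdict(int)
    for item in cart:
        for other, c in counts[item].items():
            if other not in cart:
                scores[other] += c
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

The counts can be precomputed offline from the groceries table, so the per-request work is only the small `suggest` lookup.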

Good luck and good learning!

I've looked at Algorithms of the Intelligent Web, which describes (page 55) an interesting algorithm called DocRank for creating a PageRank-like score for business documents (i.e. documents without links, like PDF and MS Word documents). In short, it analyzes the term frequency intersection between each document in a collection.

Can anyone else identify interesting algorithms described elsewhere, or wants to share something novel here, to apply against these types of documents to improve search results?

Please forgo answers involving things like click tracking or other actions NOT about analyzing the actual documents.

First Technique: step-wise similarity

I can offer one example that I've actually tested and validated against real data. If you were to gather a number of techniques and rank them along two axes (inherent complexity or ease of implementation, and performance in terms of resolution or predictive accuracy), this technique would be high on the first axis and somewhere near the middle on the second: a simple and effective technique, but one which might underperform against state-of-the-art techniques.

We found that the combination of low-frequency keyword intersection and similarity among readers/viewers of a document is a fairly strong predictor of the document's content. Put another way: if two documents have a similar set of very low-frequency terms (e.g., domain-specific terms like 'decision manifold') and they have similar inbound traffic profiles, that combination is strongly probative of the documents' similarity.

The relevant details:

first filter: low-frequency terms. We parsed a large set of documents to get the term frequency for each. We used this word frequency spectrum as a 'fingerprint', which is common, but we applied an inverse weighting, so that the common terms ('a', 'of', 'the') counted very little in the similarity measure, whereas the rare terms counted a lot (this is quite common, as you probably know).
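That inverse weighting could be sketched as an IDF-weighted cosine similarity (the names and exact weighting here are assumptions, not the answerer's implementation):

```python
import math
from collections import Counter

def rare_term_similarity(tokens_a, tokens_b, doc_freq, n_docs):
    """Cosine similarity where each term is weighted by inverse document
    frequency, so rare terms dominate and stop words count very little."""
    def weights(tokens):
        tf = Counter(tokens)
        return {t: c * math.log(n_docs / (1 + doc_freq.get(t, 0)))
                for t, c in tf.items()}
    wa, wb = weights(tokens_a), weights(tokens_b)
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Here `doc_freq` maps each term to the number of documents containing it, so sharing a rare term contributes far more to the score than sharing 'a' or 'the'.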

Trying to determine whether two documents were similar based on this alone was problematic; for instance, two documents might share a list of rare terms relating to MMOs and still not be similar, because one is directed at playing MMOs and the other at designing them.

second filter: readers. Obviously we don't know who had read these documents, so we inferred readership from traffic sources. You can see how that helped in the example above. The inbound traffic for the MMO player site/document reflected the content, likewise for the document directed to MMO design.

Second Technique: kernel Principal Component Analysis (kPCA)

kPCA is an unsupervised technique (the class labels are removed from the data before it is passed in). At the heart of the technique is just an eigenvector-based decomposition of a matrix (in this case a covariance matrix). The technique handles non-linearity via the kernel trick, which maps the data to a higher-dimensional feature space and then performs PCA in that space. In Python/NumPy/SciPy it is about 25 lines of code.
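A minimal sketch of that idea with an RBF kernel in NumPy (not the answerer's actual script, and with the kernel and `gamma` chosen here purely for illustration):

```python
import numpy as np

def kpca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel: build the kernel matrix, centre it
    in feature space, and project onto the leading eigenvectors."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T  # pairwise squared distances
    K = np.exp(-gamma * d2)
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n  # centred kernel
    vals, vecs = np.linalg.eigh(Kc)                     # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(np.abs(vals[idx]))
```

Feeding the 70-dimensional word-frequency rows described below through something like this yields a low-dimensional embedding that can be scatter-plotted directly.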

The data was collected from very simple text parsing of literary works, in particular most of the published works of these four authors: Shakespeare, Jane Austen, Jack London, and Milton. (I believe, though I'm not certain, that college students commonly take courses in which they are assigned novels by these authors.)

This dataset is widely used in ML and is available from a number of places on the Web.

So these works were divided into 872 pieces (corresponding roughly to chapters in novels); in other words, about 220 different substantial pieces of text for each of the four authors.

Next a word-frequency scan was performed on the combined corpus text, and the 70 most common words were selected for the study, the remainder of the results of the frequency scan were discarded.

Those 70 words were:

[ 'a', 'all', 'also', 'an', 'and', 'any', 'are', 'as', 'at', 'be', 'been',
  'but', 'by', 'can', 'do', 'down', 'even', 'every', 'for', 'from', 'had',
  'has', 'have', 'her', 'his', 'if', 'in', 'into', 'is', 'it', 'its', 'may',
  'more', 'must', 'my', 'no', 'not', 'now', 'of', 'on', 'one', 'only', 'or', 
  'our', 'should', 'so', 'some', 'such', 'than', 'that', 'the', 'their', 
  'then', 'there', 'things', 'this', 'to', 'up', 'upon', 'was', 'were', 'what',
  'when', 'which', 'who', 'will', 'with', 'would', 'your', 'BookID', 'Author' ]

These became the field (column) names. Finally, one data row was prepared for each of the 872 texts (from the truncated word-frequency scan). Here's one of those data points:

[ 46, 12, 0, 3, 66, 9, 4, 16, 13, 13, 4, 8, 8, 1, 0, 1, 5, 0, 21, 12, 
  16, 3, 6, 62, 3, 3, 30, 3, 9, 14, 1, 2, 6, 5, 0, 10, 16, 2, 54, 7, 8,
  1, 7, 0, 4, 7, 1, 3, 3, 17, 67, 6, 2, 5, 1, 4, 47, 2, 3, 40, 11, 7, 5,
  6, 8, 4, 9, 1, 0, 1 ]

In sum, the data comprises 70 dimensions (each dimension is the frequency, or total count, of a particular word in a given text by one of these four authors).

Again, although this data is primarily used for supervised classification (the class labels are there for a reason), the technique I used was unsupervised; put another way, I never showed the class labels to the algorithm. The kPCA algorithm had absolutely no idea what the four different clusters (shown in the plot below) correspond to, nor how each differs from the others; it did not even know how many groups (classes) the data comprised. I just gave it the data, and it partitioned it very neatly into four distinct groups based on an inherent ordering.

The results: [plot showing the 872 texts partitioned into four distinct clusters, one per author]

Again, the algorithm I used here is kPCA. Using Python, NumPy, and Matplotlib, the script that produced these results is about 80 lines of code, covering the IO, data processing, applying kPCA, and plotting the result.

Not much, but too much for an SO post. In any event, anyone who wants this code can get it from my repo. There is also a complete, well-documented kPCA algorithm coded in Python + NumPy in each of these Python packages: shogun ('A Large Scale Machine Learning Toolbox'), sdpy (a set of modules directed at computer vision and machine learning), and mlpy ('Machine Learning in PYthon').

I am planning to implement a basic recommendation system that uses Facebook Connect or similar social networking site API's to connect a user's profile, do an analysis based on tags and use the results to generate item recommendations on my e-commerce site (works similar to Amazon).

I do believe I need to divide parts into such:

  1. Fetching social networking data via APIs (provided the user allows this).

  2. Analyzing this data and generating tokens.

  3. Using those information tokens to make item recommendations on my e-commerce site.

E.g.: I am a fan of the band "The Strokes" on my Facebook account; the system analyzes this and recommends me "The Strokes Live" CD.

For any part (fetching data, making recommendations based on tags, ...), what algorithms and methods would you recommend, or are commonly used?

Good practical books on these kinds of algorithms are:

I have 3 main questions about the algorithms in intelligent web (web 2.0)

Here is the book I'm reading, and I want to understand its algorithms in more depth.

1. People You may follow (Twitter)

How does one determine the results nearest to my interests? Data mining? Which algorithms?

2. How you’re connected feature (Linkedin)

Simply put, the algorithm draws the path between two nodes; say the other person is C: Me -> A, B; A's connections -> C. Is it a brute-force algorithm, or some other graph algorithm? :)

3. Similar to you (Twitter, Facebook)

This algorithm is similar to #1. Does it simply take the max(count) of friends in common (Facebook), or the max(count) of followers in common (Twitter)? Or do they implement some other algorithm? I suspect it is something else, because running a loop like

 counts = {}  # person -> number of connections in common
 for person in contacts:
     counts[person] = common_connections(person)  # hypothetical helper
 return max(counts, key=counts.get)

on every page refresh would be a silly act.

4. Did you mean (Google)

I know they may implement it with a phonetic algorithm, simply Soundex, and here is a talk by Douglas Merrill, Google's VP of Engineering and CIO.

What about the first 3 questions? Any ideas are welcome!


People you may follow

Could be one of many types of recommendation algorithms, maybe collaborative filtering?

How you are connected

This is just a shortest-path algorithm on the social graph. Assuming there is no weight on the connections, it will simply use breadth-first search.
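That path lookup might be sketched as follows (the graph shape is an assumption: a dict mapping each person to the set of their connections):

```python
from collections import deque

def connection_path(graph, start, goal):
    """Breadth-first search on an unweighted social graph; returns the
    shortest chain of people from start to goal, or None."""
    if start == goal:
        return [start]
    seen = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for friend in graph.get(path[-1], ()):
            if friend in seen:
                continue
            if friend == goal:
                return path + [friend]
            seen.add(friend)
            queue.append(path + [friend])
    return None  # no connection at all
```

Because BFS explores the graph level by level, the first path that reaches the goal is guaranteed to be a shortest one.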

Similar to you

Simply a re-arrangement of the data set using the same algorithm as People you may follow.

Check out the book Programming Collective Intelligence for a good introduction to the type of algorithms that are used for People you may follow and Similar to you, it has great python code available too.

People who you may follow

You can use the factors based calculations:

factorA = getFactorA(); // say double(0.3)
factorB = getFactorB(); // say double(0.6)
factorC = getFactorC(); // say double(0.8)

result = (factorA+factorB+factorC) / 3 // double(0.5666666666666667)
// if result is more than 0.5, you show this person

So, say, in the case of Twitter, "People who you may follow" can be based on the following factors (User A is the user viewing this feature; there may be more or fewer factors):

  • Similarity between frequent keywords found in User A's and User B's tweets
  • Similarity between the profile descriptions of both users
  • Similarity between the locations of User A and User B
  • Do the people User A follows also follow User B?

So where does the "People who you may follow" list come from? It probably comes from a combination of people with a high number of followers (they are probably celebrities, alpha geeks, famous products/services, etc.) and the people whom [the people User A follows] are following.

Basically there's a certain level of data mining to be done here, reading the tweets and bios, calculations. This can be done on a daily or weekly cron job when the server load is least for the day (or maybe done 24/7 on a separate server).

How are you connected

There is probably some smart work here to make you feel that loads of brute force has been done to determine the path. However, after some surface research, I find that it is simple:

Say you are User A; User B is your connection; and User C is a connection of User B.

In order for you to visit User C, you need to visit User B's profile first. By visiting User B's profile, the website already saves the information indicating that User A is at User B's profile. So when you visit User C from User B, the website immediately tells you that 'User A -> User B -> User C', ignoring all other possible paths.

This is the maximum depth: at User C, User A cannot go on to look at User C's connections until User C becomes User A's connection.

Source: observing LinkedIn

Similar to you

It's the exact same thing as #1 (People you may follow), except that the algorithm reads in a different list of people. The list of people that the algorithm reads in is the people whom you follow.

Did you mean

Well, you got it right there, except that Google probably uses more than just Soundex. There's language translation, word replacement, and many other algorithms used in Google's case. I can't comment much on this, because it will probably get very complex and I am not an expert in language processing.

If we research a little more into Google's infrastructure, we can find that Google has servers dedicated to Spelling and Translation services. You can get more information on Google platform at


The key to computationally intensive algorithms is caching. Once you cache the result, you don't have to recompute it on every page load. Google does it, Stack Overflow does it (on most of the pages with lists of questions), and Twitter, not surprisingly, does too!

Basically, algorithms are defined by developers. You may use others' algorithms, but ultimately, you can also create your own.


I have 5 items in a table [1], each item has 4 attributes (red, green, blue, yellow).
Each attribute can be given a score between 1 and 9 [2].

When performing a search on my website users can specify how relevant each attribute is to the search results by giving each attribute a score between 1 and 9.

What algorithm should I use to calculate and order the results based on the users preference?


[1] - CREATE TABLE items( id INT NOT NULL AUTO_INCREMENT , name VARCHAR(128) , red INT , green INT , blue INT , yellow INT , PRIMARY KEY (id) );

[2] - INSERT INTO items (NAME, red, green, blue, yellow) VALUES ('Random 1', 4, 1, 9, 4), ('Random 2', 1, 1, 2, 9), ('Random 3', 5, 7, 6, 3), ('Random 4', 2, 2, 8, 1);

Sorry, but I don't have a direct answer. This is a very interesting topic. You can use something related to Euclidean distance, or Pearson correlation. You can find more in books related to Collective Intelligence.

Of course it's more difficult to implement things like these, but your results will be much more accurate and precise. I recommend these books:

Algorithms of the Intelligent Web

Programming Collective Intelligence: Building Smart Web 2.0 Applications
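The Euclidean-distance idea mentioned in the answer, applied to the sample rows from the question (a sketch; rows are shown as dicts rather than a MySQL result set):

```python
import math

# sample rows from the question's INSERT statement
items = [
    {'name': 'Random 1', 'red': 4, 'green': 1, 'blue': 9, 'yellow': 4},
    {'name': 'Random 2', 'red': 1, 'green': 1, 'blue': 2, 'yellow': 9},
    {'name': 'Random 3', 'red': 5, 'green': 7, 'blue': 6, 'yellow': 3},
    {'name': 'Random 4', 'red': 2, 'green': 2, 'blue': 8, 'yellow': 1},
]

def rank_by_preference(rows, prefs):
    """Order rows by Euclidean distance between their attribute scores
    and the user's 1-9 preference vector (closest match first)."""
    attrs = ('red', 'green', 'blue', 'yellow')
    def distance(row):
        return math.sqrt(sum((row[a] - prefs[a]) ** 2 for a in attrs))
    return sorted(rows, key=distance)
```

For instance, a user who cares most about yellow (preferences 1, 1, 2, 9) would see 'Random 2' ranked first.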

We want to add a comments/reviews feature to our website's plugin gallery, so users can vote up/down a particular plugin and leave an optional short comment about what they liked or didn't like about it.

I'm looking for inspiration, ideally a good implementation elsewhere on the web which isn't annoying to end users, isn't impossibly complex to develop, and which enables users to see both good and bad comments side-by-side, like this:

Like: 57 votes                           Dislike: 8 votes
---------------------------------        --------------------------------
"great plugin, saved me hours..."        "hard to install"

"works well on MacOS and Ubuntu"         "Broken on Windows Vista with 
                                             UAC enabled"
"integrates well with version 3.2"      

Anyone know a site which does something like this particularly well? I don't need source code (since implementation will be simple), just looking for UI inspiration and best practices.

I'll accept the answer which refers me to the site which, in my biased opinion, is the best mix of usable (for end users reading comments) and addictive (for active users leaving comments).

Amazon does something similar: when you list customer reviews, it shows the most helpful positive and negative review at the top, before listing everything else. For example:

To do this, though, the community needs the ability to flag a review as helpful or unhelpful, which might be more than you're looking to implement. I like your idea, though - rather than sorting through a large list of reviews, I like seeing a quick snapshot of what people liked or didn't like about something, even if one of those two opinions is relatively rare.

You might want to check out these books for algorithms:

Are there standard rules engines or algorithms around AI that would predict a user's taste in a particular kind of product, like clothes? I know it's something every e-commerce website would kill for. But I am looking for theoretical patterns defined out there which would help make that prediction better, if not perfectly accurate.

Two books that cover recommender systems:

  • Programming Collective Intelligence: Python; does a good job explaining the algorithms, but doesn't provide enough help IMO in terms of understanding how to scale.
  • Algorithms of the Intelligent Web: Java; harder to follow, but also covers using persistence (in this case MySQL) to facilitate scaling, and identifies areas in the example code that will not scale as-is.

There are basically two ways of approaching the problem: user-based or item-based. Netflix appears to use the former, while Amazon uses the latter. Typically the user-based approach requires more time and/or processing power to generate recommendations, because you tend to have more users than items to consider.
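To make the item-based flavor concrete, here is a hedged sketch: compute similarity between items from their user-rating vectors, then recommend the items most similar to ones a user already rated highly. The ratings matrix, item names, and the choice of cosine similarity are all illustrative assumptions, not a prescription from either book.

```javascript
// Item-based similarity: two items are similar when the same users
// rated them similarly. Rows are items, columns are users (0 = unrated).

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical ratings: item -> one rating per user
const ratings = {
  shirtA: [5, 4, 0, 1],
  shirtB: [4, 5, 0, 2],
  jeans:  [0, 1, 5, 4],
};

// Find the item whose rating vector is closest to the target's
function mostSimilar(target) {
  let best = null, bestScore = -Infinity;
  for (const [name, vec] of Object.entries(ratings)) {
    if (name === target) continue;
    const s = cosineSimilarity(ratings[target], vec);
    if (s > bestScore) { bestScore = s; best = name; }
  }
  return best;
}

// Usage: users viewing shirtA would be shown its nearest neighbor
console.log(mostSimilar("shirtA")); // → shirtB
```

The item-item similarities change slowly, so they can be precomputed offline and stored, which is a big part of why the item-based approach tends to scale better.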

I have an application where users can:

  1. Write reviews about products
  2. Add comments to products
  3. Up / Down vote reviews
  4. Up / Down vote comments

Every Up/Down vote is recorded in a db table.

What I want to do now is create a ranking of the most active users in the last 4 weeks. Of course, good reviews should be weighted more heavily than good comments. But, for example, 10 good comments should also be weighted more than just one good review.


// reviews created in recent 4 weeks
//format: [ upVoteCount, downVoteCount ]
var reviews = [ [120,23], [32,12], [12,0], [23,45] ];

// comments created in recent 4 weeks
// format: [ upVoteCount, downVoteCount ]
var comments = [ [1,2], [322,1], [0,0], [0,45] ];

// create weight vector
// format: [ reviewWeight, commentsWeight ]
var weight = [0.60, 0.40];

// signature: activities..., activityWeight
var userActivityScore = score(reviews, comments, weight);
// ... update user table ...

List<Users> users = "from users u order by u.userActivityScore desc";

What would a fair scoring function look like? What could an implementation of the score() function look like? How could a weight g be added to the function so that reviews are weighted more heavily? And how would such a function change if, for example, votes for pictures were added?

These kinds of algorithms can be quite complex. You might want to take a look into these books:
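As a starting point, here is one hedged sketch of what score() might look like. The net-vote-ratio formula, the +1 smoothing term, and the decision to sum per-item scores (so that many good comments can outweigh one good review) are all assumptions, not a canonical method.

```javascript
// Score a single [upVotes, downVotes] pair: net votes as a fraction of
// total votes. The +1 avoids division by zero and damps items that have
// only a handful of votes.
function voteScore([up, down]) {
  return (up - down) / (up + down + 1);
}

// Weighted sum over both activity types; the weights answer the
// "reviews count more than comments" requirement.
function score(reviews, comments, [reviewWeight, commentWeight]) {
  const sum = (items) => items.reduce((acc, v) => acc + voteScore(v), 0);
  return reviewWeight * sum(reviews) + commentWeight * sum(comments);
}

// Sample data from the question
const reviews = [[120, 23], [32, 12], [12, 0], [23, 45]];
const comments = [[1, 2], [322, 1], [0, 0], [0, 45]];
const weight = [0.6, 0.4];

const userActivityScore = score(reviews, comments, weight);
console.log(userActivityScore);
```

Adding pictures (or any further activity type) would mean one more list and one more entry in the weight vector; a downvoted activity contributes negatively, so trolling drags the ranking down rather than inflating it.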

I'd like to build a recommendation system on my site, so people can "like" certain things and then be shown other things that people with similar likes also liked.


  • Amazon - shows items other people liked as well.
  • Newegg - shows items other people bought together.
  • 9gag - shows pictures liked by other people.

I'm new to this sort of software design and literally have no idea where to start. I would love some suggestions and reading material, or even better, a pattern that has already been cooked up by smarter people.

My items will be either "liked" or "neutral" (the user didn't do anything). There is no fine-grained rating system for items; I assume this simplifies the needed algorithm.

My platform is C# with ASP.Net MVC3 with MSSQL 2008 as the backend; if that is relevant to the discussion.

You may want to read a book which describes this problem space like Algorithms of the Intelligent Web or the older Programming Collective Intelligence.
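Since the items here are only "liked" or neutral, one simple starting point (an assumption for illustration, not a method prescribed by either book) is like co-occurrence: "users who liked this also liked...". All data below is made up.

```javascript
// userId -> list of liked item ids (hypothetical)
const likes = {
  alice: ["a", "b", "c"],
  bob:   ["a", "b"],
  carol: ["a", "c"],
  dave:  ["b", "d"],
};

// For a given item, count how often each other item was liked by the
// same users, and return the others sorted by that co-occurrence count.
function alsoLiked(item) {
  const counts = {};
  for (const liked of Object.values(likes)) {
    if (!liked.includes(item)) continue;
    for (const other of liked) {
      if (other === item) continue;
      counts[other] = (counts[other] || 0) + 1;
    }
  }
  return Object.entries(counts)
    .sort((x, y) => y[1] - x[1])
    .map(([id]) => id);
}

console.log(alsoLiked("b")); // item "a" ranks first: two co-likes
```

On MSSQL this reduces to a self-join on the likes table grouped by the co-liked item, so the counts can be precomputed in the database rather than in application code.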

Recommendation algorithms usually do not follow a particular set of design patterns; they vary based on your application's domain.

If you are not aware of the concepts behind these kinds of algorithms: they are built on ideas like collective intelligence, machine learning, crowdsourcing, etc.

Programming Collective Intelligence is by far the best reading you can get. This book should give you a very good perspective on how to build an algorithm for your requirements. Unfortunately, it uses Python for its code examples, but the concepts are still very helpful.

I'm trying to build an ad system. Some time ago I wanted to build a web crawler, and found tons of papers and documentation.

But for ads I haven't found anything useful yet. Everything is related to "CPC, CPM" and the like (marketing related); nothing architectural or technical that can help me build the real system and write some code.

Can you help me with some links or references?

Thanks a lot.

While it doesn't talk about advertising directly, the book Algorithms of the Intelligent Web discusses several topics that I would imagine would be very valuable in setting up an advertising system - especially concerning how/where to place ads and how to target viewers (provide ad content tailored to a viewer or to a specific webpage with a certain class of viewers in mind). These include:

  • Search algorithms
  • Suggestion algorithms
  • Recommendation algorithms
  • Clustering algorithms
  • Classification algorithms
  • Developing dynamic content based on users' input/behavior
  • etc.

You may want to also check out Collective Intelligence in Action. It discusses things like:

  • Architectures for intelligent/social web,
  • Extracting intelligence from tags and keywords
  • Content architectures
  • Detecting valuable phrases
  • Recommendation engines
  • etc.

I would like to ask about the implementation of intelligent product recommendation.

As I need to build an online shopping system using JSP and MySQL, I want to implement a function

which automatically recommends some related products when the user views one product's details.

However, I don't have any prior experience with artificial intelligence.

There are many resources on the Internet, but I don't know which are good for learning.

Therefore, could anyone suggest some useful websites for studying this type of programming (that is, AI)?

Thanks very much!

This book will definitely help you: "Algorithms of the Intelligent Web".
It has some nice explanations of how these techniques work and how you can implement them.

Can anybody explain a simple search engine to me?

What should it look like, what components should it have, and how does it work?

I know there is a web crawler, there is indexing, and there is querying. What part of it is the most difficult to do?

Where is the PageRank algorithm used: in crawling, or in querying, i.e. showing the results? And what is indexing?

I've read some material, but it's a little bit complicated.

What I would like to do is create a simple Java search engine. It doesn't matter what algorithm is used; I have breadth-first so far, which I think is the simplest. I have a simple web crawler: I enter a seed URL and a limit on the number of pages to search. First the crawler checks the link and robots.txt, and if allowed it downloads the first page, extracts the URLs from that page, and adds them to a list. When the crawler finishes extracting URLs from the first page, it takes the first URL in the list, extracts its links, and so on.

What about indexing?

I really don't understand this part. If I want full-page indexing, how should I do that? Simply add the full text of the downloaded page to a database?

Indexing is my most important part to do so please explain this part.

Thanks in advance!
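On the indexing question: at its simplest, an inverted index maps each term to the set of documents containing it, so a query looks up terms instead of scanning the full text of every stored page. A minimal sketch (naive lowercase word tokenization; real engines add stemming, stop words, term positions, and ranking weights):

```javascript
// Inverted index: term -> set of document ids containing that term
const index = new Map();

function addDocument(docId, text) {
  const terms = text.toLowerCase().split(/\W+/).filter(Boolean);
  for (const term of terms) {
    if (!index.has(term)) index.set(term, new Set());
    index.get(term).add(docId);
  }
}

// Query: documents containing ALL query terms (set intersection)
function search(query) {
  const terms = query.toLowerCase().split(/\W+/).filter(Boolean);
  let result = null;
  for (const term of terms) {
    const docs = index.get(term) || new Set();
    result = result === null
      ? new Set(docs)
      : new Set([...result].filter((d) => docs.has(d)));
  }
  return [...(result || new Set())];
}

// Index the pages the crawler downloaded (sample texts made up)
addDocument(1, "Java search engine tutorial");
addDocument(2, "Breadth-first web crawler in Java");
addDocument(3, "Cooking recipes");

console.log(search("java crawler")); // → [ 2 ]
```

So rather than storing the raw page text per row, you store (term, docId) pairs; a database table with an index on the term column is a perfectly workable first implementation.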

The book Algorithms of the Intelligent Web has an excellent introduction to the PageRank algorithm and a nice walk-through of implementing it yourself. I suggest you get a copy and work through Chapter 2 to get a good understanding of this space.
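For a feel of what that involves, here is a minimal power-iteration PageRank sketch over a tiny made-up link graph (0.85 is the conventional damping factor; the graph and iteration count are purely illustrative, and this ignores dangling pages with no outlinks):

```javascript
// Tiny link graph: page -> pages it links to (hypothetical)
const links = { a: ["b", "c"], b: ["c"], c: ["a"] };
const pages = Object.keys(links);
const damping = 0.85;

// Start with uniform rank, then repeatedly redistribute each page's
// rank along its outlinks, mixed with a uniform "random jump" term.
let rank = Object.fromEntries(pages.map((p) => [p, 1 / pages.length]));

for (let iter = 0; iter < 50; iter++) {
  const next = Object.fromEntries(
    pages.map((p) => [p, (1 - damping) / pages.length])
  );
  for (const [page, outLinks] of Object.entries(links)) {
    const share = rank[page] / outLinks.length;
    for (const target of outLinks) next[target] += damping * share;
  }
  rank = next;
}

console.log(rank); // c ranks highest: it is linked from both a and b
```

PageRank belongs to the indexing/offline stage: the scores are computed over the crawled link graph ahead of time, then combined with per-term relevance at query time.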