Click to See Complete Forum and Search --> : How to do ssociated tag search as in delicious.com


parkurm
February 16th, 2010, 12:23 AM
In delicious.com, users are able to bookmark website links with various meaningful tags, such as for Sorting Algorithm Animations, one can tag it as "algorithm", "sort", and later (maybe after months) can find this previously tagged bookmark by searching either "sort" or "algorithm". It greatly improves our way of keeping and organizing knowledge.

Also in delicious.com, when you search for a tag, say "algorithm", it is able to show all the associated tags to "algorithm". For example, the user has several bookmarks, and they are tagged as following:

Link1: algorithm sort
Link2: algorithm heap stack
Link3: algorithm heap fantastic
Link4: database architecture

Then, the associated tags to "algorithm" will be "sort", "heap“, ”stack", "fantastic" (all tags that showed up together with "algorithm" in bookmarks).

Further, we can also find associated tags to multiple tags. For example, when I search "algorithm" and "heap", the associated tags to these words will be "stack" and "fantastic".

So, associated tags are tags that showed up together in some bookmarks with the given tags.

The problem is how to implement this associated tag searching.

The naive way will be first searching all the links that have the given tags (when I search for "algorithm" and "heap", it will return Link2 and Link3), then by looking through all the returned links, I get the tags that are associated to "algorithm" and "heap".

This naive way surely works, but it's very time inefficient.

Also, we cannot use indexing of tags. Say there're 3 links, each with 2 tags:

L1: t1, t2
L2: t1, t3
L3: t2, t3

Then, if I search for associated tags to both "t1, t2", I'll get result as "t3", since t3 is indexed by both t1 in L2, and indexed by t2 in L3. But in fact there's no link that contains all t1,t2 and t3, so according to definition t3 should not appear in the result of associated tags.

D_Drmmr
February 16th, 2010, 11:13 AM
The problem is how to implement this associated tag searching.

That depends on the underlying data structure and the programming language.

The naive way will be first searching all the links that have the given tags (when I search for "algorithm" and "heap", it will return Link2 and Link3), then by looking through all the returned links, I get the tags that are associated to "algorithm" and "heap".

This naive way surely works, but it's very time inefficient.

That's at worst linear in terms of the the number of links. If that is "very inefficient" please indicate what kind of asymptotic performance you are looking for.
I believe there is better performance to be gained in the lookup of the tags. If you create a hash table of each tag, referring to each link that has the tag, you'll have constant lookup of each tag on average. Then to find the common subset of all tags you are looking for you can use an algorithm with logarithmic time in terms of the maximum number of links associated with one tag.

Also, we cannot use indexing of tags. Say there're 3 links, each with 2 tags:

L1: t1, t2
L2: t1, t3
L3: t2, t3

Then, if I search for associated tags to both "t1, t2", I'll get result as "t3", since t3 is indexed by both t1 in L2, and indexed by t2 in L3. But in fact there's no link that contains all t1,t2 and t3, so according to definition t3 should not appear in the result of associated tags.
I don't see what your example has to do with indexing. A flawed algorithm won't work regardless of the look-up mechanism you use.