Finding most repeated sequence


vpascual
November 29th, 2008, 10:04 AM
Dear all,

I have computed the user sessions of a website. That is, I can represent each user's behaviour as a sequence of requested pages.
What I would like now is to recognise the most frequently repeated sequences of pages. Can you give me hints on the best way to implement this? Is there an open source library that would help me code this? (I'm programming in Java.)

I guess using a tree structure will help. However, if I end up with a weighted tree where each edge carries a frequency, i.e. the number of user sequences that have passed through it, how can I discover the most repeated path?

Thank you in advance!

pm_kirkham
November 30th, 2008, 05:50 AM
Why do you want this information?

It won't be a tree, simply because people don't always browse linearly - it is, after all, a web of hypertext not a linear document. So expect branches and cycles (go to a forum index, open some posts to read in tabs, read the posts, reply to some, go back to forum index, repeat).

So it could be quite a complex path, more than can be represented by just incrementing the weight on each link/edge between pages/nodes.

On the other hand, a simple Markov model might be all you need: for a given page, what is the most likely transition to the next page? That might tell you whether users can find the links between your pages rather than relying on Google, but it isn't quite the same as a detailed record of how a user browses the site.
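
As a rough illustration of the Markov idea, here is a minimal Java sketch that counts page-to-page transitions and looks up the most frequent successor of a page. It assumes pages are identified by integer IDs and that each session is a list of those IDs; the class and method names (TransitionCounts, count, mostLikelyNext) are made up for this example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: first-order transition counts over user sessions.
class TransitionCounts {

    // counts.get(from).get(to) = how often a session moved directly from 'from' to 'to'
    static Map<Integer, Map<Integer, Integer>> count(List<List<Integer>> sessions) {
        Map<Integer, Map<Integer, Integer>> counts = new HashMap<>();
        for (List<Integer> session : sessions) {
            for (int i = 0; i + 1 < session.size(); i++) {
                counts.computeIfAbsent(session.get(i), k -> new HashMap<>())
                      .merge(session.get(i + 1), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Most frequent successor of 'page', or null if no session ever left that page.
    // Ties are broken arbitrarily (whatever the map iterates first).
    static Integer mostLikelyNext(Map<Integer, Map<Integer, Integer>> counts, int page) {
        Map<Integer, Integer> next = counts.get(page);
        if (next == null) {
            return null;
        }
        Integer best = null;
        int bestCount = 0;
        for (Map.Entry<Integer, Integer> e : next.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sessions, only for illustration
        List<List<Integer>> sessions = List.of(
                List.of(1, 3, 25, 4),
                List.of(1, 3, 22),
                List.of(2, 3, 25));
        Map<Integer, Map<Integer, Integer>> counts = count(sessions);
        System.out.println(mostLikelyNext(counts, 3)); // prints 25
    }
}
```

Dividing each count by the total number of transitions out of a page would give estimated transition probabilities, if those are needed rather than raw counts.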

Zachm
November 30th, 2008, 08:18 AM
If all you are interested in is the series of links a user clicked while browsing the site, not the actual activity the user performed, you can code each page (or link, if that's what you're interested in) with a distinct ID, like 1, 2, 3, ... and so on.
Now you can code each user's path (even a circular path) through the website as a string that may look like:
"1,3,25,4,4,5,1,3,22,25"

Each path has a unique encoding, so you can insert each user's path into a hash where the keys are the paths and the values are the number of times each path was used. If you are interested in all of the sub-paths as well, you can insert them into the hash too. The frequency of a path is then simply the number of times it was walked, i.e. its value in the hash.
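
A minimal Java sketch of the hash idea, assuming each session is already a list of integer page IDs. For simplicity it uses the list of IDs itself as the key instead of a comma-separated string (the two are equivalent for counting purposes), and it counts every contiguous sub-path of at least minLen pages. The names SubPathCounter, countSubPaths and mostRepeated are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts contiguous sub-paths across all sessions.
class SubPathCounter {

    // Key: a sub-path (list of page IDs); value: how many times it occurred.
    // A sub-path that occurs twice inside one session is counted twice.
    static Map<List<Integer>, Integer> countSubPaths(List<List<Integer>> sessions, int minLen) {
        Map<List<Integer>, Integer> counts = new HashMap<>();
        for (List<Integer> session : sessions) {
            for (int start = 0; start < session.size(); start++) {
                for (int end = start + minLen; end <= session.size(); end++) {
                    // Copy the subList view so the key is independent of the session list
                    counts.merge(new ArrayList<>(session.subList(start, end)), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    // Sub-path with the highest count, or null if the map is empty.
    // Ties are broken arbitrarily.
    static List<Integer> mostRepeated(Map<List<Integer>, Integer> counts) {
        List<Integer> best = null;
        int bestCount = 0;
        for (Map.Entry<List<Integer>, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sessions, only for illustration
        List<List<Integer>> sessions = List.of(
                List.of(1, 3, 25, 4),
                List.of(1, 3, 25),
                List.of(2, 3, 25, 4));
        Map<List<Integer>, Integer> counts = countSubPaths(sessions, 2);
        List<Integer> best = mostRepeated(counts);
        System.out.println(best + " seen " + counts.get(best) + " times"); // prints [3, 25] seen 3 times
    }
}
```

Note that a session of n pages has on the order of n^2 sub-paths, so for long sessions it is probably worth capping the maximum sub-path length as well.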

Regards,
Zachm