Click to See Complete Forum and Search --> : find the occurrence of each small string in the larger one


wael.salman
June 17th, 2009, 09:20 AM
Given that you have one string of length N and M small strings of length L . How do you efficiently find the occurrence of each small string in the larger one?

Solution suggested:

You can just sort the M strings in O(l*Mlog(M)).An additional l figures because comparison function of strings of length l is of complexity O(l).
Once these M strings are sorted, we can simply do a binary search on them for each of the N-l+1 continuous substrings of big string. The complexity of this search for each such substring is O(l*logM).
So the complexity of this procedure is O(l*MlogM)+O((N-l+1)*(l*logM)).
For N>>l this reduces to O(l*MlogM)+O(N*l*log(M).
This can be reduced to O((M+N)*l*log(M)).


I believe there is always better solutions... if you have one - just drop and explain it please.


Thank you

binyo66
June 29th, 2009, 04:03 PM
Check out Knuth-morris-Pratt scan algorithm. I think it does O(n+L) in worst case scenario.

wael.salman
June 29th, 2009, 04:16 PM
I will check that.

Thank you for the hint :-)

prime1999
June 30th, 2009, 12:57 AM
KMP by itself would not work on this... would be O((N+L)M)...
A trie (prefix tree) would do for this (O(NL)).

binyo66
June 30th, 2009, 03:47 AM
Why KMP wont work?
BTW, Considering the best case in KNP and worst case in trie. Assume M=ABCDEFGH, and we want to search ABCDEFGH, KNP is faster the trie?

wael.salman
June 30th, 2009, 04:17 AM
How will you use Trie here??

binyo66
June 30th, 2009, 05:06 AM
It is based on binary tree as well, u can find the information and algo in wikipidia (seach prefix tree).

prime1999
June 30th, 2009, 05:14 AM
The point is... there are M given short strings!!

binyo66
June 30th, 2009, 08:07 AM
R u saying m cant be 1 ? How do you calculate O(n+L)m ? I think it depends on how far the distance between the 1st and 2nd occurance in KNP. We dont search from beginning again after find the 1st occurance, but we continue till the end. If the 2nd is next to 1st it still beat trie. If however the next occurance is far, trie might beat KNP.

prime1999
July 1st, 2009, 06:13 AM
There are M small strings. KMP is to find ONE string within another! "M small strings" and NOT "M, a small string".