The thing is, we know that to merge two sorted lists we take one word from each list and compare them character by character. Once a mismatch is encountered, the lexicographic order of the mismatching characters decides which word goes first and which goes second.
a != b at the third character, therefore aaaaax < aabax, since a is lexicographically smaller than b.
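A minimal Python sketch of that comparison and of the plain two-pointer merge (the function names are my own, not from any library):

```python
def lex_compare(a, b):
    """Compare two words character by character.
    Returns -1 if a < b, 1 if a > b, 0 if equal; when one word is a
    prefix of the other, the shorter word comes first."""
    for ca, cb in zip(a, b):
        if ca != cb:
            return -1 if ca < cb else 1
    # all compared characters matched; decide by length
    return (len(a) > len(b)) - (len(a) < len(b))

def merge_sorted(A, B):
    """Standard two-pointer merge of two sorted word lists."""
    out, i, j = [], 0, 0
    while i < len(A) and j < len(B):
        if lex_compare(A[i], B[j]) <= 0:
            out.append(A[i]); i += 1
        else:
            out.append(B[j]); j += 1
    out.extend(A[i:]); out.extend(B[j:])
    return out

print(lex_compare("aaaaax", "aabax"))  # -1: 'a' < 'b' at position 2
print(merge_sorted(["aaaaax", "aabax"], ["aab", "zzz"]))
# ['aaaaax', 'aab', 'aabax', 'zzz']
```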
Now if the two sets each contain n characters, that means we would need O(2n) comparisons to merge the two sets. Let us assume that we know the longest common prefix of each word (string) in each set. That is, let us assume we possess a priori knowledge that the first word in set A has a longest common prefix of 3, the second 3, the third 2, and so on, where the longest common prefix is determined only with respect to the other words in set A, not set B. What would be the cleverest way to merge these two sets?
Now if the two sets have an equal number of characters (n), that means we would need O(2n) comparisons to merge the two sets.
I don't quite agree. I think the complexity is determined by the total number of words, not the number of characters. The complexity will be O(N+M) where N and M are the number of words in A and B respectively.
The number of characters enters the picture only indirectly. The actual runtime of the algorithm will depend on how many characters must be compared on average to establish the order between pairs of words. It will be a data dependent constant which varies with what the words actually look like.
Spontaneously, I don't think the longest common prefix alone will be enough to lower the complexity. I think that in addition you need to treat consecutive words with the same longest common prefix as groups. Then, if an A prefix is enough to establish the order in relation to a B word, the same order holds for all words in that prefix group. It means that sometimes a whole A group can be ordered with just one comparison, and that will lower N (and if B words are grouped too, that will lower M). But the efficiency of such grouping will depend on the actual words and thus be data dependent, so the complexity is still O(N+M).
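A rough Python sketch of that grouping idea, assuming a group is a run of consecutive A words sharing a common prefix (all names here are made up, and this is illustrative rather than tuned):

```python
def lcp(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def group_merge(A, B):
    """Merge sorted word lists A and B, emitting a whole run of A words
    in one step whenever their shared prefix alone decides the order
    against the current B word."""
    out, i, j = [], 0, 0
    while i < len(A) and j < len(B):
        b = B[j]
        # shared prefix of A[i] and its successor defines the group
        L = lcp(A[i], A[i + 1]) if i + 1 < len(A) else 0
        prefix = A[i][:L]
        if L > 0 and prefix != b[:L]:
            # the prefix already decides the order versus b
            if prefix < b[:L]:
                # the whole run is smaller than b: emit it in one go
                k = i
                while k < len(A) and A[k].startswith(prefix):
                    k += 1
                out.extend(A[i:k]); i = k
            else:
                out.append(b); j += 1
        else:
            # prefix does not decide; ordinary word-by-word comparison
            if A[i] <= b:
                out.append(A[i]); i += 1
            else:
                out.append(b); j += 1
    out.extend(A[i:]); out.extend(B[j:])
    return out
```

The key invariant: if every word in the run starts with `prefix` and `prefix` differs from the B word within its first `L` characters, that mismatch decides the comparison identically for every word in the run.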
Finally, note that instead of grouping according to the longest common prefix, a full-blown prefix tree can be used. The A list (and optionally the B list too) would be turned into a prefix tree where each node holds the common prefix of all its subnodes. When the order of an A tree node in relation to a B word is established, the order of all subnodes (and the words they represent) is known to be the same. A prefix tree represents a more sophisticated grouping, if you will (but not necessarily a more efficient one).
Last edited by nuzzle; November 9th, 2012 at 04:15 AM.
I took some time to think about what you have written, and after considering the bigger problem that my question is a part of, I have concluded that I am stuck and have no real solution. So, once more, I ask the question and explain the concrete problem. First of all, you were right in all your claims above; I was a bit careless when writing it down. Sorry.
So what I am trying to do is the following. I am a poor software developer working for a poor company that cannot buy new equipment, but we need to keep the company running and keep our services available to customers. We have several computing machines, each with 32GB of RAM. One of our applications needs to compute a suffix array (SA) for a given input. If the input is small (as the initial design of the software predicted), we have no problem, but if the input becomes larger, then 32GB of memory is simply not enough.

So I was thinking about applying a divide-et-impera (or, as some translate it, divide-and-conquer) strategy. The fastest suffix array construction algorithms today run in n log(n) time on average (or even n on average; there are some tricks for reducing the complexity for certain small alphabets, but I cannot profit from that). When the input is large, that requires a substantial amount of memory, rising up to several hundred GBs. So I thought to split my input into pieces and then create an SA for every chunk.

Say my input is of size N. If I split it into 2, then sorting each half should take N/2 log(N/2) time, so the total time would be 2(N/2(log(N) - log(2))) = 2(N/2(log(N) - 1)) = N log(N) - N (please correct me if I am wrong). This produces two SAs that I have to merge. In the previous post I said that my naive algorithm would take 2n = N steps to merge the two sets (I didn't mention that these are suffix array sets, so M = N and N + N = 2N). Therefore N log(N) - N + N = N log(N), which got me nowhere with respect to speed (asymptotically speaking), but it decreased the memory requirements and enabled a simple parallelization (however, with high constants associated with the process that in fact decreased the overall speed).

What I am looking at right now is one more piece of information: in many cases users are artificially increasing the size of the input by concatenating the string to its reverse and running it through the program.
Does anyone know whether there is any way to obtain the SA of the string's reverse given the SA of the string, so that I don't need to redo the sorting on the reversed sequence?
Thank you for reading the post and for any (and I mean ANY) input.
Why not just write a small disk-backed array class/object? Page the regions of the array you are currently working with into some appropriately sized buffers. Then you can just do a regular merge-sort, abstracting away the memory requirements.
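Something along these lines, as a bare-bones Python sketch (fixed-size integer records, a single cached page; a real implementation would want an LRU cache of several pages and a persistently open file handle):

```python
class DiskArray:
    """Minimal disk-backed array of fixed-size unsigned integers.
    Keeps one page in memory and writes it back when another page is
    needed, so only PAGE * ITEM bytes are resident at a time."""
    ITEM = 8          # bytes per entry (64-bit unsigned int)
    PAGE = 1024       # entries per page

    def __init__(self, path, length):
        self.path, self.length = path, length
        with open(path, "wb") as f:
            f.truncate(length * self.ITEM)   # preallocate the file
        self.page_no, self.page, self.dirty = -1, None, False

    def _flush(self):
        # write the cached page back to disk if it was modified
        if self.dirty and self.page_no >= 0:
            with open(self.path, "r+b") as f:
                f.seek(self.page_no * self.PAGE * self.ITEM)
                f.write(self.page)
            self.dirty = False

    def _load(self, page_no):
        if page_no == self.page_no:
            return                           # already cached
        self._flush()
        with open(self.path, "rb") as f:
            f.seek(page_no * self.PAGE * self.ITEM)
            self.page = bytearray(f.read(self.PAGE * self.ITEM))
        self.page_no = page_no

    def __getitem__(self, i):
        self._load(i // self.PAGE)
        off = (i % self.PAGE) * self.ITEM
        return int.from_bytes(self.page[off:off + self.ITEM], "little")

    def __setitem__(self, i, value):
        self._load(i // self.PAGE)
        off = (i % self.PAGE) * self.ITEM
        self.page[off:off + self.ITEM] = value.to_bytes(self.ITEM, "little")
        self.dirty = True

    def close(self):
        self._flush()
```

With `__getitem__`/`__setitem__` in place, a regular merge routine can index the array as if it were in memory while the working set stays bounded.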