I am developing concurrent algorithm for block-sorting, ie something like algorithm used in bzip2. For block-sorting we can induce sorted order of some substrings from sorted order of another substrings, exploiting the fact that we are sorting substring form the same base substring.
In my program I am using Copy Transform, invented by Julian Seward, bzip2 author. I'll describe how this algorithm works in case of suffices. Some terminology first:
B_a is big bucket, holding all suffices starting from letter a,
b_ab is small bucket, holding all suffices starting from letters "ab"
Having sorted B_d bucket, I can induce sorted order of buckets {b_ad, b_bd, b_cd, b_ed, ..., b_zd} in a single pass over that big bucket. That's how it's done in original Copy Transform. Later, Seward notices that buckets of type b_dd can be induces from buckets {b_da, b_db, b_dc, d_de, ..., d_dz} (what greatly speed up his bzip2 on some input data).
My problem is to compute optimal sequence of inducing buckets, ie buckets are of different sizes, and I want to minimize the total size of buckets, which I have to sort using comparison sort, to induce the rest.
Where:
- Copy ascending is a normal version of Copy Transform, where big buckets are sorted from smallest to biggest,
- Copy simple-ratio, is a version of Copy Transform which uses my heuristics to compute sorting order,
- Improved two-stage is a completely different algorithm for inducing sorted order, but I have no reasonable idea how to parallelize it,
Numbers means the perces of suffices, which must be sorted to induce the rest. As you can see, my heuristics improves greatly the efficiency of Copy Transform, but it's still far from ITS. (Nota bene: ITS is described in document: http://homepage3.nifty.com/wpage/software/itssort.txt ).
Anyone have an idea what to use, to obtain the optimal inducing sequence?
PS:
I have posted that problem on another forum where someone told me that he doesn't understand which substrings I am sorting.
Actually, my program is sorting cyclic shifts (rotations), but it can be made to sort suffices by adding a sentinel to the end of base string.
I'll present an example of sorting and inducing.
We have following string:
Code:
abbababababbbab
We are generating all of its cyclic shifts, we will be sorting them:
Now we are inducing bucket b_bb from the rest of small buckets of big bucket B_b. In our case there is only one another small bucket. To induce bucket b_bb, we are selecting only that substrings, which starts and ends on letter b:
Code:
bababababbbabab
bababbababababb
We can now induce from the the beginning of bucket b_bb:
Code:
bbababababbbaba
bbababbabababab
We now see that we got a new substring, which ends on letter b (and also starts on letter b). We then induce the next position (like earlier, we are taking its cyclic shitf one place to left) - new element is added at the ending of bucket:
Now we have sorted everything, as bucket b_aa was empty.
A note on inducing buckets of type b_cc:
As I explained before, buckets of type b_cc can be induces from buckets {b_ca, b_cb, b_cd, ..., b_cz}. We must however rememeber about the appropriate order: buckets {b_ca, b_cb} (ie those that have second letter lexicographically smaller than c) are scanned forward and induced substrings are placed on first free positions of bucket b_cc. On ther other hand, buckets {b_cd, ..., b_cz} are scanned in reverse order and induced substrings are places on last free position of bucket b_cc.
In general my problem is not how to induce buckets (because I know how to do it), but how to compute optimal sequence of inducing, so that I would minimize the number of directly sorted substrings.
Bookmarks