Hi,

I am developing concurrent algorithm for block-sorting, ie something like algorithm used in bzip2. For block-sorting we can induce sorted order of some substrings from sorted order of another substrings, exploiting the fact that we are sorting substring form the same base substring.

In my program I am using Copy Transform, invented by Julian Seward, bzip2 author. I'll describe how this algorithm works in case of suffices. Some terminology first:

B_a is big bucket, holding all suffices starting from letter a,

b_ab is small bucket, holding all suffices starting from letters "ab"

Having sorted B_d bucket, I can induce sorted order of buckets {b_ad, b_bd, b_cd, b_ed, ..., b_zd} in a single pass over that big bucket. That's how it's done in original Copy Transform. Later, Seward notices that buckets of type b_dd can be induces from buckets {b_da, b_db, b_dc, d_de, ..., d_dz} (what greatly speed up his bzip2 on some input data).

My problem is to compute optimal sequence of inducing buckets, ie buckets are of different sizes, and I want to minimize the total size of buckets, which I have to sort using comparison sort, to induce the rest.

Actually, the efficiency look as this:

Code:Copy ascending Copy simple-ratio Improved two-stage chr22.dna 35.56 31.67 26.68 etext99 49.75 38.64 31.10 gcc-3.0.tar 42.63 33.46 26.77 howto 45.51 36.13 28.55 jdk13c 47.63 35.05 30.06 linux-2.4.5.tar 44.12 35.20 26.80 rctail96 49.00 36.57 30.28 rfc 40.17 32.28 25.83 sprot34.dat 43.14 38.07 29.26 w3c2 47.46 36.22 29.90 mean 44.50 35.33 28.52

Where:

- Copy ascending is a normal version of Copy Transform, where big buckets are sorted from smallest to biggest,

- Copy simple-ratio, is a version of Copy Transform which uses my heuristics to compute sorting order,

- Improved two-stage is a completely different algorithm for inducing sorted order, but I have no reasonable idea how to parallelize it,

Numbers means the perces of suffices, which must be sorted to induce the rest. As you can see, my heuristics improves greatly the efficiency of Copy Transform, but it's still far from ITS. (Nota bene: ITS is described in document: http://homepage3.nifty.com/wpage/software/itssort.txt ).

Anyone have an idea what to use, to obtain the optimal inducing sequence?

PS:

I have posted that problem on another forum where someone told me that he doesn't understand which substrings I am sorting.

Actually, my program is sorting cyclic shifts (rotations), but it can be made to sort suffices by adding a sentinel to the end of base string.

I'll present an example of sorting and inducing.

We have following string:

We are generating all of its cyclic shifts, we will be sorting them:Code:abbababababbbab

Let's start from sorting bucket b_ba. Here is the bucket before sorting:Code:abbababababbbab babbababababbba ababbababababbb bababbababababb bbababbabababab bbbababbabababa abbbababbababab babbbababbababa ababbbababbabab bababbbababbaba abababbbababbab babababbbababba ababababbbababb bababababbbabab bbababababbbaba

After sorting it becomes that:Code:babbababababbba bababbababababb babbbababbababa bababbbababbaba babababbbababba bababababbbabab

Now we are inducing bucket b_bb from the rest of small buckets of big bucket B_b. In our case there is only one another small bucket. To induce bucket b_bb, we are selecting only that substrings, which starts and ends on letter b:Code:bababababbbabab babababbbababba bababbababababb bababbbababbaba babbababababbba babbbababbababa

We can now induce from the the beginning of bucket b_bb:Code:bababababbbabab bababbababababb

We now see that we got a new substring, which ends on letter b (and also starts on letter b). We then induce the next position (like earlier, we are taking its cyclic shitf one place to left) - new element is added at the ending of bucket:Code:bbababababbbaba bbababbabababab

Now we have bucket B_b ready:Code:bbababababbbaba bbababbabababab bbbababbabababa

We can now then induce bucket b_ab. We need for it substring which are starting on letter b and ending on letter a:Code:bababababbbabab babababbbababba bababbababababb bababbbababbaba babbababababbba babbbababbababa bbababababbbaba bbababbabababab bbbababbabababa

We're inducing the contents of bucket b_ab in "traditional" way:Code:babababbbababba bababbbababbaba babbababababbba babbbababbababa bbababababbbaba bbbababbabababa

Now we have sorted everything, as bucket b_aa was empty.Code:ababababbbababb abababbbababbab ababbababababbb ababbbababbabab abbababababbbab abbbababbababab

A note on inducing buckets of type b_cc:

As I explained before, buckets of type b_cc can be induces from buckets {b_ca, b_cb, b_cd, ..., b_cz}. We must however rememeber about the appropriate order: buckets {b_ca, b_cb} (ie those that have second letter lexicographically smaller than c) are scanned forward and induced substrings are placed on first free positions of bucket b_cc. On ther other hand, buckets {b_cd, ..., b_cz} are scanned in reverse order and induced substrings are places on last free position of bucket b_cc.

In general my problem is not how to induce buckets (because I know how to do it), but how to compute optimal sequence of inducing, so that I would minimize the number of directly sorted substrings.