Hi,

I am developing concurrent algorithm for block-sorting, ie something like algorithm used in bzip2. For block-sorting we can induce sorted order of some substrings from sorted order of another substrings, exploiting the fact that we are sorting substring form the same base substring.

In my program I am using Copy Transform, invented by Julian Seward, bzip2 author. I'll describe how this algorithm works in case of suffices. Some terminology first:

B_a is big bucket, holding all suffices starting from letter a,
b_ab is small bucket, holding all suffices starting from letters "ab"

Having sorted B_d bucket, I can induce sorted order of buckets {b_ad, b_bd, b_cd, b_ed, ..., b_zd} in a single pass over that big bucket. That's how it's done in original Copy Transform. Later, Seward notices that buckets of type b_dd can be induces from buckets {b_da, b_db, b_dc, d_de, ..., d_dz} (what greatly speed up his bzip2 on some input data).

My problem is to compute optimal sequence of inducing buckets, ie buckets are of different sizes, and I want to minimize the total size of buckets, which I have to sort using comparison sort, to induce the rest.

Actually, the efficiency look as this:
Code:
                   Copy ascending      Copy simple-ratio   Improved two-stage
chr22.dna          35.56               31.67               26.68
etext99            49.75               38.64               31.10
gcc-3.0.tar        42.63               33.46               26.77
howto              45.51               36.13               28.55
jdk13c             47.63               35.05               30.06
linux-2.4.5.tar    44.12               35.20               26.80
rctail96           49.00               36.57               30.28
rfc                40.17               32.28               25.83
sprot34.dat        43.14               38.07               29.26
w3c2               47.46               36.22               29.90
 
mean               44.50               35.33               28.52

Where:
- Copy ascending is a normal version of Copy Transform, where big buckets are sorted from smallest to biggest,
- Copy simple-ratio, is a version of Copy Transform which uses my heuristics to compute sorting order,
- Improved two-stage is a completely different algorithm for inducing sorted order, but I have no reasonable idea how to parallelize it,

Numbers means the perces of suffices, which must be sorted to induce the rest. As you can see, my heuristics improves greatly the efficiency of Copy Transform, but it's still far from ITS. (Nota bene: ITS is described in document: http://homepage3.nifty.com/wpage/software/itssort.txt ).

Anyone have an idea what to use, to obtain the optimal inducing sequence?

PS:
I have posted that problem on another forum where someone told me that he doesn't understand which substrings I am sorting.

Actually, my program is sorting cyclic shifts (rotations), but it can be made to sort suffices by adding a sentinel to the end of base string.

I'll present an example of sorting and inducing.

We have following string:
Code:
abbababababbbab
We are generating all of its cyclic shifts, we will be sorting them:
Code:
abbababababbbab
babbababababbba
ababbababababbb
bababbababababb
bbababbabababab
bbbababbabababa
abbbababbababab
babbbababbababa
ababbbababbabab
bababbbababbaba
abababbbababbab
babababbbababba
ababababbbababb
bababababbbabab
bbababababbbaba
Let's start from sorting bucket b_ba. Here is the bucket before sorting:
Code:
babbababababbba
bababbababababb
babbbababbababa
bababbbababbaba
babababbbababba
bababababbbabab
After sorting it becomes that:
Code:
bababababbbabab
babababbbababba
bababbababababb
bababbbababbaba
babbababababbba
babbbababbababa
Now we are inducing bucket b_bb from the rest of small buckets of big bucket B_b. In our case there is only one another small bucket. To induce bucket b_bb, we are selecting only that substrings, which starts and ends on letter b:
Code:
bababababbbabab
bababbababababb
We can now induce from the the beginning of bucket b_bb:
Code:
bbababababbbaba
bbababbabababab
We now see that we got a new substring, which ends on letter b (and also starts on letter b). We then induce the next position (like earlier, we are taking its cyclic shitf one place to left) - new element is added at the ending of bucket:
Code:
bbababababbbaba
bbababbabababab
bbbababbabababa
Now we have bucket B_b ready:
Code:
bababababbbabab
babababbbababba
bababbababababb
bababbbababbaba
babbababababbba
babbbababbababa
bbababababbbaba
bbababbabababab
bbbababbabababa
We can now then induce bucket b_ab. We need for it substring which are starting on letter b and ending on letter a:
Code:
babababbbababba
bababbbababbaba
babbababababbba
babbbababbababa
bbababababbbaba
bbbababbabababa
We're inducing the contents of bucket b_ab in "traditional" way:
Code:
ababababbbababb
abababbbababbab
ababbababababbb
ababbbababbabab
abbababababbbab
abbbababbababab
Now we have sorted everything, as bucket b_aa was empty.

A note on inducing buckets of type b_cc:
As I explained before, buckets of type b_cc can be induces from buckets {b_ca, b_cb, b_cd, ..., b_cz}. We must however rememeber about the appropriate order: buckets {b_ca, b_cb} (ie those that have second letter lexicographically smaller than c) are scanned forward and induced substrings are placed on first free positions of bucket b_cc. On ther other hand, buckets {b_cd, ..., b_cz} are scanned in reverse order and induced substrings are places on last free position of bucket b_cc.

In general my problem is not how to induce buckets (because I know how to do it), but how to compute optimal sequence of inducing, so that I would minimize the number of directly sorted substrings.