random partition in to k groups

**nuzzle** · September 1st, 2012, 11:36 AM

Originally Posted by dshawul

Well it seems the easiest approach of picking one at a time randomly works well.

I've run your example and I think I better understand the problem now. Using the specific numbers from the example you want each of the 10 rows to sum up to 4000 and each of the 3 columns to sum up to 10000, 20000 and 10000 respectively.

I haven't fully analyzed the algorithm but I take it you're making use of the central limit theorem to simulate the probability distribution for each row (which should tend towards 10000/40000, 20000/40000, 10000/40000 that is 1/4, 1/2, 1/4 probabilities). Thus with a row sum of 4000 the result should be close to 1000, 2000, 1000 but strangely the first row is,

1140, 2319, 541

This adds up to 4000 so it's correct in that way but the numbers deviate far too much from the expected 1000, 2000, 1000 averages so there's something fishy going on. It's the same with all 10 rows.

And in fact you're overextending rand(). This simple generator produces random numbers between 0 and RAND_MAX (which usually is 32767). Then you're using this technique to generate random numbers in the 0 to N-1 range:

Code:

int v = rand() % N;

It's based on that N should be fairly small in relation to what rand() produces to work properly. But in your case it doesn't. In fact when the first row is calculated N is 40000 which is even bigger than RAND_MAX! N decreases with each row but it's still way too big. In short you're skewing the probability distribution away from an assumed even distribution into something you don't quite know what it is really.

But it's possible to improve the situation. Use this instead,

Code:

float d = float(rand()) / float(RAND_MAX+1); // d is in the 0.0 to 1.0 (non inclusive) range
int v = int(d * float(N)); // v is in the 0 to N-1 range

It will produce random numbers in the wanted range as before but in a proper way. Now the first row becomes,

985, 2015, 1000

and all other rows are also closer to the expected 1000, 2000, 1000 averages.

Now the algorithm "works" in the sense that the random variation is a result of the natural variation you get from simulating probability fractions by applying the central limit theorem, and not from a fawlty and unpredictable random number generation. Still I suspect the algorithm is somewhat too complex and what it does can be achieved in a simpler more transparent way. Personally I would generate the partitions according to the normal (Gaussian) distribution instead. It would allow for a nicer more controlled random variation (since you cannot only determine the mean but also the deviation from the mean). Good luck!

Thread: random partition in to k groups

Thread Tools

Display

Threaded View

Re: random partition in to k groups

Posting Permissions