How do you generate a random permutation of the numbers from 0 to N without storing it in an array?
Note that duplicates must be avoided, so I can't just generate them with repeated rand() calls.
The problem I have is to partition N number of data points into k groups in random selection manner.
I thought of randomly shuffling the indices using std::random_shuffle from the STL, but then the indices need to be stored in an array.
Is there a way to get the N integers in [0..N] in a random order, which I can then use to access a database?
N can be very big (thousands or millions). Maybe I am looking at this problem from a wrong perspective so let me describe the actual problem.
I have data of wins,draws,losses N = W + D + L. For example 100000 wins, 40000 draws and 60000 losses = 200000 total.
Now I want to partition this into, say, 10 groups; the first group could be e.g. (20000w, 10234d, 34000l). Once I generate all 10 groups like that, they should add up to the total values.
Edit: It seems what I need is to generate k random numbers that add up to W, another that add up to D etc. Maybe that will work?
Last edited by dshawul; August 31st, 2012 at 09:32 AM.
It seems what I need is to generate k random numbers that add up to W
If you want k sections, generate k-1 random numbers within the range; these represent the borders between the sections. To those you add the two range limits, giving k+1 numbers in total. Then you sort them. Finally, to get the k section sizes, take the difference between each pair of adjacent numbers (depending on whether you treat the borders as inclusive you may need to add 1).
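To illustrate, here is a minimal C++ sketch of the borders idea above (the name random_partition is my own, and rand() % (total + 1) carries the modulo-bias caveat discussed later in the thread when total is large):

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// Split `total` into `k` random non-negative parts that sum to `total`:
// pick k-1 random cut points in [0, total], add the two range limits,
// sort, and take differences between adjacent cut points.
std::vector<int> random_partition(int total, int k) {
    std::vector<int> cuts;
    cuts.push_back(0);                        // lower range limit
    for (int i = 0; i < k - 1; ++i)
        cuts.push_back(rand() % (total + 1)); // a random border
    cuts.push_back(total);                    // upper range limit
    std::sort(cuts.begin(), cuts.end());
    std::vector<int> parts(k);
    for (int i = 0; i < k; ++i)
        parts[i] = cuts[i + 1] - cuts[i];     // section sizes
    return parts;
}
```

By construction the parts always sum to `total`, so no fix-up pass is needed afterwards.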
Well, it seems the easiest approach, picking one at a time randomly, works well.
I've run your example and I think I better understand the problem now. Using the specific numbers from the example you want each of the 10 rows to sum up to 4000 and each of the 3 columns to sum up to 10000, 20000 and 10000 respectively.
I haven't fully analyzed the algorithm but I take it you're making use of the central limit theorem to simulate the probability distribution for each row (which should tend towards 10000/40000, 20000/40000, 10000/40000 that is 1/4, 1/2, 1/4 probabilities). Thus with a row sum of 4000 the result should be close to 1000, 2000, 1000 but strangely the first row is,
1140, 2319, 541
This adds up to 4000 so it's correct in that way but the numbers deviate far too much from the expected 1000, 2000, 1000 averages so there's something fishy going on. It's the same with all 10 rows.
And in fact you're overextending rand(). This simple generator produces random numbers between 0 and RAND_MAX (which is often as small as 32767). Then you're using this technique to generate random numbers in the 0 to N-1 range:
int v = rand() % N;
This technique relies on N being fairly small relative to the range rand() produces. In your case it isn't: when the first row is calculated, N is 40000, which is even bigger than RAND_MAX! N decreases with each row but remains far too big. In short, you're skewing the probability distribution away from the assumed even distribution into something you don't really know.
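A small self-contained sketch of that skew, using a hypothetical 15-bit LCG (rand15 is my own stand-in for an old rand() whose RAND_MAX is 32767, the case assumed above):

```cpp
// Hypothetical 15-bit generator standing in for an old rand()
// with RAND_MAX == 32767.
static unsigned long g_state = 1;
int rand15() {
    g_state = g_state * 1103515245UL + 12345UL;   // classic LCG constants
    return (int)((g_state / 65536UL) % 32768UL);  // 0 .. 32767
}

// Count how many of `draws` values of rand15() % N land in [32768, N).
// With N = 40000 that upper ~18% of the range can never be produced,
// because rand15() % 40000 is just rand15() itself.
int count_high(int draws, int N) {
    int high = 0;
    for (int i = 0; i < draws; ++i)
        if (rand15() % N >= 32768) ++high;
    return high;
}
```

So with N larger than RAND_MAX the top part of the range is simply unreachable, on top of the usual modulo bias in the lower part.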
But it's possible to improve the situation. Use this instead,
double d = double(rand()) / (double(RAND_MAX) + 1.0); // d is in the [0.0, 1.0) range
int v = int(d * double(N));                           // v is in the 0 to N-1 range
It will produce random numbers in the wanted range as before but in a proper way. Now the first row becomes,
985, 2015, 1000
and all other rows are also closer to the expected 1000, 2000, 1000 averages.
Now the algorithm "works" in the sense that the random variation is the natural variation you get from simulating probability fractions via the central limit theorem, and not from a faulty and unpredictable random number generation. Still, I suspect the algorithm is somewhat too complex, and what it does can be achieved in a simpler, more transparent way. Personally I would generate the partitions according to the normal (Gaussian) distribution instead. It would allow for a nicer, more controlled random variation, since you can not only set the mean but also the deviation from the mean. Good luck!
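For what it's worth, a minimal sketch of that Gaussian idea (gaussian_partition and its parameters are my own invention, not a finished design; it draws k-1 part sizes from a normal distribution and lets the last part absorb the remainder so the sum stays exact):

```cpp
#include <random>
#include <vector>

// Split `total` into k parts whose sizes are drawn from a normal
// distribution with mean total/k and standard deviation `sd`.
// The last part absorbs the remainder so the parts still sum to total.
std::vector<int> gaussian_partition(int total, int k, double sd,
                                    unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> dist(double(total) / k, sd);
    std::vector<int> parts(k);
    int used = 0;
    for (int i = 0; i < k - 1; ++i) {
        int v = int(dist(gen) + 0.5);      // round to nearest integer
        if (v < 0) v = 0;                  // clamp into the feasible range
        if (v > total - used) v = total - used;
        parts[i] = v;
        used += v;
    }
    parts[k - 1] = total - used;           // remainder keeps the sum exact
    return parts;
}
```

The nice part is that `sd` directly controls how far the groups may stray from the mean total/k.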
Last edited by nuzzle; September 3rd, 2012 at 12:19 AM.
Sorry, I didn't look back at this thread once I thought I had something working. Yes, you are right, there was something fishy about the code I posted. Running the code 10 times reveals a pattern: W decreases from sample 1 to 10. At the time I thought it was because rand() was running out of cycles, and replacing it with a better PRNG such as a Mersenne twister actually gave a better result. But as you pointed out there is more to it, and rand() may work equally well once I fix that mistake.
Anyway, there is a better method than the brute-force approach I followed. Hypergeometric sampling (without replacement) gives a much faster method, but it requires a complex sampling function, which I found in an R package.
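For anyone reading later, here is a minimal sketch of that sampling-without-replacement idea (draw_group is my own illustration, not the R function; each draw removes one item from the remaining pool, so the per-group counts follow a multivariate hypergeometric distribution):

```cpp
#include <cstdlib>

// Draw one group of size g, without replacement, from a pool of
// w wins, d draws and l losses (precondition: g <= w + d + l).
// The pool counts w, d, l are updated in place; gw, gd, gl receive
// the group's counts. rand() is kept for consistency with the thread,
// so the modulo bias discussed above applies for very large pools.
void draw_group(int g, int &w, int &d, int &l,
                int &gw, int &gd, int &gl) {
    gw = gd = gl = 0;
    for (int i = 0; i < g; ++i) {
        int r = rand() % (w + d + l);   // pick one remaining item uniformly
        if (r < w)          { --w; ++gw; }
        else if (r < w + d) { --d; ++gd; }
        else                { --l; ++gl; }
    }
}
```

Calling this k times, once per group, partitions the whole pool and the totals come out exact by construction. (This naive version is O(N) per item; a dedicated hypergeometric sampler like the one in R avoids the per-item loop, which is where the speedup comes from.)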