What does RANDPERM miss?
2 visualizzazioni (ultimi 30 giorni)
Mostra commenti meno recenti
Abstractly, a random permutation generator (RPG) is a function that maps a random bit-string of length b into a permutation of {1,2, ..., n}.
There are 2^b distinct bit-strings of length b and n! distinct permutations of length n. Hence we must have 2^b > n! Using Stirling's approximation for n! and taking logs of both sides of the inequality we get b > n*log2(n).
Now RANDPERM can generate a permutation of length n = 10^6 in 0.2 sec and part of its output looks like this: 383882, 905744, 760491, 138901. However, to generate all of the (10^6)! ~ 10^(5,565,709) possible permutations the inequality above says that b > 10^6*Log2(10^6) ~ 2*10^7 bits are necessary. But RANDPERM uses indirectly the Mersenne Twister generator whose internal state has about 625*32 = 2*10^4 bits and so it cannot possibly generate all (10^6)! permutations. Indeed, it must miss most of these permutations.
My question is: What permutations does RANDPERM (or any other RPG) miss? Are these misses spread uniformly over the permutation space?
4 Commenti
Oliver Woodford
il 26 Lug 2011
The randperm I have (R2010b) looks like this:
[~,p] = sort(rand(1,n));
So in the case that n = 2 (for example), rand(1,n) will generate a huge number of different streams, but the sort will project these down on to only two different permutations.
Risposte (4)
Jan
il 23 Lug 2011
You are right: The Mersenne-Twister cannot create all permutations from a certain limit. The period of 10^6000 means that the sequence of the created 32 bit numbers is repeated after this number of steps. RANDPERM uses 53 bit doubles, which are built from two 32 integers, such that the half period must be taken.
While the sequence of numbers is repeated after 0.5*10^6000 RNG calls only, the values of the 53 bit floating point number are repeated earlier. RANDPERM uses Matlab's SORT, which is a stable quicksort implementation. Here stability means, that the order of equal inputs is kept in the output. This leads to a lovely bias, which can be found without waiting for years by reducing the RNG range drastically:
rngLimit = 5;
n = 40;
elem = 1:n
elemSum = zeros(1, n);
indexSum = zeros(1, n);
for i = 1:10000
random = floor(rand(1, n) * rngLimit);
randomSum = randomSum + random;
[igno, p] = sort(random); % This is RANDPERM
elemSum = elemSum + elem(p);
end
% plot(randomSum) % no bias
plot(elemSum) % cute bias
However, for 53 bits and periods of 0.5*10^6000 the determination of the bias will lead to other problems: There are only about 10^80 nucleons in the universe. This limits the possibilities to store a list of already chosen permutations. So there are some permutations not reachable by RANDPERM, but the chance to suffer from this is tiny.
As you, Derek, know, I suggest not to use RANDPERM for large data sets, but the Fisher-Yates-Shuffle, e.g. implemented in FEX: RPG Lab or FEX: Shuffle. I've implemented the KISS RNG in Shuffle, which has a period of 10^37 and thought of using Marsaglias' CMWC4096 with a period of 10^33000. But as far as I can see, the limits of the universe are a stricter bound of the number of computable permutations.
I'd looking forward to know, if any advanced mathematician shares this estimation.
5 Commenti
Jan
il 26 Lug 2011
@Derek: The difference between the mathematical and physical view is the size of infinity. The physicist tends to accept 10^6! as infinity and then the term _uniformly_ looses its meaning, because there is no physical mechanism to proof that a distribution is not uniformly distributed. (Sorry, this is not 100% exact. A physicist will reject the assumption that a uniform RNG with infinite precision can create the sequence {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ...}.)
GUIDs are accepted to be unique, although this is mathematically wrong. Checksums like SHA-256 are actually RNGs with a wide low-entropy seed. Nevertheless the degree of uniqueness is sufficient for using them as trusted fingerprints.
Without doubt you are correct, that the determination of permutations is limited, especially the RANDPERM method (therefore I've voted the question). My meta arguments conern only the fact, that it will be very hard to create a real-world application, which suffers from this limitations.
Peter Perkins
il 8 Set 2011
A couple of people in this thread have mentioned the Fisher-Yates-Durstenfeld(-...) shuffle algorithm.
New in R2011b, the RANDPERM function lets you specify a second argument, as in, "give me a random selection of size k from a random permutaion of 1:n". This not only simplifies and speeds up the typical
perm = randperm(n);
r = perm(1:k)
that has been recommended in the past, but also uses F-Y-D. So of course you can ask for a random permutation of all of 1:n as
perm = randperm(n,n)
and have it generated using F-Y-D with the Mersenne Twister underneath. For backwards compatibility, randperm(n) continues to use the existing algorithm based on sort(rand(1,n)).
3 Commenti
Peter Perkins
il 9 Set 2011
yup:
>> rng default
>> randperm(5)
ans =
3 5 1 2 4
>> rng default
>> randperm(5,5)
ans =
4 2 3 1 5
(Not that demonstrates which is which, but you have it right.) Also note that
>> rng default
>> randperm(5,3)
ans =
5 3 2
so you can see that randperm(n,k) is not simply the first k elements of randperm(n,n) -- the latter is never generated, which leads to fast generation of things like
>> randperm(2^30,3)
ans =
521168135 859294610 152349296
By the way, randperm(n,k) can be thought of as "sampling without replacement", as opposed to randi(n,k,1), which can be thought of as "sampling with replacement. Also of potential interest, new in R2011b in Statistics Toolbox if you have it, is the datasample function <http://www.mathworks.com/help/toolbox/stats/datasample.html>, which facilitates both of those, including weighted versions of both (it is intended as an improved version of the existing randsample function in Statistics Toolbox).
Daniel Shub
il 23 Lug 2011
Peter Perkins blog on Loren's blog a while back about the random number generators.
It doesn't answer your question exactly. Your question seems to be more is the Mersenne Twister a good random number generator. The answer to that is, yes, it is one of the better general purpose random number generators available today. Yes, there are sequences it will not generate, however, its repeat time is so long that it is basically impossible to detect that from the sequence itself.
Walter Roberson
il 25 Lug 2011
As long as randperm() is implemented as a stable sort of random numbers that can be duplicates, you cannot trust randperm() requesting more than 1 value.
Suppose you rand(1,2) and both results happen to come out exactly the same. If they cannot come out exactly the same then that would be a test that could be applied to easily distinguish pseudo-random numbers from true-random numbers: true random numbers can have duplicates (and the standard birthday-paradox analysis would make short work of detecting the problem.)
If the two values of rand(1,2) can come out exactly the same (no matter what the period is for full repetition), then the stable sort that is applied to the random numbers is always going to choose the same ordering, not a permutation of that.
This suggests an recursive "saferrandperm" algorithm
[vals, idx] = sort(rand(1,N));
diff(vals)
Look for runs of 0 in the diff: these indicate groups if identical values, and identifying them is an order-N operation.
For any one group of identified identical values, extract the indices, take the length of that, and saferrandperm() that length. Use the indices saferrandperm() returns on the subset size to shuffle the extracted indices in the vector to be returned; if there are more groups of identical values, do the same things for those.
When saferrandperm detects ties in the generated rand() values, it does the same recursive shuffle.
The proportion of duplicated values will get smaller and smaller and as log as the RNG does not generate all-identical values, eventually any one subset will be suitably permuted with the ordering *not* stable.
1 Commento
Jan
il 26 Lug 2011
@Walter: Thanks for this safer version of RANDPERM.
If RANDPERM is patched, I suggest to replace it by the faster Fisher-Yates-Alorithm directly.
Vedere anche
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!