What does RANDPERM miss?

2 visualizzazioni (ultimi 30 giorni)
Derek O'Connor
Derek O'Connor il 23 Lug 2011
Abstractly, a random permutation generator (RPG) is a function that maps a random bit-string of length b into a permutation of {1,2, ..., n}.
There are 2^b distinct bit-strings of length b and n! distinct permutations of length n. Hence we must have 2^b > n! Using Stirling's approximation for n! and taking logs of both sides of the inequality we get b > n*log2(n).
Now RANDPERM can generate a permutation of length n = 10^6 in 0.2 sec and part of its output looks like this: 383882, 905744, 760491, 138901. However, to generate all of the (10^6)! ~ 10^(5,565,709) possible permutations the inequality above says that b > 10^6*Log2(10^6) ~ 2*10^7 bits are necessary. But RANDPERM uses indirectly the Mersenne Twister generator whose internal state has about 625*32 = 2*10^4 bits and so it cannot possibly generate all (10^6)! permutations. Indeed, it must miss most of these permutations.
My question is: What permutations does RANDPERM (or any other RPG) miss? Are these misses spread uniformly over the permutation space?
  4 Commenti
Oliver Woodford
Oliver Woodford il 26 Lug 2011
The randperm I have (R2010b) looks like this:
[~,p] = sort(rand(1,n));
So in the case that n = 2 (for example), rand(1,n) will generate a huge number of different streams, but the sort will project these down on to only two different permutations.
Derek O'Connor
Derek O'Connor il 26 Lug 2011
Oliver,
Thanks for your example. I see what you mean now.
Here is a crude generalization:
1. If the size of the RNG space, |R|, is greater than the size of
the permutation space |P|, then FYS(U) will be many to 1.
2. If |R| = |P| then FYS(U) will be 1:1
3. If |R| << |P| then FYS(U) will be ??
It is the third case that bothers me and prompted my question.

Accedi per commentare.

Risposte (4)

Jan
Jan il 23 Lug 2011
You are right: The Mersenne-Twister cannot create all permutations from a certain limit. The period of 10^6000 means that the sequence of the created 32 bit numbers is repeated after this number of steps. RANDPERM uses 53 bit doubles, which are built from two 32 integers, such that the half period must be taken.
While the sequence of numbers is repeated after 0.5*10^6000 RNG calls only, the values of the 53 bit floating point number are repeated earlier. RANDPERM uses Matlab's SORT, which is a stable quicksort implementation. Here stability means, that the order of equal inputs is kept in the output. This leads to a lovely bias, which can be found without waiting for years by reducing the RNG range drastically:
rngLimit = 5;
n = 40;
elem = 1:n
elemSum = zeros(1, n);
indexSum = zeros(1, n);
for i = 1:10000
random = floor(rand(1, n) * rngLimit);
randomSum = randomSum + random;
[igno, p] = sort(random); % This is RANDPERM
elemSum = elemSum + elem(p);
end
% plot(randomSum) % no bias
plot(elemSum) % cute bias
However, for 53 bits and periods of 0.5*10^6000 the determination of the bias will lead to other problems: There are only about 10^80 nucleons in the universe. This limits the possibilities to store a list of already chosen permutations. So there are some permutations not reachable by RANDPERM, but the chance to suffer from this is tiny.
As you, Derek, know, I suggest not to use RANDPERM for large data sets, but the Fisher-Yates-Shuffle, e.g. implemented in FEX: RPG Lab or FEX: Shuffle. I've implemented the KISS RNG in Shuffle, which has a period of 10^37 and thought of using Marsaglias' CMWC4096 with a period of 10^33000. But as far as I can see, the limits of the universe are a stricter bound of the number of computable permutations.
I'd looking forward to know, if any advanced mathematician shares this estimation.
  5 Commenti
Jan
Jan il 26 Lug 2011
@Derek: The difference between the mathematical and physical view is the size of infinity. The physicist tends to accept 10^6! as infinity and then the term _uniformly_ looses its meaning, because there is no physical mechanism to proof that a distribution is not uniformly distributed. (Sorry, this is not 100% exact. A physicist will reject the assumption that a uniform RNG with infinite precision can create the sequence {1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ...}.)
GUIDs are accepted to be unique, although this is mathematically wrong. Checksums like SHA-256 are actually RNGs with a wide low-entropy seed. Nevertheless the degree of uniqueness is sufficient for using them as trusted fingerprints.
Without doubt you are correct, that the determination of permutations is limited, especially the RANDPERM method (therefore I've voted the question). My meta arguments conern only the fact, that it will be very hard to create a real-world application, which suffers from this limitations.
Derek O'Connor
Derek O'Connor il 4 Ago 2011
@Jan
Assuming that the Fisher-Yates Shuffle produces permutations
uniformly if it uses a good RNG with period T, and if we use
Brent's & Matsumoto's recommendation that we generate no more than
T^(1/3) RNs then we must restrict n so that n! < T^(1/3). This
gives
1. n = 11 for KISS, T = 10^37
2. n = 702 for MT, T = 10^6000
3. n = 3690 for CMWC4096, T = 10^39460
These values of n are very small.
I have come to the pessimistic conclusion that we may never be able
to generate large permutations, with a guarantee of uniformity.
The same conclusion applies to generating random samples: there are
C(N,n) possible samples of size n from a population of size N. Thus
a function that uses KISS to generate samples of size 500 from an
array of size 10^6 cannot possibly cover the C(10^6,500) ~ 10^1866
possible samples.
J. Gentle, in his book "Random Number Generation and Monte Carlo
Methods", 2nd Ed., 2003, has the following comment at the bottom of
page 220:
`In no other application considered in this book does the
limitation of a finite period stand out so clearly as it does in
the case of generation of random samples and permutations.'
Also, he points out that these limitations have been known for many
years. See:
Greenwood, J. Arthur (1976a), The demands of trivial combinatorial
problems on random number generators, "Proceedings of the Ninth
Interface Symposium on Computer Science and Statistics"
A final problem: empirically testing an RPG for uniformity seems to
be difficult, even for small values of n.
The standard Chi-Squared goodness-of-fit test requires a sample
size ns > 5*n! for permutations of size n. Hence, to test the
output of an RPG that produces permutations of size n = 20, would
require samples of size ns = 5*20! ~ 10^19. This is clearly
impracticable.
There may be other tests for uniformity that require less work.
Does anyone have suggestions?

Accedi per commentare.


Peter Perkins
Peter Perkins il 8 Set 2011
A couple of people in this thread have mentioned the Fisher-Yates-Durstenfeld(-...) shuffle algorithm.
New in R2011b, the RANDPERM function lets you specify a second argument, as in, "give me a random selection of size k from a random permutaion of 1:n". This not only simplifies and speeds up the typical
perm = randperm(n);
r = perm(1:k)
that has been recommended in the past, but also uses F-Y-D. So of course you can ask for a random permutation of all of 1:n as
perm = randperm(n,n)
and have it generated using F-Y-D with the Mersenne Twister underneath. For backwards compatibility, randperm(n) continues to use the existing algorithm based on sort(rand(1,n)).
  3 Commenti
Peter Perkins
Peter Perkins il 9 Set 2011
yup:
>> rng default
>> randperm(5)
ans =
3 5 1 2 4
>> rng default
>> randperm(5,5)
ans =
4 2 3 1 5
(Not that demonstrates which is which, but you have it right.) Also note that
>> rng default
>> randperm(5,3)
ans =
5 3 2
so you can see that randperm(n,k) is not simply the first k elements of randperm(n,n) -- the latter is never generated, which leads to fast generation of things like
>> randperm(2^30,3)
ans =
521168135 859294610 152349296
By the way, randperm(n,k) can be thought of as "sampling without replacement", as opposed to randi(n,k,1), which can be thought of as "sampling with replacement. Also of potential interest, new in R2011b in Statistics Toolbox if you have it, is the datasample function <http://www.mathworks.com/help/toolbox/stats/datasample.html>, which facilitates both of those, including weighted versions of both (it is intended as an improved version of the existing randsample function in Statistics Toolbox).
Derek O'Connor
Derek O'Connor il 9 Set 2011
Peter,
Is randperm(2^30,3) ans = 521168135 859294610 152349296 faster than
this:
>> L=-2^25;U=2^25;ns=3;tic;
>> [S,nw] = RandSampRejBitsP(L,U,ns);disp(toc);disp(nw);disp(S(1:3))
0.00065465 secs
0 random numbers rejected
8440023 -11312240 11620411
>> L=-2^25;U=2^25;ns=2^15;tic;
>> [S,nw] = RandSampRejBitsP(L,U,ns);disp(toc);disp(nw);disp(S(1:3))
0.022715 secs
6 random numbers rejected
-31191737 27713048 16443262
RandSampRejBitsP can't go up to 2^30 and it needs a warm-up call to
set up the large *persistent* N = (U-L+1)= 2^26 "bit vector"
(logical Member(1,N)). Memory is cheap.
Dell Precision 690, Intel 2xQuad-Core E5345 @ 2.33GHz, 16GB RAM
Windows7 64-bit Prof., MATLAB R2008a, 7.6.0.324

Accedi per commentare.


Daniel Shub
Daniel Shub il 23 Lug 2011
Peter Perkins blog on Loren's blog a while back about the random number generators.
It doesn't answer your question exactly. Your question seems to be more is the Mersenne Twister a good random number generator. The answer to that is, yes, it is one of the better general purpose random number generators available today. Yes, there are sequences it will not generate, however, its repeat time is so long that it is basically impossible to detect that from the sequence itself.
  1 Commento
Derek O'Connor
Derek O'Connor il 23 Lug 2011
Daniel,
You are "putting words in my mouth": I do not question the goodness
of the Mersenne Twister. It is one of the best of the "modern"
generators and has been tested and used by many. Its period is
about 2^{19937} ~ 10^{6000} and is equidistributed in 623 dimensions.
There are better generators, according to George Marsaglia
(http://groups.google.com/group/sci.crypt/browse_thread/thread/305c507efbe85be4)
It seems to me that if we use any RNG whose internal state has b
bits then we should restrict the size of the permutations generated
by any RPG that uses it so that n*log2(n) < b. If an RPG uses the
MT generator then it should not be allowed to generate permutations
of size greater than about n = 1830, with n*log2(n) = 19833. If we
allow n to be greater than 1830 then the RPG must miss some of the
permutations. Hence my question.

Accedi per commentare.


Walter Roberson
Walter Roberson il 25 Lug 2011
As long as randperm() is implemented as a stable sort of random numbers that can be duplicates, you cannot trust randperm() requesting more than 1 value.
Suppose you rand(1,2) and both results happen to come out exactly the same. If they cannot come out exactly the same then that would be a test that could be applied to easily distinguish pseudo-random numbers from true-random numbers: true random numbers can have duplicates (and the standard birthday-paradox analysis would make short work of detecting the problem.)
If the two values of rand(1,2) can come out exactly the same (no matter what the period is for full repetition), then the stable sort that is applied to the random numbers is always going to choose the same ordering, not a permutation of that.
This suggests an recursive "saferrandperm" algorithm
[vals, idx] = sort(rand(1,N));
diff(vals)
Look for runs of 0 in the diff: these indicate groups if identical values, and identifying them is an order-N operation.
For any one group of identified identical values, extract the indices, take the length of that, and saferrandperm() that length. Use the indices saferrandperm() returns on the subset size to shuffle the extracted indices in the vector to be returned; if there are more groups of identical values, do the same things for those.
When saferrandperm detects ties in the generated rand() values, it does the same recursive shuffle.
The proportion of duplicated values will get smaller and smaller and as log as the RNG does not generate all-identical values, eventually any one subset will be suitably permuted with the ordering *not* stable.
  1 Commento
Jan
Jan il 26 Lug 2011
@Walter: Thanks for this safer version of RANDPERM.
If RANDPERM is patched, I suggest to replace it by the faster Fisher-Yates-Alorithm directly.

Accedi per commentare.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by