For loop over permutations of 1:n with very large n

Question
Take the following instruction:
for i = 1:1e18
% do something
end
now, probably the execution of the code will take a looong time. I am not concerned with this (see section Context to see why), but with the following problem: consider the instruction
for i = randperm(1e18)
% do something
end
Running the code above returns:
Error using randperm
Requested array exceeds the maximum possible variable size.
My question is: how to perform a for loop over a permutation of the elements of vector 1:n where n is a very large number, without incurring in memory issues? Moreover, how does Matlab perform the first for loop without incurring in such issues?
Context
I am building a simple program with a GUI to solve integer optimization problems. The for loop is used to evaluate the optimization function in the whole search space (which can be very large), and the GUI has a button that the user can press to "break" the loop; when the user does this, the program returns a suboptimal solution of the optimization problem. The second for loop would thus represent a simple implementation of a random search.

2 Commenti

See here for a related question
for i = 1:1e18
% do something
end
Even that simple loop there is a already a problem, since MATLAB cannot store distinct integers up to 1e18, which is beyond the capacity of 53 bits mantissa coding. You don't loop over all integers up to 1e18 as you would expected.

Accedi per commentare.

 Risposta accettata

How long does your loop body take to execute? If it's one loop body execution per second, running all 1e18 of those iterations would take:
y = years(seconds(1e18))
y = 3.1689e+10
31 billion years. According to Wikipedia's timeline of the far future Earth likely will have been destroyed about 23 billion years before the end of execution of that loop, when the Sun expands into a red giant.
If you could do a thousand loop body executions per second:
y = years(seconds(1e18/1000))
y = 3.1689e+07
you're looking at "only" 3 million years. But it'll take fewer days to run the last couple iterations of that loop, since a day on Earth is expected to be one minute longer than it is today.
You can't exactly represent all the numbers between 1 and 1e18 as double precision numbers. So your attempt to choose a random integer-valued double in that range will fail.
1e18 > flintmax
ans = logical
1
Going to uint64 would help.
1e18 > double(intmax('uint64'))
ans = logical
0
So you could use randi to generate two 32-bit integers and typecast the result to uint64. [randi does not let you directly generate 64-bit integers.]
x = typecast(randi([intmin('uint32'), intmax('uint32')], 1, 2, 'uint32'), 'uint64')
x = uint64 11258641691898087842
Reject any trials where x is greater than:
u = uint64(10)^18
u = uint64 1000000000000000000
The odds of repeating the same value of x over a reasonable amount of time is likely very, very small.

8 Commenti

Your solution gets the job done, even though it is not as "elegant" as I would like it to be. Any idea for solving the issue of repeating the same value of x? I know it has low practical relevance for such high values, take it as a more "theoretical" issue. Also, I still do not understand how Matlab performs this:
for i = 1:1e18
% do something
end
Note that the code above does not result in memory errors, even though we are talking about a for loop that will probably not end in 3 million years, as you point out.
Why should the loop
for i = 1:1e18
% do something
end
have memory problems ? i is not a vector with 1e18 elements, but an integer between 1 and 1e18 in each iteration of the loop. This is in contrast to
for i = randperm(1e18)
where i theoretically would be a random permutation of the integers 1,2,3,..., 1e18 - a column vector with 1e18 elements.
Note that the documentation page for the for keyword lists three syntaxes in the Description section. MATLAB can optimize the first two cases, initVal:endVal and initVal:step:endVal, to avoid explicitly creating the vector. It's this optimization that allows something like the following to "work" (== "not error".) You're going to want to Ctrl-C out of the execution before it ends; the drawnow call is to let MATLAB have a chance to "notice" the Ctrl-C. I'm not going to try to run this on MATLAB Answers for what I hope are obvious reasons.
for k = 1:Inf
drawnow
end
Back when we had a 32-bit MATLAB, I think that took only a couple minutes to run because the warning mentioned something near intmax('int32') and 2 billion-ish iterations of a loop that did nothing or next to nothing was still pretty quick. On a 64-bit MATLAB, it would take much, much longer.
Your approach with randperm uses the third syntax, valArray. Since there's no definitive ordering to the elements of randperm MATLAB can use to optimize this case it has to try to evaluate the function call and use its return value as the array over which to iterate. MATLAB doesn't even get to start trying to iterate; creating the vector of values fails.
As for detecting duplicates, anything that lets you record values and determine if a value is present in the record (a vector and ismember, a dictionary and isKey, a logical sparse matrix [which you'd need to use mod and rem or use ind2sub on your uint64 value to generate indices into, since you can't create one with 1e18 rows but you could create one with 1e14 rows and 1e4 columns] and indexing) would theoretically work.
@Torsten: note that what you wrote is not true. Calling randperm(n) returns a row vector as 1:n, therefore i assumes scalar values in the for loop. You can see this by running
for i = randperm(4)
i
end
i = 4
i = 3
i = 1
i = 2
and comparing the result with
for i = randperm(4)'
i
end
i = 4×1
2 4 3 1
@Steven Lord: that's interesting, thanks. I guess I should have read once more the documentation on the for keyword. Still, I would be curious to know whether something I have in mind can be done. I try to formulate the question in a different way. Is there a function res = myfunc(i,n), where i <= n and res is a positive integer such that res <= n, with the following properties:
  • myfunc(i,n) ~= myfunc(j,n) for all i ~= j,
  • the output of the function with respect of i (i.e., with fixed n) "feels random" (of course this requirement is not very precise),
  • running myfunc(i,1e18) does not return a memory error.
Note that without the second specification a valid solution would be the identity function, i.e., the function that returns res = i.
If such a function exists, then I could use it in the following way.
for ii = 1:1e18
i = myfunc(ii,1e18);
% do something
end
'myfunc' is used in Create Function to Represent a Grey-Box Model.
I was not clear enough with what I wrote, but @Steven Lord explained it:
For your loop to start, a vector with 1e18 elements - namely randperm(1e18) - has to be created and hold in memory. That's not the case for a simple loop of type
for i = initVal:endVal
or
for i = initVal:step:endVal
Suppose that you had a random permutation generator that could generate additional numbers incrementally rather than having to generate the entire sequence at once (which would have a heavy memory penalty).
Now, at any given time, how does the algorithm know whether it has generated a particular value already if there is not at least a binary "already used" map, size proportional to log2(n) ? (so, 2^57 bytes in the case of n = 1e18) ?
The answer is that it would have to be an algorithm that cannot generate a previous value, and so does not need to keep track of where it has been.
Do such algorithms exist? Well, yes and no. In the case where the n to generate over is a prime, then you can use a https://en.wikipedia.org/wiki/Linear_congruential_generator Linear Congruential Generator . With proper n and proper selection of parameters, such generators are certain not to repeat until they reach the end of the sequence. The problem is that the random statistics are Not Good. For example if you plot the values over a 2D box, you will see distinct lines being filled out, rather than the space being filled randomly.
Are there alternatives? Maybe. The discussion at https://en.wikipedia.org/wiki/Mersenne_Twister refers to some alternatives. Mersenne Twister itself is not suitable for such a purpose: it uses a state vector of 624 words, and any one word can (will!!) repeat before the end of the overall period; when used as a pseudo-random number generator, it is deliberate that MT can repeat individual entries: in most cases when a permulation is not being generated, if the generator cannot produce duplicates before the end of the period, then you would be able to use that fact to tell that you are not generating true random integers. (With true random integers, the more integers you generate, the more likely it becomes that you will generate an integer that has already been generated.)
@Walter Roberson: thank you for your insightful comment. This is exactly what I was looking for.

Accedi per commentare.

Più risposte (1)

Bruno Luong
Bruno Luong il 27 Apr 2023
Modificato: Bruno Luong il 27 Apr 2023
If you don't care about finding optimal solution, why not just do
n = 1e15; % 1e18 has problem of coding as in my comment above
for j = 1:n
i = randi(n);
% do something with i
fprintf('i=%d\n',i);
end
This is more or less has periodicity as Walter comment of LCG (yes behind rand, randi, randperm is Merssene Twister), but collision occurs when compress in smaller range 1:n. But you say you want to find suboptimal solution,so collision should only have negligible impact to runing time and not on randomness covering.

5 Commenti

The idea is however that the code should work also when n is small (admittedly, this was not explained in my question). Suppose that n = 2, with your code it is possible that i will only assume the value of 1 (twice), thus never exploring the whole search space.
To be more clear: a priori I do not know how large n is; it is up to the user to decide that. If n is very small, the user will probably decide to wait the for loop to terminate, and the code should explore the whole search space. If n is "medium size" and the code will take some hours to complete, I want the user to decide for him/herself whether it is worth to wait: if it is worth to wait, the code should explore the whole search space, if not, the user will decide to stop earlier and explore only a subspace. Finally, if n is very large (like 1e15 kind of large), the code will definitely not terminate; still, it should not throw an error because of memory issues, and the user should be given the possibility to break the loop at will.
Ideally, I would like the code to be general, without having to define what "small", "medium", or "large size" means, in terms of n. It should work on a computer from NASA and on a Raspberry PI in the same way, without setting parameters that define when to change strategies for the random serarch based on how much memory the user's computer has.
It seems that a linear congruential generator with a smart choice of its parameters should do the trick (even though its random statistics are not that good), as suggested here by @Walter Roberson.
I assume you have very large n and user stops by push button in your Context description.
For small n, just loop a little bit more so the chance of full cover increases.
OK my solution gives collision in generated number so it is far from being ideal. But it is probably the simplest modification without keeping track of what has been generated and the associated memory issue you encount.
The LCG will have other issues that you must face when n is not prime.
n = 2; % 1e18 has problem of coding as in my comment above
for j = 1:2*n % more than n for better coverage
i = randi(n);
% do something with i
fprintf('i=%d\n',i);
end
i=1 i=1 i=2 i=2
n = 10; % 1e18 has problem of coding as in my comment above
for j = 1:2*n
i = randi(n);
% do something with i
fprintf('i=%d\n',i);
end
i=4 i=10 i=1 i=3 i=6 i=5 i=3 i=9 i=9 i=3 i=7 i=3 i=7 i=1 i=10 i=5 i=2 i=3 i=3 i=6
@Bruno Luong: "The LCG will have other issues that you must face when n is not prime."
My idea to deal with a nonprime n would be to set up the LCG with n1, where n1 is the smaller prime number larger than n (to find n1, see this discussion), and then ignore (using continue) all values of i such that i > n.
Bruno Luong
Bruno Luong il 27 Apr 2023
Modificato: Bruno Luong il 27 Apr 2023
Remember my warning: you cannot store in standard matlab double any integer larger than 2^53 ~ 1e16. Let alone doing any meaningful arithmetics and computing prime beyond that.
https://en.m.wikipedia.org/wiki/Linear-feedback_shift_register

Accedi per commentare.

Categorie

Scopri di più su Programming in Centro assistenza e File Exchange

Prodotti

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by