For loop over permutations of 1:n with very large n

Question

0 voti

Question

Take the following instruction:

for i = 1:1e18
    % do something
end

now, probably the execution of the code will take a looong time. I am not concerned with this (see section Context to see why), but with the following problem: consider the instruction

for i = randperm(1e18)
    % do something
end

Running the code above returns:

Error using randperm
Requested array exceeds the maximum possible variable size.

My question is: how to perform a for loop over a permutation of the elements of vector 1:n where n is a very large number, without incurring in memory issues? Moreover, how does Matlab perform the first for loop without incurring in such issues?

Context

I am building a simple program with a GUI to solve integer optimization problems. The for loop is used to evaluate the optimization function in the whole search space (which can be very large), and the GUI has a button that the user can press to "break" the loop; when the user does this, the program returns a suboptimal solution of the optimization problem. The second for loop would thus represent a simple implementation of a random search.

2 Commenti
Mostra Nessuno Nascondi Nessuno

Davide Zorzenon il 26 Apr 2023

See here for a related question

Bruno Luong il 27 Apr 2023

Modificato: Bruno Luong il 27 Apr 2023

Apri in MATLAB Online

for i = 1:1e18
    % do something
end

Even that simple loop there is a already a problem, since MATLAB cannot store distinct integers up to 1e18, which is beyond the capacity of 53 bits mantissa coding. You don't loop over all integers up to 1e18 as you would expected.

Accedi per commentare.

Accedi per rispondere a questa domanda.

Follow Question

Answer 1

Steven Lord il 25 Apr 2023

Apri in MATLAB Online

1 voto

How long does your loop body take to execute? If it's one loop body execution per second, running all 1e18 of those iterations would take:

y = years(seconds(1e18))
y = 3.1689e+10

31 billion years. According to Wikipedia's timeline of the far future Earth likely will have been destroyed about 23 billion years before the end of execution of that loop, when the Sun expands into a red giant.

If you could do a thousand loop body executions per second:

y = years(seconds(1e18/1000))
y = 3.1689e+07

you're looking at "only" 3 million years. But it'll take fewer days to run the last couple iterations of that loop, since a day on Earth is expected to be one minute longer than it is today.

You can't exactly represent all the numbers between 1 and 1e18 as double precision numbers. So your attempt to choose a random integer-valued double in that range will fail.

1e18 > flintmax
ans = logical
   1

Going to uint64 would help.

1e18 > double(intmax('uint64'))
ans = logical
   0

So you could use randi to generate two 32-bit integers and typecast the result to uint64. [randi does not let you directly generate 64-bit integers.]

x = typecast(randi([intmin('uint32'), intmax('uint32')], 1, 2, 'uint32'), 'uint64')
x = uint64
    11258641691898087842

Reject any trials where x is greater than:

u = uint64(10)^18
u = uint64
    1000000000000000000

The odds of repeating the same value of x over a reasonable amount of time is likely very, very small.

8 Commenti
Mostra 6 commenti meno recenti Nascondi 6 commenti meno recenti

Steven Lord il 25 Apr 2023

Apri in MATLAB Online

Note that the documentation page for the for keyword lists three syntaxes in the Description section. MATLAB can optimize the first two cases, initVal:endVal and initVal:step:endVal, to avoid explicitly creating the vector. It's this optimization that allows something like the following to "work" (== "not error".) You're going to want to Ctrl-C out of the execution before it ends; the drawnow call is to let MATLAB have a chance to "notice" the Ctrl-C. I'm not going to try to run this on MATLAB Answers for what I hope are obvious reasons.

for k = 1:Inf
    drawnow
end

Back when we had a 32-bit MATLAB, I think that took only a couple minutes to run because the warning mentioned something near intmax('int32') and 2 billion-ish iterations of a loop that did nothing or next to nothing was still pretty quick. On a 64-bit MATLAB, it would take much, much longer.

Your approach with randperm uses the third syntax, valArray. Since there's no definitive ordering to the elements of randperm MATLAB can use to optimize this case it has to try to evaluate the function call and use its return value as the array over which to iterate. MATLAB doesn't even get to start trying to iterate; creating the vector of values fails.

As for detecting duplicates, anything that lets you record values and determine if a value is present in the record (a vector and ismember, a dictionary and isKey, a logical sparse matrix [which you'd need to use mod and rem or use ind2sub on your uint64 value to generate indices into, since you can't create one with 1e18 rows but you could create one with 1e14 rows and 1e4 columns] and indexing) would theoretically work.

Davide Zorzenon il 25 Apr 2023

Modificato: Davide Zorzenon il 26 Apr 2023

Apri in MATLAB Online

@Torsten: note that what you wrote is not true. Calling randperm(n) returns a row vector as 1:n, therefore i assumes scalar values in the for loop. You can see this by running

for i = randperm(4)
  i
end
i = 4
i = 3
i = 1
i = 2

and comparing the result with

for i = randperm(4)'
  i
end
i = 4×1
     2
     4
     3
     1

@Steven Lord: that's interesting, thanks. I guess I should have read once more the documentation on the for keyword. Still, I would be curious to know whether something I have in mind can be done. I try to formulate the question in a different way. Is there a function res = myfunc(i,n), where i <= n and res is a positive integer such that res <= n, with the following properties:

myfunc(i,n) ~= myfunc(j,n) for all i ~= j,
the output of the function with respect of i (i.e., with fixed n) "feels random" (of course this requirement is not very precise),
running myfunc(i,1e18) does not return a memory error.

Note that without the second specification a valid solution would be the identity function, i.e., the function that returns res = i.

If such a function exists, then I could use it in the following way.

for ii = 1:1e18
  i = myfunc(ii,1e18);
  % do something
end
'myfunc' is used in Create Function to Represent a Grey-Box Model.

Walter Roberson il 25 Apr 2023

@Davide Zorzenon

Suppose that you had a random permutation generator that could generate additional numbers incrementally rather than having to generate the entire sequence at once (which would have a heavy memory penalty).

Now, at any given time, how does the algorithm know whether it has generated a particular value already if there is not at least a binary "already used" map, size proportional to log2(n) ? (so, 2^57 bytes in the case of n = 1e18) ?

The answer is that it would have to be an algorithm that cannot generate a previous value, and so does not need to keep track of where it has been.

Do such algorithms exist? Well, yes and no. In the case where the n to generate over is a prime, then you can use a https://en.wikipedia.org/wiki/Linear_congruential_generator Linear Congruential Generator . With proper n and proper selection of parameters, such generators are certain not to repeat until they reach the end of the sequence. The problem is that the random statistics are Not Good. For example if you plot the values over a 2D box, you will see distinct lines being filled out, rather than the space being filled randomly.

Are there alternatives? Maybe. The discussion at https://en.wikipedia.org/wiki/Mersenne_Twister refers to some alternatives. Mersenne Twister itself is not suitable for such a purpose: it uses a state vector of 624 words, and any one word can (will!!) repeat before the end of the overall period; when used as a pseudo-random number generator, it is deliberate that MT can repeat individual entries: in most cases when a permulation is not being generated, if the generator cannot produce duplicates before the end of the period, then you would be able to use that fact to tell that you are not generating true random integers. (With true random integers, the more integers you generate, the more likely it becomes that you will generate an integer that has already been generated.)

Davide Zorzenon il 26 Apr 2023

@Walter Roberson: thank you for your insightful comment. This is exactly what I was looking for.

Accedi per commentare.

Answer 2

Bruno Luong il 27 Apr 2023

Modificato: Bruno Luong il 27 Apr 2023

Apri in MATLAB Online

0 voti

If you don't care about finding optimal solution, why not just do

n = 1e15; % 1e18 has problem of coding as in my comment above
for j = 1:n
    i = randi(n);
    % do something with i
    fprintf('i=%d\n',i);
end

This is more or less has periodicity as Walter comment of LCG (yes behind rand, randi, randperm is Merssene Twister), but collision occurs when compress in smaller range 1:n. But you say you want to find suboptimal solution,so collision should only have negligible impact to runing time and not on randomness covering.

5 Commenti
Mostra 3 commenti meno recenti Nascondi 3 commenti meno recenti

Davide Zorzenon il 27 Apr 2023

Modificato: Davide Zorzenon il 27 Apr 2023

The idea is however that the code should work also when n is small (admittedly, this was not explained in my question). Suppose that n = 2, with your code it is possible that i will only assume the value of 1 (twice), thus never exploring the whole search space.

To be more clear: a priori I do not know how large n is; it is up to the user to decide that. If n is very small, the user will probably decide to wait the for loop to terminate, and the code should explore the whole search space. If n is "medium size" and the code will take some hours to complete, I want the user to decide for him/herself whether it is worth to wait: if it is worth to wait, the code should explore the whole search space, if not, the user will decide to stop earlier and explore only a subspace. Finally, if n is very large (like 1e15 kind of large), the code will definitely not terminate; still, it should not throw an error because of memory issues, and the user should be given the possibility to break the loop at will.

Ideally, I would like the code to be general, without having to define what "small", "medium", or "large size" means, in terms of n. It should work on a computer from NASA and on a Raspberry PI in the same way, without setting parameters that define when to change strategies for the random serarch based on how much memory the user's computer has.

It seems that a linear congruential generator with a smart choice of its parameters should do the trick (even though its random statistics are not that good), as suggested here by @Walter Roberson.

Bruno Luong il 27 Apr 2023

Modificato: Bruno Luong il 27 Apr 2023

Remember my warning: you cannot store in standard matlab double any integer larger than 2^53 ~ 1e16. Let alone doing any meaningful arithmetics and computing prime beyond that.

Walter Roberson il 27 Apr 2023

https://en.m.wikipedia.org/wiki/Linear-feedback_shift_register

Accedi per commentare.

For loop over permutations of 1:n with very large n

2 Commenti
Mostra Nessuno Nascondi Nessuno

Risposta accettata

8 Commenti
Mostra 6 commenti meno recenti Nascondi 6 commenti meno recenti

Più risposte (1)

5 Commenti
Mostra 3 commenti meno recenti Nascondi 3 commenti meno recenti

Categorie

Prodotti

Tag

Community Treasure Hunt

For loop over permutations of 1:n with very large n

2 Commenti Mostra Nessuno Nascondi Nessuno

Risposta accettata

8 Commenti Mostra 6 commenti meno recenti Nascondi 6 commenti meno recenti

Più risposte (1)

5 Commenti Mostra 3 commenti meno recenti Nascondi 3 commenti meno recenti

Categorie

Prodotti

Tag

Vedere anche

Community Treasure Hunt

2 Commenti
Mostra Nessuno Nascondi Nessuno

8 Commenti
Mostra 6 commenti meno recenti Nascondi 6 commenti meno recenti

5 Commenti
Mostra 3 commenti meno recenti Nascondi 3 commenti meno recenti