Help creating an array with uniformly distributed random numbers (row-wise) comprised between 0 and 1, with each column having a sum of 1

3 visualizzazioni (ultimi 30 giorni)
Hi All,
I am trying to solve this problem, but I am not even sure this is possible.
I need to run a model 1e8+ times. To do this - making sure the model is unbiased - I need to obtain 4 random values comprised between 0 and 1 at each run, and their sum must equal 1 (these values will be area fractions).
So I have tried the following:
Random_Fraction_a = rand(1, 4); % Random Numbers
Random_Fraction_b = Random_Fraction_a ./ sum(Random_Fraction_a); % Random Numbers wich their sum = 1
However, as depicted in the figure below, 'rand' returns uniformly distributed random numbers, which is exactly what I am after, but when I normalise them between [0 1], the distribution changes completely. Now, it is my understanding that this is "statistically normal", but it is not what I need. Do you know, or can you think of any alternative way to obtain 4 random decimal values comprised between 0 and 1, wich their sum gives 1, and that are uniformly distributed row-wise?
Any help is grately appreciated!
See figure (and code) below:
% Example code
for i=1:1e5
Random_Fraction_a = rand(1, 4);
Random_Fraction_b = Random_Fraction_a ./ sum(Random_Fraction_a);
Rand_Fa(:,i)=Random_Fraction_a;
Rand_Fb(:,i)=Random_Fraction_b;
end
figure
for ii=1:4
subplot(2,2,ii)
histogram(Rand_Fa(ii,:))
hold on
histogram(Rand_Fb(ii,:),"FaceAlpha",0.5)
legend('Rand Output','After Normalisation')
end

Risposta accettata

Torsten
Torsten il 14 Lug 2023
Modificato: Torsten il 14 Lug 2023
You cannot expect the usual uniform distribution for the 4 rows of the matrix. This is impossible since each vector component poses conditions on the others in the corresponding column.
What you can expect is that columns of the matrix are uniformly distributed over the part of the hyperplane x1+x2+x3+x4 = 1 with xi >= 0.
I don't know if this serves your needs.
x = randfixedsum(4,1e5,1,0,1);
for ii=1:4
subplot(2,2,ii)
histogram(x(ii,:),"FaceAlpha",0.5)
legend('After Normalisation')
end
function [x,v] = randfixedsum(n,m,s,a,b)
% [x,v] = randfixedsum(n,m,s,a,b)
%
% This generates an n by m array x, each of whose m columns
% contains n random values lying in the interval [a,b], but
% subject to the condition that their sum be equal to s. The
% scalar value s must accordingly satisfy n*a <= s <= n*b. The
% distribution of values is uniform in the sense that it has the
% conditional probability distribution of a uniform distribution
% over the whole n-cube, given that the sum of the x's is s.
%
% The scalar v, if requested, returns with the total
% n-1 dimensional volume (content) of the subset satisfying
% this condition. Consequently if v, considered as a function
% of s and divided by sqrt(n), is integrated with respect to s
% from s = a to s = b, the result would necessarily be the
% n-dimensional volume of the whole cube, namely (b-a)^n.
%
% This algorithm does no "rejecting" on the sets of x's it
% obtains. It is designed to generate only those that satisfy all
% the above conditions and to do so with a uniform distribution.
% It accomplishes this by decomposing the space of all possible x
% sets (columns) into n-1 dimensional simplexes. (Line segments,
% triangles, and tetrahedra, are one-, two-, and three-dimensional
% examples of simplexes, respectively.) It makes use of three
% different sets of 'rand' variables, one to locate values
% uniformly within each type of simplex, another to randomly
% select representatives of each different type of simplex in
% proportion to their volume, and a third to perform random
% permutations to provide an even distribution of simplex choices
% among like types. For example, with n equal to 3 and s set at,
% say, 40% of the way from a towards b, there will be 2 different
% types of simplex, in this case triangles, each with its own
% area, and 6 different versions of each from permutations, for
% a total of 12 triangles, and these all fit together to form a
% particular planar non-regular hexagon in 3 dimensions, with v
% returned set equal to the hexagon's area.
%
% Roger Stafford - Jan. 19, 2006
% Check the arguments.
if (m~=round(m))|(n~=round(n))|(m<0)|(n<1)
error('n must be a whole number and m a non-negative integer.')
elseif (s<n*a)|(s>n*b)|(a>=b)
error('Inequalities n*a <= s <= n*b and a < b must hold.')
end
% Rescale to a unit cube: 0 <= x(i) <= 1
s = (s-n*a)/(b-a);
% Construct the transition probability table, t.
% t(i,j) will be utilized only in the region where j <= i + 1.
k = max(min(floor(s),n-1),0); % Must have 0 <= k <= n-1
s = max(min(s,k+1),k); % Must have k <= s <= k+1
s1 = s - [k:-1:k-n+1]; % s1 & s2 will never be negative
s2 = [k+n:-1:k+1] - s;
w = zeros(n,n+1); w(1,2) = realmax; % Scale for full 'double' range
t = zeros(n-1,n);
tiny = 2^(-1074); % The smallest positive matlab 'double' no.
for i = 2:n
tmp1 = w(i-1,2:i+1).*s1(1:i)/i;
tmp2 = w(i-1,1:i).*s2(n-i+1:n)/i;
w(i,2:i+1) = tmp1 + tmp2;
tmp3 = w(i,2:i+1) + tiny; % In case tmp1 & tmp2 are both 0,
tmp4 = (s2(n-i+1:n) > s1(1:i)); % then t is 0 on left & 1 on right
t(i-1,1:i) = (tmp2./tmp3).*tmp4 + (1-tmp1./tmp3).*(~tmp4);
end
% Derive the polytope volume v from the appropriate
% element in the bottom row of w.
v = n^(3/2)*(w(n,k+2)/realmax)*(b-a)^(n-1);
% Now compute the matrix x.
x = zeros(n,m);
if m == 0, return, end % If m is zero, quit with x = []
rt = rand(n-1,m); % For random selection of simplex type
rs = rand(n-1,m); % For random location within a simplex
s = repmat(s,1,m);
j = repmat(k+1,1,m); % For indexing in the t table
sm = zeros(1,m); pr = ones(1,m); % Start with sum zero & product 1
for i = n-1:-1:1 % Work backwards in the t table
e = (rt(n-i,:)<=t(i,j)); % Use rt to choose a transition
sx = rs(n-i,:).^(1/i); % Use rs to compute next simplex coord.
sm = sm + (1-sx).*pr.*s/(i+1); % Update sum
pr = sx.*pr; % Update product
x(n-i,:) = sm + pr.*e; % Calculate x using simplex coords.
s = s - e; j = j - e; % Transition adjustment
end
x(n,:) = sm + pr.*s; % Compute the last x
% Randomly permute the order in the columns of x and rescale.
rp = rand(n,m); % Use rp to carry out a matrix 'randperm'
[ig,p] = sort(rp); % The values placed in ig are ignored
x = (b-a)*x(p+repmat([0:n:n*(m-1)],n,1))+a; % Permute & rescale x
end
  6 Commenti
Paul
Paul il 15 Lug 2023
This implemenation looks a little better if allowing some tolerance on the comparison to one. Or maybe use sym and then convert to double after elimination? And of course the V variables are in a nice ndgrid pattern, so an additional random step would be needed to select a column of x, or the columns would have to be scrambled before selecting sequentially. In the plots below I added the theoretical pdf on the histograms.
v1 = 0:0.01:1;
v2 = 0:0.01:1;
v3 = 0:0.01:1;
v4 = 0:0.01:1;
n = numel(v1);
[V1,V2,V3,V4] = ndgrid(v1,v2,v3,v4);
idx = find(abs(V1+V2+V3+V4-1) <= 1e-12);
V1 = V1(idx);
V2 = V2(idx);
V3 = V3(idx);
V4 = V4(idx);
x = [V1,V2,V3,V4].';
fz = @(z) 3*(z-1).^2; % theoretical pdf
for ii=1:4
subplot(2,2,ii)
histogram(x(ii,:),"FaceAlpha",0.5,'Normalization','pdf')
hold on
plot(0:.01:1,fz(0:.01:1),'LineWidth',2)
legend('After Normalisation')
end

Accedi per commentare.

Più risposte (1)

John D'Errico
John D'Errico il 14 Lug 2023
This is a mistake I see made so frequently, that is, a misunderstanding of what it means for a sample to be uniformly distributed, but with a sum constraint. The two ideas are sort of at odds, since a sample cannot be fully uniformly distributed the way we expect, once that constraint enters into the problem.
One thought is to start in 2-dimensions. So we have x1 and x2, uniformly distributed. but we want the sum to be 1. An answer is simple, we just choose x1 to be uniform on the interval [0,1], and then compute x2=1-x1. So this is easy. We can think of the result as choosing a sample uniformly along the straight line: x1+x2=1.
x1 = rand(1,500);
x2 = 1 - x1;
plot(x1,x2,'o')
You can create the desired array now, as
X = [x1;x2];
So the columns of X are uniformly distributed, under a sum constraint. Sadly, things get messier if we want to live in 3-dimensions. We cannot simply choose x1 and x2 independently now, because some of the time, x1+x2 will be greater than 1.
The randfixedsum code solves the problem elegantly, by sampling correctly so the set ois uniform, but it will lie in a TRIANGLE. Try it.
X = randfixedsum(3,1000,1,0,1);
plot3(X(1,:),X(2,:),X(3,:),'o')
box on
grid on
axis equal
view(48,28)
Now the points are seen to be uniformly distributed inside a TRIANGLE. However, you cannot now look at the marginal distrbution of any single variable, and hope them to also look uniform. This is your mistake. In fact, the histogram will now look like a trianglular distribution, with most of the samples near zero.
histogram(X(1,:),50)
Now, try the same thing in for 10 rows. I'll use a larger sample this time to make the histogram look pretty.
X = randfixedsum(10,100000,1,0,1);
size(X)
ans = 1×2
10 100000
histogram(X(1,:),50,'normalization','pdf')
Again, just because the sample is uniform in one respect, does NOT mean the marginal distributiuon should also be uniform. If we could plot the sample with a 10 dimensionsal monitor (sorry, I don't have one) then the plot would agsin be seen to be uniformly sampled inside a 10-dimensional simplex (an analogue of a triangle, but in 10-dimensions.)
You CANNOT have everything your intuition demands. Sorry, but too often a simple intuition runs counter to mathematical fact.
  1 Commento
Simone A.
Simone A. il 15 Lug 2023
Hi @John D'Errico, thanks a lot for taking the time to explain the reationale behind it, that made everything much easier and clearer. Once again, thanks a lot!

Accedi per commentare.

Prodotti


Release

R2022b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by