Random triplets not asymmetrically distributed

3 visualizzazioni (ultimi 30 giorni)
I would like to generate many triplets whose values are drawn from {0.1, 0.2, ... 0.9}. I also want each triplet to sum to 1.0. I use the following code to generate them. However, when I individually histogram the 1st, 2nd, and 3rd element across all triplets, they are nonuniform and asymmetric.
Can anyone explain why the distribution is nonuniform? I can sort of guess that it is due to the constraint that they sum to 1.0.
However, it seems odd that the distribution is skewed. There should be no difference in the upward direction toward 1.0 or the downward direction toward 0.0, as the problem is completely symmetric.
The only possible cause for asymmetry is that rounding is slightly asymmetric, but just by smidgeon.
nRows=1e7; % Number of triplets
w = rand(nRows,3); % Generate row triplets
w = w./sum(w,2); % Normalize each triplet to sum to 1.0
w = round(w,1); % Round to 1 decimal place
w = w( sum(w,2) == 1.0 , : ); % Retain triplets that sum to 1.0
w = w( all(w,2) , : ); % Discard triplets containing zero
% Individually histogram 1st, 2nd, and 3rd element across all
% triplets
for iw = 1:3
subplot(3,1,iw)
histogram( w(:,iw), 0.05:0.1:.95 )
end % for iw
mean(w)
ans = 1×3
0.3259 0.3258 0.3483
  2 Commenti
the cyclist
the cyclist il 12 Apr 2022
Modificato: the cyclist il 12 Apr 2022
I ran your code in the interface here.
What is puzzling to me is not the skewness of the distributions themselves, but rather the fact that the distribution of the third column is different from the 1st and 2nd columns. Note that I appended a calculation of the mean value of each column, and they are not equal. (It's not just sampling error.)
Maybe I'm overlooking something obvious, but I see no point in the code in which the columns of w are not treated identically from a coding perspective. Superficially, I feel that this algorithm should work.
The line that causes the problem is the one with the sum:
w = w( sum(w,2) == 1.0 , : ); % Retain triplets that sum to 1.0
If you shuffle the values within each row after that point, you will get the expected result:
nRows=1e6; % Number of triplets
w = rand(nRows,3); % Generate row triplets
w = w./sum(w,2); % Normalize each triplet to sum to 1.0
w = round(w,1); % Round to 1 decimal place
w = w( sum(w,2) == 1.0 , : ); % Retain triplets that sum to 1.0
for nr = 1:size(w,1)
w(nr,:) = w(nr,randsample(3,3));
end
w = w( all(w,2) , : ); % Discard triplets containing zero
% Individually histogram 1st, 2nd, and 3rd element across all
% triplets
figure
for iw = 1:3
subplot(3,1,iw)
histogram( w(:,iw), -0.05:0.1:1.05 )
end % for iw
mean(w)
ans = 1×3
0.3336 0.3332 0.3332
I speculate that MATLAB is doing something clever when evaluating
sum(w,2) == 1.0
that leads to w(:,3) being the "odd man out".
FM
FM il 12 Apr 2022
Modificato: FM il 12 Apr 2022
Thank you, "the cyclist". My experiments show that you are on the right track, though it's not MATLAB's cleverness so much as my forgetting numerical basics. Here are the troubleshooting steps:
nRows=1e7; % Number of triplets
w = rand(nRows,3); % Generate row triplets
w = w./sum(w,2); % Normalize each triplet to sum to 1.0
w = round(w,1); % Round to 1 decimal place
TripletSums = sum(w,2); % Triplet sums
ifSum1 = TripletSums == 1.0; % Triplet rows that sum to 1.0
% Confirm that rows apparently sum to 1.0 actually do
[ min(TripletSums(ifSum1)) max(TripletSums(ifSum1)) ] % [1 1] yay!
% Check if rows not summing to 1.0 actually don't
diff1 = abs( 1.0 - TripletSums(~ifSum1) ); % Differences from 1.0
[ min(diff1) max(diff1) ] % [0 1] oh no!
What this shows is that some of the rows that apparently don't sum to 1.0 actually do, since the difference is zero. At least one source of this problem is the identification of rows using "ifSum1 = TripletSums==1.0", which probably leaves out rows that are slightly different than 1.0 due to numerical noise. To correct this, note that the decimal tails of the numbers in the triplets contain only 1 digit, so any numerical discrepancy between the sum and 1.0 should be way less than 0.01. The new definition of ifSum1 checks for this:
nRows=1e7; % Number of triplets
w = rand(nRows,3); % Generate row triplets
w = w./sum(w,2); % Normalize each triplet to sum to 1.0
w = round(w,1); % Round to 1 decimal place
TripletSums = sum(w,2); % Triplet sums
ifSum1 = abs(TripletSums - 1.0) < 0.01 ; % Triplet rows that sum to 1.0
% Confirm that rows apparently sum to 1.0 actually do
[ min(TripletSums(ifSum1)) max(TripletSums(ifSum1)) ] % [1 1] yay!
% Check if rows not summing to 1.0 actually don't
diff1 = abs( 1.0 - TripletSums(~ifSum1) ); % Differences from 1.0
[ min(diff1) max(diff1) ] % [0.1 0.1] i.e. differs from 1.0
w = w(ifSum1,:); % Retain triplets that sum to 1.0
w = w( all(w,2) , : ); % Discard triplets containing zero
mean(w) % [ 0.3333 0.3334 0.3333 ] sanely uniform (yay!)
% Individually histogram 1st, 2nd, and 3rd element across all
% triplets
for iw = 1:3
subplot(3,1,iw)
histogram( w(:,iw), 0.05:0.1:.95 )
end % for iw
Unfortunately, the distribution is still skewed.

Accedi per commentare.

Risposta accettata

Walter Roberson
Walter Roberson il 12 Apr 2022
Please look in the File Exchange to get randfixsum() by Roger Stafford https://www.mathworks.com/matlabcentral/fileexchange/9700-random-vectors-with-fixed-sum
  4 Commenti
Bruno Luong
Bruno Luong il 12 Apr 2022
This answer is not suitable for discrete case, which is entirely different beast if uniformity is required.
FM
FM il 12 Apr 2022
Modificato: FM il 12 Apr 2022
@John D'Errico + @Bruno Luong : Yes, that's what I got from John's answer. I only browsed the paper in the recommended package, but in that short browse, I did not notice anything about the discrete nature. Thank you for confirming this.

Accedi per commentare.

Più risposte (1)

John D'Errico
John D'Errico il 12 Apr 2022
Modificato: John D'Errico il 12 Apr 2022
If you wish to sample from the set of triplets that sum to exactly 1, where each elemnt is discrete, then you do NOT want to use randfized sum. In fact, you don't want to do it by rejection either!!!!!!!!
The set of triples that sum to 1 is trivially generated. I will just generate the entire set by brute force. Simplest is like this (Note that I am using integers initially. At the end, I'll divide by 10.) Next, you should see that the MAXIMUM element must ALWAYS be 0.8, NOT 0.9. There is no way to add THREE numbers from the set (1:9)/10, and have one of them be 0.9, with the sum as 1. That should be obvious.
V = 1:8;
[V1,V2,V3] = ndgrid(V,V,V);
V123 = [V1(:),V2(:),V3(:)];
V123(sum(V123,2) ~= 10,:) = [];
size(V123,1)
ans = 36
So there are EXACTLY 36 possible ways to form that sum. No more, no less.
V123
V123 = 36×3
8 1 1 7 2 1 6 3 1 5 4 1 4 5 1 3 6 1 2 7 1 1 8 1 7 1 2 6 2 2
At the end, divide by 10.
V123 = V123/10;
Those are the ONLY ways to form the sum you want.
Now if you want to generate random triples, just generate a random integer from 1 through 36. Use that to sample from the rows of V123.
Nsets = 10000;
ind = randi([1,36],[Nsets,1]);
triples = V123(ind,:);
histogram(triples(:,1),100)
histogram(triples(:,2),100)
histogram(triples(:,3),100)
Now, I suppose that you MIGHT decide this does not make sense, that we should expect to have equally as many samples with .1 as we have .6, .7, or .8. But of course that is silly. In fact, we should expect a non-uniform distribution as we see, for each channel. Think about it.
For example, how many ways are there to get a sum that includes one element that is 0.8? EXACTLY 3 ways, that is... {[0.1 0.1 0.8], [0.1 0.8 0.1], [0.8 0.1 0.1]}.
sum(V123 == 0.8,1)
ans = 1×3
1 1 1
But now, how many ways are there for the sum to have 0.1 in it?
sum(V123 == 0.1,1)
ans = 1×3
8 8 8
So there are 8 times as many sums that include the number 0.1 in each member of the triple, than there are ways to find the number 0.8.
  3 Commenti
John D'Errico
John D'Errico il 12 Apr 2022
(Um, you can change which answer you have accepted. But i really don't care that much about being accepted. Reputation is worth nothing to me. When I die, I hope my obituary has something better to say about me than quote my reputation on a web site.)
People misunderstand what uniform would mean in a case like this.
The constraint that the numbers sum to 1 implies all sets of points MUST lie in a plane. And since they are all positive, the constraint region becomes a triangle.
V = 1:8;
[V1,V2,V3] = ndgrid(V,V,V);
V123 = [V1(:),V2(:),V3(:)];
V123(sum(V123,2) ~= 10,:) = [];
plot3(V123(:,1),V123(:,2),V123(:,3),'o')
axis equal
box on
grid on
Do you see that all points lie in a planar triangle? How about if I rotate the plot?
plot3(V123(:,1),V123(:,2),V123(:,3),'o')
axis equal
box on
grid on
view(-24,53)
The points all lie in a triangle. They are uniformly scattered in the triangle. But that means if you look at marginals, so looking at one variable only, the histogram you see will NOT be uniform.
FM
FM il 12 Apr 2022
Actually, the reason why I wish that both answers can be accepted is that both are informative. One points to a paper that sheds light on the geometric nature of the problem (which I have admittedly yet to plumb). Your answer hits on the discrete and finite nature of the options, providing code that illustrates this.
I really appreciate your explanation of the planar and triangular arrangements. It's so obvious after you explained, but I've been away from basic linear algebra for a long time. It was dead simple to dump the code into MATLAB, create the 3D plot, and grab and rotate the plot, which is very revealing.

Accedi per commentare.

Prodotti


Release

R2019a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by