The proper way to sample 3 normally or lognormally distributed variables that add up to 1

Hello team,
Is there any way to generate 3 normally or lognormally distributed variables that add up to 1?
For example, I have a human tissue with a volume of 1 L. Three components make up this tissue: vascular space (R_vas), interstitial space (R_int), and cellular space (R_cell). Thus, in this case, R_vas + R_int + R_cell = 1.
To account for variation between human individuals, I would like to sample these three parameters from normal or lognormal distributions (R_vas, R_int and R_cell have means of 0.1, 0.1 and 0.8 and coefficients of variation of 0.3, 0.3 and 0.3, respectively) while ensuring that they add up to 1.
I don't think I can just randomly sample two of the parameters and subtract their sum from 1 to get the third one. In that case, I think the third parameter would not follow the predefined mean and coefficient of variation. Or would it?
How could I achieve this?
It would be great to have your help.

Accepted Answer

Let me answer your second question separately. How would you sample three NORMALLY distributed random variables that sum to 1? This part is easy, since it merely requires you to properly construct the necessary covariance matrix.
First, assume we have three variables, with means and standard deviations:
mu = [.1 .1 .8];
sigma = [.3 .3 .3];
But really, we need a covariance matrix that is not a diagonal one. I'll construct it as if I knew the eigenvalue decomposition.
Q = [null([1 1 1]),[1;1;1]/sqrt(3)]
Q = 3×3
   -0.5774   -0.5774    0.5774
    0.7887   -0.2113    0.5774
   -0.2113    0.7887    0.5774
C = Q*diag(3/2*[.3^2 .3^2 0])*Q'
C = 3×3
    0.0900   -0.0450   -0.0450
   -0.0450    0.0900   -0.0450
   -0.0450   -0.0450    0.0900
As you can see, the STANDARD DEVIATIONS (the square roots of the diagonal elements of the covariance matrix, which hold the variances) are each 0.3. But the variables are now correlated.
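A quick way to see why this construction works, as a sketch in base MATLAB: the vector of ones lies in the null space of C, so the variance of the sum, ones(1,3)*C*ones(3,1), is zero, while the diagonal of C still carries the requested spread. The variable names varOfSum, sd and ev are mine and are only there for illustration.

```matlab
% Rebuild the singular covariance matrix from the answer.
Q = [null([1 1 1]), [1;1;1]/sqrt(3)];
C = Q*diag(3/2*[.3^2 .3^2 0])*Q';

% Variance of X1+X2+X3 is 1'*C*1; it should vanish up to rounding.
varOfSum = ones(1,3)*C*ones(3,1);

% Standard deviations are the square roots of the diagonal of C.
sd = sqrt(diag(C)).';

% Eigenvalues: one (numerically) zero and two equal positive values.
ev = sort(eig((C + C')/2)).';

fprintf('variance of sum     = %.2e\n', varOfSum);
fprintf('standard deviations = %s\n', mat2str(sd, 4));
fprintf('eigenvalues         = %s\n', mat2str(ev, 4));
```

The zero eigenvalue belongs to the direction [1;1;1]/sqrt(3), which is exactly the direction in which the samples are not allowed to move.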
Now we can generate a set of numbers that has the desired property, at least over the long term, and to within floating point trash. What I did was to carefully construct a singular covariance matrix.
X = mvnrnd(mu,C,10000);
hist(sum(X,2))
mean(X)
ans = 1×3
0.1037 0.1011 0.7952
std(X)
ans = 1×3
0.2987 0.3059 0.2997
format long g
[min(sum(X,2)),max(sum(X,2))]
ans = 1×2
0.9999999675926 1.00000003250349
So to within floating point trash, they sum to 1. At least as close as we can come based on double precision arithmetic in what I did. They have the desired means and variances.
Note that these variables may pose a problem for what you are doing, in that some of the elements will often be negative.
hist(min(X,[],2))
That is a likely event when the sum is required to be 1, and you are asking about NORMALly distributed random variables. This is also a reason why your goal MUST fail for the lognormal case.
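To put a number on how often negative values occur, here is a small sketch assuming the same means and the common standard deviation of 0.3 used in this answer. Each component on its own is normal, so the chance that it is negative is Phi(-mean/0.3), written below with erfc from base MATLAB. The sampling step uses randn and a two-column factor F of the covariance matrix instead of mvnrnd; that substitution and the variable names are mine, not part of the answer. It relies on implicit expansion (R2016b or later).

```matlab
mu    = [.1 .1 .8];
sigma = 0.3;

% Marginal probability that each component is negative: Phi(-mu/sigma).
pNegExact = 0.5*erfc((mu/sigma)/sqrt(2));

% Monte Carlo check. F is 3x2 with F*F' equal to the singular C of the
% answer, so mu + Z*F' has covariance C and rows that sum to 1.
F = null([1 1 1])*sqrt(3/2)*sigma;
rng(0);                          % arbitrary seed, for repeatability only
X = mu + randn(1e5, 2)*F';

pNegMC  = mean(X < 0, 1);        % per-component fraction of negatives
pAnyNeg = mean(any(X < 0, 2));   % fraction of samples with any negative entry

disp([pNegExact; pNegMC]);
disp(pAnyNeg);
```

The exact marginal values should come out near 0.37 for the two small components and well below 0.01 for the large one, so negative volumes are far from a rare event with these settings.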

8 Comments

Would this technique work if the specified variances of the RVs are all different? For example, can the appropriate C be constructed with variances of 0.3^2, 1.0^2, and 1.5^2?
You cannot freely specify the variances of all three components of RV, since RV(3) = 1-RV(1)-RV(2), so
var(RV(3)) = var(RV(1)) + var(RV(2)) + 2*E(RV(1)*RV(2))
E(.) is the expectation. Note that abs(E(RV(1)*RV(2))) <= std(RV(1)) * std(RV(2))
The same equation holds for any permutation of 1-2-3.
However, you can specify any symmetric positive semidefinite 2x2 submatrix; the expansion to the full 3x3 covariance matrix is then uniquely determined by the equation sum(RV)=1.
I agree with you that we can't freely specify variance for all three components. However, I'm not sure yet I agree with that formula that relates the variances for all three components. Example
C = [2.590000000000000e+00 -2.015000000000000e+00 -5.750000000000000e-01
-2.015000000000000e+00 1.690000000000000e+00 3.250000000000000e-01
-5.750000000000000e-01 3.250000000000000e-01 2.500000000000000e-01];
mu = [2.000000000000000e-01 3.000000000000000e-01 5.000000000000000e-01];
rng(145);
X = mvnrnd(mu,C,1e5);
histogram(sum(X,2)); % verify sum = 1
format long e
[var(X(:,1)) + var(X(:,2)) + 2*mean(X(:,1).*X(:,2)) , var(X(:,3))]
ans = 1×2
3.680143748260329e-01 2.485406908448432e-01
I think the correct formula is (have to subtract off the expected values in the third term in the sum)
[var(X(:,1)) + var(X(:,2)) + 2*mean((X(:,1)-mean(X(:,1))).*(X(:,2)-mean(X(:,2)))) , var(X(:,3))]
ans = 1×2
2.485808586014633e-01 2.485406908448432e-01
Yes, I had zero-mean variables in mind (wrongly) when I wrote the formula.
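For reference, the corrected relation follows directly from X_3 = 1 - X_1 - X_2, because adding a constant or flipping a sign does not change a variance. Writing X_i for RV(i):

```latex
\begin{aligned}
\operatorname{Var}(X_3) &= \operatorname{Var}(1 - X_1 - X_2) = \operatorname{Var}(X_1 + X_2) \\
                        &= \operatorname{Var}(X_1) + \operatorname{Var}(X_2) + 2\,\operatorname{Cov}(X_1, X_2), \\
\operatorname{Cov}(X_1, X_2) &= E[X_1 X_2] - E[X_1]\,E[X_2], \qquad
\lvert \operatorname{Cov}(X_1, X_2) \rvert \le \operatorname{std}(X_1)\,\operatorname{std}(X_2).
\end{aligned}
```

With the example C from this comment thread, 2.59 + 1.69 + 2*(-2.015) = 0.25, which is C(3,3). The two forms coincide only when E[X_1]*E[X_2] = 0, which is the zero-mean case.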
Hello John,
Thank you for the detailed explanations that cleared up my misunderstanding, and for providing a solution that does quite a good job for this case.
Jesse
I would attack the problem this way
N = [eye(2);[-1 -1]];
N forms a basis for the null space of [1 1 1]
[1 1 1]*N
ans = 1×2
0 0
Define C as follows
syms sigma_x sigma_y rho_xy
C = N*[sigma_x^2 rho_xy*sigma_x*sigma_y;rho_xy*sigma_x*sigma_y sigma_y^2]*N.'
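C =
[sigma_x^2, rho_xy*sigma_x*sigma_y, - sigma_x^2 - rho_xy*sigma_x*sigma_y]
[rho_xy*sigma_x*sigma_y, sigma_y^2, - sigma_y^2 - rho_xy*sigma_x*sigma_y]
[- sigma_x^2 - rho_xy*sigma_x*sigma_y, - sigma_y^2 - rho_xy*sigma_x*sigma_y, sigma_x^2 + 2*rho_xy*sigma_x*sigma_y + sigma_y^2]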
We see that we can define the covariance of X and Y (upper 2x2 block), but we get what we get for the variance of Z and its correlation with the other two variables.
Example from @John D'Errico formed by
C = N*[.3^2 -0.5*0.3*.3;-0.5*.3*.3 .3^2]*N'
C = 3×3
    0.0900   -0.0450   -0.0450
   -0.0450    0.0900   -0.0450
   -0.0450   -0.0450    0.0900
But we can pick other parameters. For example
C = N*[0.3^2 0.7*0.3*1.3;0.7*0.3*1.3 1.3^2]*N'
C = 3×3
    0.0900    0.2730   -0.3630
    0.2730    1.6900   -1.9630
   -0.3630   -1.9630    2.3260
mu = [0.2 0.3 0.5]; % must sum to 1
histogram(sum(mvnrnd(mu,C,1e5),2));
Geometrically, the ellipsoid is collapsing down to an ellipse centered on mu and lying in the plane x+y+z=1. There's probably even a further degenerate case where the ellipsoid collapses to a line going through mu and lying in that plane.
The problem is, as John pointed out, that x, y, z can take negative values. If negative quantities have a physical interpretation for the fractional volumes of human tissue, then it's OK.
As John did, I'm just addressing the question of how to specify mu and C such that mvnrnd(mu,C,n) returns values that sum to unity. Maybe that's not the correct question to ask for what the OP is really interested in.
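To make "we get what we get" concrete, the construction from N can be wrapped in a small helper that reports the implied standard deviation of Z and its correlations with X and Y. This is only a sketch: buildC is a hypothetical name of mine, and the inputs are the ones from the second example in this comment.

```matlab
N = [eye(2); [-1 -1]];   % basis for the null space of [1 1 1]

% Hypothetical helper (not from the thread): full 3x3 covariance from the
% X-Y standard deviations and their correlation.
buildC = @(sx, sy, rho) N*[sx^2 rho*sx*sy; rho*sx*sy sy^2]*N.';

C = buildC(0.3, 1.3, 0.7);

sdZ     = sqrt(C(3,3));                  % implied standard deviation of Z
rhoXZ   = C(1,3)/sqrt(C(1,1)*C(3,3));    % implied correlation of X and Z
rhoYZ   = C(2,3)/sqrt(C(2,2)*C(3,3));    % implied correlation of Y and Z
rowSums = sum(C, 2).';                   % all zero: X+Y+Z has no variance

fprintf('sdZ = %.4f, rhoXZ = %.4f, rhoYZ = %.4f\n', sdZ, rhoXZ, rhoYZ);
```

Only sx, sy and rho are free inputs here; everything involving Z is an output, which is the point made in this comment.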


More Answers (3)

You can define a joint probability distribution of the three variables on the triangle
x + y + z = 1, x, y, z >= 0
but this cannot be a normal or lognormal distribution for each of the random variables as you suggested.
You can also divide the result of individual sampling by R_vas + R_int + R_cell, but this will change the assumed distributions for R_vas, R_int and R_cell.
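The effect of dividing by the sum can be checked numerically. The sketch below assumes independent lognormal draws with the means and coefficients of variation from the question, built from randn with the standard moment formulas rather than with lognrnd; the variable names are mine. It relies on implicit expansion (R2016b or later).

```matlab
m  = [0.1 0.1 0.8];      % target means from the question
cv = [0.3 0.3 0.3];      % target coefficients of variation

% Parameters of log(X) for a lognormal X with mean m and CV cv.
s2    = log(1 + cv.^2);
muLog = log(m) - s2/2;

rng(0);                                    % arbitrary seed
n  = 1e5;
R  = exp(muLog + sqrt(s2).*randn(n, 3));   % independent lognormal draws
Rn = R ./ sum(R, 2);                       % rescale each row to sum to 1

cvBefore = std(R)  ./ mean(R);
cvAfter  = std(Rn) ./ mean(Rn);

disp([mean(R); mean(Rn)]);    % means before and after rescaling
disp([cvBefore; cvAfter]);    % CVs before and after rescaling
```

Before rescaling, the coefficients of variation should sit near 0.3. Afterwards the one for R_cell should shrink markedly, because R_cell dominates the sum it is divided by.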
Magic?
You have three variables, with means that will at least get you in the right ballpark. The goal, however, is to ensure the sum is exactly 1. What property does a lognormal variable have? You can generate one by generating a normally distributed random variable, and then exponentiating it. So effectively, the log of a lognormal variate is Normally distributed.
Now, if your goal was to find three lognormally distributed variables, where the PRODUCT was 1, this problem would be far easier. You find a set of three normal variables where the sum was zero. When you exponentiate, the product is automatically 1. And finding a set of normal variates with a sum of zero is almost trivial. (No reason to get into that here, as it is not pertinent to your problem.)
Anyway, you are correct in that you CANNOT just sample two variates, then choose the third to force that sum to 1. But let me back up. Does this entire question even make mathematical sense? Sadly, no. Remember that a lognormal variable lives on the interval (0,inf). So there is a real chance that any one of those variables is itself greater than 1. And when that happens, the sum can NEVER be 1, because the other two terms in the sum can never be negative. So you cannot have a situation where the sum is constrained to be a constant. Sorry.
At best, you could think about a set of TRUNCATED lognormal variables, that sum to 1. And even then, the truncation point would be difficult to quantify.
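To gauge how large that chance is for the numbers in the question, here is a small sketch. It assumes each variable on its own is lognormal with the stated mean and coefficient of variation, and converts those to the parameters of log(X) with the standard moment formulas; the variable names are mine.

```matlab
m  = [0.1 0.1 0.8];             % means from the question
cv = [0.3 0.3 0.3];             % coefficients of variation from the question

s     = sqrt(log(1 + cv.^2));   % standard deviation of log(X)
muLog = log(m) - s.^2/2;        % mean of log(X)

% P(X > 1) = P(log X > 0) = 1 - Phi((0 - muLog)/s), written with erfc.
pExceed = 0.5*erfc((-muLog./s)/sqrt(2));
disp(pExceed);
```

By my calculation this gives roughly 0.18 for R_cell and essentially zero for the other two, so nearly one draw in five of R_cell alone would already exceed the total volume.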
This will generate 3 positive random variables that sum to 1, with means 8/10, 1/10, 1/10:
nu=[8 1 1];
B=arrayfun(@(n) ones(1,n),nu,'unif', 0);
A=blkdiag(B{:});
m=size(A,2);
p=3; % larger p will give smaller std
% these two lines draw p points uniformly on the simplex (each of the
% p columns sums to 1) and then average them
r=-log(rand(m,p));
r = sum(r./sum(r,1),2)/p;
%
r=A*r;
r
r = 3×1
    0.7922
    0.1397
    0.0680
It will NOT be normal, but it is kind of "normal" in the sense of the central limit theorem, i.e. the limit of a sum of a "large" number of independent random variables.

2 Comments

Here is the histogram of the results using my code. It will not be a Gaussian curve but a skewed bell shape:
When you increase p, the histogram will be narrower and less skewed.
nu=[8 1 1];
B=arrayfun(@(n) ones(1,n),nu,'unif', 0);
A=blkdiag(B{:});
m=size(A,2);
p=3; % larger p will give smaller std
rtab = zeros(size(A,1),1e6);
for k=1:size(rtab,2)
    % these two lines draw p points uniformly on the simplex (each of the
    % p columns sums to 1) and then average them
    r=-log(rand(m,p));
    r = sum(r./sum(r,1),2)/p;
    %
    rtab(:,k)=A*r;
end
figure
subplot(2,2,1)
histogram(rtab(1,:))
subplot(2,2,3)
histogram(rtab(2,:))
subplot(2,2,4)
histogram(rtab(3,:))
Run with p = 20
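The loop can also be replaced by array operations, which makes it easy to check the sums, means and spread directly. This is only a sketch of the same idea: the 3-D layout, the seed and the smaller sample count n are my choices, and it relies on implicit expansion (R2016b or later).

```matlab
nu = [8 1 1];            % group sizes; the means will be nu/sum(nu)
B  = arrayfun(@(k) ones(1,k), nu, 'UniformOutput', false);
A  = blkdiag(B{:});      % 3x10 matrix that sums the entries of each group
m  = size(A, 2);
p  = 20;                 % number of simplex draws averaged per sample
n  = 1e4;                % number of samples (kept modest to limit memory)

rng(0);                          % arbitrary seed
E = -log(rand(m, p, n));         % unit exponential draws
D = E ./ sum(E, 1);              % each column sums to 1 (uniform on the simplex)
r = reshape(mean(D, 2), m, n);   % average the p draws for every sample
rtab = A*r;                      % 3-by-n table, one sample per column

disp(mean(rtab, 2).');           % sample means
disp(std(rtab, 0, 2).');         % sample standard deviations
```

The sample means should land close to 0.8, 0.1 and 0.1. Note that p is the only knob for the spread, so the three standard deviations cannot be set independently with this scheme.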
Hello Bruno,
Thank you for the solutions.
This somehow could solve the problem.
Jesse


Release

R2022b
