how to generate a random vector with fixed sum and bounded elements

I want to generate a random N-element vector with a fixed sum, sum(x) = Etotal, where each element is bounded as 0 <= x(i) <= d(i). Etotal is the fixed sum and d specifies the upper bounds on the elements of the vector x. I would appreciate any guidance.

Accepted Answer

Maybe I am not understanding something, but it seems like your problem is overconstrained.
You can easily make a length N vector whose elements are bounded by d, for example using a uniform distribution:
x = rand(N,1).*d % where d is a vector of upper bounds for each element
You can also make a random vector with a specified sum using:
x = rand(N,1);
xfixedSum = x/sum(x)*Etotal
But it seems that in general it is not possible to satisfy both constraints at once.

12 Comments

The answer addresses the problem partially, but I need further guidance on whether it is really an over-constrained problem, and any hints on how to relax it.
Jon
Jon on 25 Feb 2020
Edited: Jon on 25 Feb 2020
I think that it is clear that your problem as described is over-constrained. For example, suppose N = 3, Etotal = 10, and d1 = d2 = d3 = 1. Clearly the largest total you could achieve would be 3, which is less than the required value of 10.
Thanks again for your time. I think I missed something in explaining the problem. The problem is set up so that sum(d) >= Etotal, so there should always be a solution. Is this still an over-constrained case?
So the problem is not necessarily over constrained, it all depends upon your choice of Etotal and d.
A necessary condition to have a solution is sum(d) >= Etotal; that is, if the components are too small there is no way they can sum up to Etotal.
Here is one approach to generating a set of uniformly distributed random vectors that satisfy the constraints.
I just have it here as a script, but you could turn this into a function if that is what you need.
% generate uniformly distributed random vectors, e, satisfying
% sum(e) = Etotal
% 0 <= e <= d
% use rejection method
numNeeded = 10; % number of random vectors that are needed
Etotal = 6;
d = [1; 4; 3]; % upper bound on components
maxTries = 1e6; % maximum number of attempts to find feasible vectors
N = length(d); % number of components
% make sure that it is possible to find solutions
if sum(d) < Etotal
    error('No solution possible, try increasing components of d')
end
% enter loop that generates trial points until it finds enough
% that satisfy the constraints
% preallocate array to hold vectors satisfying constraints
% each column of E is a vector satisfying the constraints
E = zeros(N,numNeeded);
count = 0;
numTries = 0;
while count < numNeeded && numTries < maxTries
    % make random length N-1 vector that satisfies bounds (inequality
    % constraints)
    e = rand(N-1,1).*d(1:N-1); % scale the components so 0 <= e <= d(1:N-1)
    % calculate required last component to satisfy the equality constraint
    % on the sum
    eN = Etotal - sum(e);
    % check if last component satisfies bounds
    if 0 <= eN && eN <= d(N)
        % satisfies constraints, add it to the output
        count = count + 1;
        E(:,count) = [e;eN];
    end
    % increment try counter
    numTries = numTries + 1;
end
% check if enough solutions were found
if count < numNeeded
    error(['maximum tries were exceeded,' ...
        ' try increasing bounds on d or maximum allowed number of tries'])
end
% display result
disp(E)
Digging a little more, here is an interesting, but very technical, paper that discusses this problem, in case it is of interest: http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
Jon, once again thank you very much for the kind help, especially the code and the reference material. Now I am able to generate the desired output. But the approach suffers badly as the difference (sum(d) - Etotal) narrows and the value of N increases. Is adaptive rejection sampling a better choice here, or is there another approach?
Yes, I can imagine that this somewhat "brute force" rejection method becomes inefficient as sum(d) - Etotal narrows and the value of N increases.
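To see how quickly the rejection scheme degrades, here is a quick cross-check in Python rather than MATLAB (the function name and the second test case are my own, chosen for illustration): it estimates the acceptance rate of the scheme above, first on the roomy 3-component problem from the script, then on a tighter 8-dimensional one.

```python
import random

def acceptance_rate(d, Etotal, trials=20000, seed=0):
    # Estimate the acceptance rate of the rejection scheme above:
    # draw the first N-1 components uniformly in [0, d(i)], set the
    # last one to Etotal - sum(e), and accept only if it is in [0, d(N)].
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        e = [rng.random() * di for di in d[:-1]]
        eN = Etotal - sum(e)
        if 0.0 <= eN <= d[-1]:
            accepted += 1
    return accepted / trials

# roomy problem from the script above: acceptance stays usable
loose = acceptance_rate([1, 4, 3], 6)
# tight slack in 8 dimensions: acceptance is essentially zero
tight = acceptance_rate([1] * 8, 7.9)
```

With the wide bounds roughly a third of the trials succeed, while the tight high-dimensional case almost never does, which matches the behavior reported here.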
I'm not sure how critical the performance is for your application, or how bad it gets. Of course, if the run time is still acceptable, you could live with the inefficient algorithm, but it would be nicer to have a better one.
I'm sorry, but I do not have any special experience regarding algorithms for this purpose, and so can't offer you additional help there. If you find one that you would like to implement, and have issues with your MATLAB implementation of the algorithm, you could definitely post that and would likely be able to get additional help.
Good luck with your project
Jon, I am indebted to you for your help; in fact, it gave me a big jump start. The solution you provided serves my purpose at least in one part of the problem. I am trying to find some workaround.
Hi Saifullah,
This problem has been intriguing me, and I think I developed a nice solution, if you still need it. I've attached some example code with a lot of comments but here's the overview.
We are trying to sample points on the intersection of the hyperplane defined by sum(e) = Etotal and the bounding box defined by 0 < e < d. The idea is to generate only points that lie on the hyperplane and are inside of the bounding box. We do this by first finding a particular solution that is on the hyperplane and inside of the bounding box. Using this as an origin we then generate points that are on the hyperplane using random vectors that are in the null space of [1 1 1 1 ... 1]*e = Etotal. We scale these to ensure that they do not extend beyond the boundaries of the bounding box.
The particular solution e0 is easily found when we recognize that the origin and the maximum corner of the bounding box are on opposite sides of the hyperplane. We can see this by noting that for the origin, sum([0 0 0 ... 0]) <= Etotal, and for the maximum bounding box corner, sum([d(1) d(2) d(3) ...]) >= Etotal. So the line connecting these two points, which is fully in the interior of the bounding box, must intersect the hyperplane somewhere inside of the bounding box. The vector whose tip is the intersection point is easily solved for as e0 = Etotal/sum(d)*d; this gives the particular solution we need to get started.
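The particular-solution formula is easy to verify numerically. A minimal Python sketch (the variable names are mine; d and Etotal are taken from the example script in this comment):

```python
Etotal = 7.0
d = [1.0, 0.5, 1.0, 0.9, 0.25, 1.0, 0.5, 2.0]  # upper bounds, as in the script

assert sum(d) >= Etotal  # feasibility: the max corner is on or above the hyperplane

# scale the max-corner vector back until it hits the hyperplane sum(e) = Etotal
e0 = [Etotal / sum(d) * di for di in d]

# e0 is on the hyperplane ...
assert abs(sum(e0) - Etotal) < 1e-9
# ... and inside the box, since the scale factor Etotal/sum(d) is in [0, 1]
assert all(0.0 <= x <= di for x, di in zip(e0, d))
```

The same check is what the MATLAB script performs implicitly via its feasibility test and its e0 construction.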
Please give the attached example a try, and play around with different bounding boxes and Etotals. I'm not sure if I've covered all edge cases, but it's worked for all of the examples I've tried. Please let me know if you have any questions.
I would guess that I may have reinvented the wheel here, and that this approach has been documented elsewhere but I didn't quickly come across any references which used this algorithm. I'd be interested if you find one.
% generate uniformly distributed random vectors, e, satisfying
% sum(e) = Etotal
% 0 <= e <= d
% Generate only feasible points using null space bounding box calculation
% Jon Simon 2020-03-04
numNeeded = 10000; % number of random vectors that are needed
Etotal = 7;
d = [1; 0.5; 1; 0.9; 0.25; 1; 0.5; 2]; % upper bound on components
N = length(d); % number of components
% make sure that it is possible to find solutions
if sum(d) < Etotal
    error('No solution possible, try increasing components of d')
end
% find particular solution in the interior of the bounding box
% define vector that extends from origin to maximal corner of bounding box
v = d;
% find intersection of vector with hyperplane where sum(e) = Etotal
e0 = Etotal/sum(d)*v;
% find null space of solutions
A = null(ones(1,N));
% translate coordinates of bounding box
% to a frame where origin is at e0
dU = d - e0;
dL = -e0; % original lower bound is zero
Bbnd = [dL dU]; % box bounds in translated coordinates
% generate random feasible points that are on the hyperplane and in the
% interior of the bounding box
E = zeros(N,numNeeded); % preallocate array to hold vectors
for k = 1:numNeeded
    % find random vector in null space
    w = 2*rand(N-1,1)-1; % ranges from -1 to 1
    e = A*w;
    % random point will be found in direction of e, but need to determine
    % allowable scaling so that it will be inside of the bounding box
    % for each component of e find matrix of scaling factors Alpha, where
    % Alpha(i,j)*e(i) = Bbnd(i,j); Alpha(i,1) is the scale factor
    % that would bring the ith component of Alpha(i,1)*e to the ith lower bound
    % and Alpha(i,2) is the scale factor that would bring the ith component
    % of Alpha(i,2)*e to the ith upper bound
    Alpha = Bbnd./e;
    % find the allowable range of alpha; note that since 0*e is a point that is
    % on the hyperplane and in the interior of the bounding box, alpha = 0 is always in the
    % allowable range, so the lower bound on alpha is a negative value and
    % the upper bound is a positive value
    alphaL = max(Alpha(Alpha(:)<=0)); % greatest lower bound on Alpha, use (:) to look at elements as a column
    alphaU = min(Alpha(Alpha(:)>=0)); % least upper bound on Alpha
    % randomly choose a value of the scale factor that is within the
    % allowable range
    alpha = alphaL + rand*(alphaU-alphaL);
    % generate point on hyperplane that is inside of bounding box
    E(:,k) = e0 + alpha*e; % particular solution plus null space vector
end
% check
colSums = sum(E) % gives column sums, should all equal Etotal
minComp = min(E,[],2) % gives minimum of each component over all of the points in E, should all be >= zero
maxComp = max(E,[],2) % gives maximum component over all of the points in E, should all be <= d
% display sum(d)-Etotal, which gives an indication of the difficulty of the problem
disp({'sum(d)-Etotal = ',sum(d)-Etotal})
% if successful all of these should be true
check1 = all(abs(colSums - Etotal) < 1e-10) % allow some tolerance
check2 = all(minComp >= 0)
check3 = all(maxComp <= d)
I guess I was too late to see the question posed, but yes, it is quite possible to solve the problem where bounds are applied, and you have a sum constraint. As well, the problem is quite efficiently solvable, using the correct algorithms.
Jon
Jon on 4 Mar 2020
Edited: Jon on 4 Mar 2020
@John Please look at my last algorithm, which I just posted before your comment. I'm sure it could be further optimized, but overall I think it is quite efficient, with no rejection and very few operations.
Jon, thank you very much for all your kind support. I spent some time understanding the solution yesterday, as things were well beyond my mathematical know-how. Thanks for your tons of comments about every step. Now I am trying to establish the randomness properties of the generated data, and I also want to understand the underlying math in a more formal way. I hope John or you can kindly comment on these aspects.

Sign in to comment.

More Answers (2)

If your bounds are the same for all elements, then use randfixedsum. It is amazingly efficient, even for high dimensional problems. The number of dimensions is N in this sort of problem.
For example, to find 3 vectors, of length 10, such that the samples are uniformly distributed over the set of possible solutions, with a sum of 1, lower bound of 0, and upper bound of 1...
X = randfixedsum(10,3,1,0,1)
X =
0.11264 0.16122 0.1933
0.090028 0.13915 0.076298
0.10823 0.02528 0.024352
0.15859 0.31793 0.028457
0.062117 0.09639 0.056485
0.013329 0.051958 0.025118
0.10416 0.058997 0.3023
0.010921 0.021139 0.11247
0.21689 0.0064835 0.040133
0.12309 0.12146 0.14109
sum(X,1)
ans =
1 1 1
As I said, randfixedsum is quite efficient. The time to find 500 vectors, each of length 1000, is about as good as you can possibly hope for.
timeit(@() randfixedsum(1000,500,1,0,1))
ans =
0.062201
So only .06 seconds for the call. Find randfixedsum on the file exchange, here:
A problem arises when your bounds are not the same for all elements of the resulting vectors, or if the linear combination chosen is not just a simple sum. In these cases, as long as N is not too large, use randFixedLinearCombination.
For example, suppose we want to find vectors of length 5, such that the sum is also 1, but where the elements have upper bounds given by the vector [.1 .2 .3 .4 .5]?
X = randFixedLinearCombination(3,1,ones(1,5),0,[.1 .2 .3 .4 .5])
X =
0.034225 0.15026 0.21493 0.27429 0.32629
0.091852 0.049101 0.27878 0.37448 0.20579
0.032732 0.086649 0.23589 0.36987 0.27486
sum(X,2)
ans =
1
1
1
So the first element lives in [0, 0.1], the 5th element lives in [0, 0.5], etc.
randFixedLinearCombination is not as efficient when the dimension N gets large however.
For example, it is extremely good in a low number of dimensions:
timeit(@() randFixedLinearCombination(3,1,ones(1,10),0,1))
ans =
0.00069653
>> timeit(@() randFixedLinearCombination(3,1,ones(1,15),0,1))
ans =
0.0067253
>> timeit(@() randFixedLinearCombination(3,1,ones(1,20),0,1))
ans =
0.39572
But as I start to push things a bit, say 20 dimensions, things get slower.
Find randFixedLinearCombination here:
A virtue of both codes is they assure a result that is uniformly distributed over the domain of interest.
Be very careful however, as many solutions are not so good, especially ones that seem trivial. For example, you might consider the classic rescaling solution, something that someone on the site is always fast to propose.
X = rand(1,3);
X = X./sum(X)
X =
0.55967 0.0077895 0.43254
It seems REALLY easy to do. And I will admit, it is fast. One problem is, the result is a set of vectors that are not uniformly distributed over the domain of all possible solutions. Worse, they may not satisfy the bound constraints as required. So, if we try plotting the result of 10000 such samples, we would see...
X = rand(10000,3);
X = X./sum(X,2);
plot3(X(:,1),X(:,2),X(:,3),'.')
grid on
box on
The result is not uniformly distributed over that set. As you can see, the samples are considerably less dense in some regions of the implied constraint plane.
randfixedsum and randFixedLinearCombination have no such problem however.
Y = randFixedLinearCombination(10000,1,ones(1,3),0,1);
plot3(Y(:,1),Y(:,2),Y(:,3),'.')
grid on
box on
As I said, another problem is rescaling can mess up the bound constraints. That is never a problem with the two tools I have recommended however.

15 Comments

Hi John,
In my last algorithm (please see above), I find a particular solution inside of the bounding box, and then find random points satisfying the bounding box constraints by adding appropriately scaled random vectors in the null space of [1 1 1 1 .. 1]. I just tried my algorithm to compare to your example of
randFixedLinearCombination(3,1,ones(1,20),0,1)
which you said took about 0.4 seconds.
I used my code to find 3 vectors of length 20, each of which sums to 1, and whose components are between 0 and 1, which I think is a comparable problem. I find it only takes about 0.002 seconds (on a good laptop), so I don't think it suffers too much from the high dimensionality. Actually, I get similar execution times even if the feasible region is much smaller, for example if I have the same bounds and require the components to sum to 19.
If you're interested to try it out, or to look at the correctness of the algorithm, I would value hearing about your experience. The one subtle point that I'm not sure how to assess is the extent to which the points I generate are uniformly distributed. Intuitively, it seems like they would be, but for really assessing this I am out of my depth. Note however that I don't just use a simple rescaling as you discuss in your post.
John, thank you very much for your response. In my problem, I require vectors longer than 20 dimensions. I hope you can kindly look at the solution proposed by Jon, both for the randomness of the output and the formal math behind the technique.
Jon - Time is very dependent on what is running in the background, of course. In fact, (sheepish admission) I was probably running Python flat out on the side, searching for huge prime numbers from a special family at the same time as I made that test. I'm midway through a 10 hour run right now, and I am often found to be running both MATLAB and Python engines flat out on my system. So my time estimate may have been unthinkingly wrong.
A uniform distribution in a high number of dimensions is difficult to assess. It is far easier to assess when you can plot things, as I did in 3-d. And it much depends on how much you need uniformity. But remember that a strongly non-uniform sample can be a problem, since it will result in parts of the sample space being heavily under-represented. Depending on what you are doing, it may be important, or just a so what? Are you doing a Monte Carlo integration (or something that is essentially equivalent to that)? If so, then non-uniformity can result in a biased estimate. Sometimes, you just want a few sets of numbers that look sort of random, but do need to satisfy some important property.
True uniformity is, as I said, sometimes difficult to predict. For example, we can visualize why the simple rescaling solution fails for the sum == 1 case: we sample initially from a unit hypercube using rand, but then project the samples down to a plane. Unfortunately, the plane we project onto is not a "good" choice for uniformity.
xy = rand(1000,2);
plot(xy(:,1),xy(:,2),'.')
hold on
line([0 1],[1 0])
axis equal
axis tight
The projection here is one that projects each point onto the diagonal line. As you can see, there are far fewer points that will yield a result near the ends of the line, than there are points that will land near the center. So now if we look at the distribution of the projected points along that line, it is not even remotely uniform.
I'll redo the sample here to be much larger so we get a better plot. It is simple here just to view the marginal distribution for x only:
xy = rand(1000000,2);
xyc = xy./sum(xy,2);
hist(xyc(:,1),1000)
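The same conclusion can be reached without plotting at all. A small Python simulation of the same experiment (the window widths are my own choice): after rescaling pairs of uniforms to sum to 1, the first coordinate piles up near 0.5 and thins out near the ends.

```python
import random

rng = random.Random(0)
center = edge = 0
n = 100000
for _ in range(n):
    x, y = rng.random(), rng.random()
    z = x / (x + y)  # first coordinate after rescaling the pair to sum to 1
    if 0.45 < z < 0.55:   # window around the middle of [0, 1]
        center += 1
    elif z < 0.10:        # equally wide window at the low end
        edge += 1
```

Both windows have width 0.1, yet the center one catches roughly three times as many samples as the edge one, which is exactly the peaked histogram shown above.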
However, suppose we just want to sample points that lie uniformly on the surface of a unit hyper-sphere in n-dimensions? Now the best scheme really is to do a simple re-scaling of a Gaussian sample.
nsamples = 1000;
n = 3;
X = randn(nsamples,n);
X = X./sqrt(sum(X.^2,2));
This works of course because the multi-variate Gaussian PDF is symmetric around the origin. But many people do not appreciate the difference, nor do they always care. ;-) The nice thing is, this is easy to show, but subtleties in a lengthy code can easily complicate the question.
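For contrast with the MATLAB snippet above, here is the same Gaussian-rescaling idea in plain Python (the function name is mine); every rescaled point lands exactly on the unit sphere, in any dimension:

```python
import math
import random

def sample_sphere(n, nsamples, seed=0):
    # Draw standard normals and rescale each point to unit length; this is
    # uniform on the sphere because the multivariate Gaussian density is
    # spherically symmetric about the origin.
    rng = random.Random(seed)
    pts = []
    for _ in range(nsamples):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        r = math.sqrt(sum(v * v for v in x))
        pts.append([v / r for v in x])
    return pts

pts = sample_sphere(5, 1000)
```

Note that this rescaling is legitimate precisely because of the symmetry argument above, unlike the sum-rescaling criticized earlier.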
@Saifullah - Really, we need to know if the difference is worth the time and energy spent to verify if the distribution is uniform. If you are willing to accept something that looks good enough for the task at hand, then there is no need to look further. It is crucial to know the goals of any programming task.
Jon gave you code that does produce a random result. I can fairly easily prove that the result of the code I wrote is uniform in n-dimensions, and I think I recall convincing myself of that for Roger Stafford's code, long ago. But if you want to show that what Jon did is valid in n-dimensions, a good start would be to look at what it produces in 2 or 3 dimensions. Does a low-dimensional case at least appear to be reasonably uniform? You can see the difference in the examples I've given in 3-dimensions. For 1e6 samples from randFixedLinearCombination (or randfixedsum, for that matter) we see a nice flat histogram.
X = randFixedLinearCombination(1000000,1,ones(1,2),0,1);
hist(X(:,1),1000)
So I would start with a simple case to at least give yourself a warm fuzzy feeling. (Mathematics for me is often preceded by a careful use of MATLAB to give myself that warm fuzzy, before I ever invest the energy to do any real thinking.) Does the 2-d or 3-d case result in something that at least looks reasonable? If not, AND that is important to you, then you need to spend the time to use a better method, or to improve the code proposed by Jon.
Hi John,
Thank you so much for your detailed reply.
I did a quick look as you suggested using lower dimensional cases, and it would at least initially appear that my results are not uniformly distributed over the portion of the plane that satisfies the constraints.
Here are the results for
numNeeded = 10000; % number of random vectors that are needed
Etotal = 3; % components must sum to this value
d = [2; 0.5; 1;]; % upper bound on components, lower bound is zero for all components
The points seem denser near the particular solution that provides the local origin for generating my samples. (If you look closely you can see a small red * where the particular solution is located.)
So it seems that my approach generates points that satisfy the constraints, but maybe does not provide uniform sampling.
I don't know how important this is for Saifullah's application, but in any case it is disappointing to me, as I hoped I had found a nice way to produce uniform samples. I'm not sure if the approach could be modified to improve this, or if it is a fundamental limitation. I could think about it a little more, but at the moment I don't really have a good grasp on what it is about how I'm generating the points that leads to the non-uniformity. If you had time to look and had any insights, I would certainly be appreciative.
Hmm. I took a quick read through your code. One compliment I'll offer: you used good comments, sufficient to make your code readable, so that I can easily see what you are doing. That, combined with the picture you provide, is enough for me to see where you went wrong. I'm happy you took my advice about a low-dimensional test. This can often give some intuition into the problem, as I think it did. And here it does convince me that your scheme is non-uniform, as I wondered it might be.
The discussion will take my usual lengthy explanation, done later today after I return from my bridge game. Life must go on, after all. ;-) But I might then even be able to offer a solution to make the general scheme you chose still work, and also be uniform.
Jon
Jon on 6 Mar 2020
Edited: Jon on 6 Mar 2020
Here's what I think is an improved variant of my original idea. Here I generate pairs of randomly selected points on the boundary of the intersection, and then find a randomly selected point that is a convex combination of these two points. Please try the attached file if you have a chance.
In the earlier approach I first found a particular solution, x0, that was in the interior of the bounding box. I would then find a vector, v, with a random direction that was in the hyperplane satisfying the sum constraint. I would then randomly scale v (using a uniformly distributed length) so that when added to x0 it would be between x0 and the boundary. I think the problem with this regarding uniformity is that for directions where the boundary is close, a randomly chosen point will be more likely to be near the origin than for directions where the boundary is far away. So in the end I would have more points close to the origin (x0) in the directions that were closer to the boundary.
In the modified approach, I generate a pair of randomly selected points on the boundary and then pick (using a uniform distribution) a point that is "between" these. Intuitively this seems like it should be better since we sample lines that cross all over the sampled region, some long some short. I think by eye it looks better, but this uniformity concept seems to be quite subtle, so maybe this approach also has shortcomings.
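To make the pairs-of-boundary-points idea concrete, here is a hedged pure-Python sketch of it (this is my own illustration, not the attached MATLAB file, and it does not yet exclude the second point from landing on the same boundary): pick two random boundary points of the hyperplane-box intersection, then a uniform point on the segment between them.

```python
import random

def sample_pairs(d, Etotal, nsamples, seed=0):
    rng = random.Random(seed)
    N = len(d)
    # particular solution on the hyperplane, inside the box (as derived earlier)
    e0 = [Etotal / sum(d) * di for di in d]
    # a basis for the null space of [1 1 ... 1]: adjacent-difference vectors
    basis = []
    for k in range(N - 1):
        v = [0.0] * N
        v[k], v[k + 1] = 1.0, -1.0
        basis.append(v)

    def boundary_point():
        # random direction lying within the hyperplane (sums to zero)
        coef = [rng.gauss(0.0, 1.0) for _ in basis]
        w = [sum(c * b[i] for c, b in zip(coef, basis)) for i in range(N)]
        # largest step alpha > 0 keeping e0 + alpha*w inside 0 <= e <= d
        alpha = min((di - x) / wi if wi > 0 else -x / wi
                    for x, wi, di in zip(e0, w, d) if wi != 0.0)
        return [x + alpha * wi for x, wi in zip(e0, w)]

    samples = []
    for _ in range(nsamples):
        p1, p2 = boundary_point(), boundary_point()
        t = rng.random()  # uniform point on the segment from p1 to p2
        samples.append([a + t * (b - a) for a, b in zip(p1, p2)])
    return samples

E = sample_pairs([2.0, 0.5, 1.0], 3.0, 500)
```

Every sample satisfies the sum constraint and the box bounds by construction, since the box-plane intersection is convex; whether the samples are uniform is exactly the open question of this discussion.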
I have a feeling this must be something that mathematicians have already looked at in detail, and I'm probably somewhat blindly trying to reinvent the wheel here. Looking into this literature is a little daunting though, and I'm just looking at the problem because it intrigues me. I'm also not sure how rigorous the original OP needs the results to be, and what is good enough. Anyhow, I'd certainly be happy to hear your ideas.
Once again I am back, filled with gratitude for both of you for being exceptionally considerate and helpful. As far as the requirement of randomness in my original problem is concerned, I think a good-enough approach will probably be OK, as there is another algorithmic component that can help with it to some extent. Also, I think all of this discussion can help anybody dealing with similar issues.
Hi Saifullah, well hopefully the last approach attached above (hboxPlaneSampling.m) will be helpful. Anyhow, you can give it a try. Maybe John D'Errico will have some further suggestions when he gets back.
Realized one issue with the above code. The pair of points may intersect the same boundary. In this case the line connecting them runs along the boundary. Since a sample is generated at a random point along this line, it will also be on the boundary. This results in some oversampling of the boundary. Instead of using a pair of random points on the boundary, I could find the first point on the boundary as before, but then to find the second point on the boundary I could look along a direction perpendicular to the first, so I wouldn't hit the same boundary.
Yes, Jon, I also realized the same issue, that is, too many points fall on the boundaries. I am evaluating a scheme where I produce random points using the randfixedsum routine. In the first step, I generate a fixed sum that is different from Etotal, and then I use a minimum norm solution to do a sort of back-filling, and thus reach out to the boundaries and fulfill the sum constraint. So far the results are encouraging; if they stay good, I will share the details shortly.
OK Here is an improved version that eliminates the oversampling of the boundaries. Please give this a try. In the new approach I still generate a random pair of points at the boundary and then select a random point on the line that connects them, but now I make sure the second point is not on the same boundary.
Unfortunately, I think I can see that there may be somewhat of a bias towards having points that are near the boundaries that are closest to the particular solution. When you do a random "spin" to get the first point, it is more likely to point at a boundary that is close to the local origin (the particular solution), as the "included angle" will be bigger for hitting that boundary than one that is far away. The second point, which I don't allow to be on the same boundary, is then most likely to be on the next nearest boundary. The sampled point is then generated on the line that connects these. So I think more of the points will be located between the two boundaries that are closest to the local origin. Oh well. Sorry, I guess this problem is really challenging. I probably shouldn't post any more until/unless I have really thought it through. I will leave it for now.
In any case, here are the results using
numNeeded = 10000; % number of random vectors that are needed
Etotal = 3;
d = [2; 0.5; 1;]; % upper bound on components
tic
[E,x0] = hboxPlaneSampling03(numNeeded,d,Etotal);
OK, let's see first what the general problem requires.
You want to find a uniformly distributed random point subject to a constraint on the sum of its elements, where the elements are sampled randomly from within specified ranges, which are not all the same. (A lot easier if the lower and upper bounds are the same for all elements, as then Roger's randfixedsum really is the best solution, since it does yield a uniform sample on the domain of interest.)
So first, let's look at the geometry of the problem, in the N-dimensional space of interest. N here is the number of elements in the vector. At best, I'll need to give examples in 2 or 3 dimensions, since I can't plot more than that, and then I'll hope to provide intuition about what is happening in high dimensions.
You can think about the bound constraints as creating a box in N dimensions. If the bounds are all the same, then the box is just a hypercube. But with different bounds for each variable, the box converts to a hyper-rectangle. I suppose you could transform the problem so the hyper-rectangle maps into the unit hypercube, but then the sum constraint transforms into a more general linear combination of the variables, so nothing is really any less complex.
In N dimensions, a simple linear constraint reduces to a hyperplane. I would call it a planar (N-1)-manifold embedded in the N-dimensional space. Thus, in 2 dimensions, the sum constraint reduces to a line. The lower and upper bounds reduce that further into a line segment. Moving up to 3 dimensions, the sum constraint forces all points to fall on a plane in the R^3 space of the elements, so now a 2-manifold, embedded in that R^3 space. I'll plot a few examples, just to give an idea of what is happening. Consider a box with bounds as
0 <= x1 <= 0.5
0.25 <= x2 <= 1
1 <= x3 <= 2
Next, I'll show the rectangular box of interest, combined with the domains of 4 slices through that box, representing 4 different possible sum constraints. (I've used several tools to do this, not all of which are available to others. Sorry about that, but the lack thereof is not relevant to this discussion.) The specific sums I chose were 1.5 (purple), 2 (blue), 2.4 (green), and 3 (yellow).
As you should see, each constraint region is planar. They are in fact parallel planes, though that may not be obvious from this viewpoint.
What is pertinent to this discussion is they are polygonal regions, having a variety of numbers of sides, depending on where the plane intersects the 3-d rectangular region. That aspect is important to the discussion. In fact, I would describe the planes as 2-manifolds, embedded in the 3-d space of the box. The boundary polygon of the regions is a closed 1-manifold, a piecewise linear line or polygon. This is something you need to appreciate, because it will be important when we next talk about sum constraints in 4 or 5 or 20 or 100 dimensions.
For the 3-d box shown above, our goal would be to sample the regions displayed uniformly. That in itself is not too difficult. I do it in the code for randFixedLinearCombination. (You may decide the scheme I used in there is of interest. I think the code is reasonable, well documented, and easy enough to read. Note that, surprisingly, the code may seem a little out of date, since it uses histc instead of discretize, and bsxfun instead of the simpler scalar dimension expansions that MATLAB now allows. One thing that is pertinent are my comments in the code where I explicitly explain why the code would be computationally intractable for large numbers of dimensions, certainly on the order of 40 or larger. Of course, now that I look at my own code, I see one aspect where I could have improved the efficiency just a smidge more. Oh well - add that to my list of things to do.)
Regardless, next, suppose we have a problem where there are more or fewer dimensions? In two dimensions the problem is simple. We have a rectangle now, and the sum constraint results in a line segment, thus a bounded 1-manifold, in a 2-d space. The boundaries of the 1-manifold are a pair of points, the ends of the line segment, which we might think of as a 0-manifold. Our problem here reduces to sampling uniformly along the line segment, thus a relatively trivial task.
In this next figure, you can see the box [0,0.5] X [0.25,1]. I've added line segments there to show the sum constraints, corresponding to sums of 0.5 (black), 0.75 (blue), 1 (yellow, with black dots) and 1.25 (green).
Here you should see the lines are indeed parallel to each other. (In theory, they would all lie at 45 degrees in the figure, except the differential axis scaling confuses that issue.)
The important thing to understand is the box lives in 2-dimensions. The sum constraint reduces that to a 1-manifold, thus a line segment, which is bounded at the ends by a pair of points. What matters from the 2-d and 3-d cases is to see that they are related, that we can now move forward to the 4-d rectangular box problem. Sadly though, I cannot plot most of what happens, and you are now forced to rely on intuition.
Thus, suppose we wish to sample inside a 4-d hyper rectangular box forming the bounds, along with a user specified constraint on the sum? The constraint implies a hyper-planar subset in the 4-d space, so a 3-manifold. The boundaries of the planar region will be a 2-manifold. We can think of it as a polyhedron of sorts. It would look roughly like a soccer ball (ok, a football, to those who do not live in the US) with planar 2-d polygonal facets. The exact shape would depend upon the value of the sum, much like the shape of the polygonal region changed as we changed the sum for the 3-d case.
As an example, remember that I cannot plot the result! (Sorry. Well, I could plot it, but my hyper-dimensional monitor is broken, and will be until Starfleet command sends the necessary parts.) Consider the box defined as:
0 <= x1 <= 0.5
0.25 <= x2 <= 1
1 <= x3 <= 2
0.5 <= x4 <= 1.5
Again, a 4-d hyper-rectangle, so impossible to visualize easily. The best I can do is show the resulting polyhedron, plotted in the now 3-dimensional subspace of the hyperplane, for the above box with the sum set to 3.2.
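Even without a picture, we can still draw uniform samples from that set for a moderate number of dimensions. Here is a hedged sketch (in Python for illustration; the function is my own, not the code under discussion): draw the first N-1 coordinates uniformly in their bounds, solve the last coordinate from the sum, and reject if it falls outside its own bounds. The map from the free coordinates to the hyperplane slice is linear, so accepted points are uniform on the slice; the acceptance rate, however, collapses as N grows, which is the intractability issue mentioned above.

```python
import random

def sample_fixed_sum(lo, hi, s, rng=random, max_tries=100000):
    """Uniform sample from {lo <= x <= hi, sum(x) = s} by rejection.

    Draw the first N-1 coordinates uniformly inside their bounds, set the
    last coordinate from the sum constraint, and accept only when it also
    lands inside its own bounds. Uniform on the slice, but slow for large N.
    """
    n = len(lo)
    for _ in range(max_tries):
        x = [rng.uniform(lo[i], hi[i]) for i in range(n - 1)]
        last = s - sum(x)
        if lo[-1] <= last <= hi[-1]:
            return x + [last]
    raise RuntimeError("acceptance rate too low; too many rejections")

# The 4-d box above, with the sum fixed at 3.2:
lo = [0.0, 0.25, 1.0, 0.5]
hi = [0.5, 1.0, 2.0, 1.5]
point = sample_fixed_sum(lo, hi, 3.2)
```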
I want you to visualize planar polygonal facets here, with varying numbers of sides for each facet. Sadly, the tool I used to plot this shows those polygonal facets in a triangulated form. I hope that is not confusing, but I'm feeling too lazy to remove the triangulations of the facets. The point is to recognize that the goal is now to sample the interior of that volume in a uniform way.
The plot above is about the limit of where I can go graphically. However, in 20 or 100 dimensions, or generally in N dimensions, the above ideas still hold. The sum constraint reduces the solution space to a planar (N-1)-manifold. The boundary of the region, thus where the constraint plane crosses the box of interest, will be a closed, piecewise planar (N-2)-manifold. Yes, it is difficult to visualize, but think of it as a high dimensional "polyhedron" that itself lives in a hyper-planar subspace of the original box. You might call it an (N-1)-dimensional blob, or potato, but one that has flat polytope facets in that subspace. Regardless, that is the set one needs to sample from, the goal being to do so uniformly, so that the density of samples in every part of that subset will be uniform, given a sufficiently large sample size.
I hope that explains the problem under discussion here. The way I chose to solve the problem in randFixedLinearCombination is to triangulate the region of interest, and then sample uniformly from the tessellated sub-domain. (A problem which is not that difficult to solve, which explains the algorithmic decision I chose.) The issue which is relevant here is that my solution will fail when the problem moves into too many dimensions, so Jon wants to solve it using a different, more ad hoc scheme. The problem then becomes to find a scheme that would result in a quasi-uniform sampling of the region of interest, yet also be computationally viable in a high number of dimensions.
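The per-simplex half of that triangulate-then-sample scheme is simple to sketch (in Python for illustration; the function is my own, and picking which simplex to sample would additionally be weighted by simplex volume, which I omit). The spacings between k sorted uniform draws are flat-Dirichlet barycentric weights, which yields a uniform point inside a simplex:

```python
import random

def sample_in_simplex(vertices, rng=random):
    """Uniform point inside the simplex spanned by k+1 vertices.

    The spacings between k sorted Uniform(0,1) draws (padded with 0 and 1)
    are Dirichlet(1,...,1) distributed, i.e. uniform barycentric weights.
    """
    k = len(vertices) - 1
    cuts = sorted(rng.random() for _ in range(k))
    w = [b - a for a, b in zip([0.0] + cuts, cuts + [1.0])]  # sums to 1
    dim = len(vertices[0])
    return [sum(w[i] * vertices[i][d] for i in range(k + 1))
            for d in range(dim)]

# Uniform points in the triangle with vertices (0,0), (1,0), (0,1):
p = sample_in_simplex([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

The expensive part in high dimensions is not this sampling step, but computing the triangulation of the (N-1)-dimensional region in the first place.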
Sigh. I'll post this as a comment, and next, I'll try to take a more serious look at what Jon has tried, in context of what I've explained above. At least I can try to explain why what he has done fails to be uniform, and perhaps to offer some vague idea of how it might be improved.
Hi John, Thanks so much for your really in-depth, and beautifully illustrated, explanation of the underlying geometry of the problem. For some reason I have become really captivated by this problem, although I don't have any direct application for it. It was really great to find that this morning, as the day before I had spent a somewhat distracted afternoon hiking with my dogs trying to visualize all of this in higher dimensions. So your discussion was particularly appreciated. Actually, up until thinking about all of this, I had always just thought of hyperplanes as a natural generalization of planes, and imagined them as infinitely thin sheets (like a plane in 3-d or a line in 2-d), but now I am realizing that they are, in some sense, quite "fat", being only one dimension less than the space they are embedded in. I know that I am going off on a tangent here, but it makes me wonder what is really the defining nature of a plane; maybe that it has a direction perpendicular to it that is not in the set, or that it divides the overall space into two halves, one on each "side" of the plane.
I also just looked briefly at the following relevant link: https://mathoverflow.net/questions/76255/random-sampling-a-linearly-constrained-region-in-n-dimensions, and in case you were interested, ran down the references that it linked to, some of which were broken, and attached them here. The concept in all of those is sampling the bounded region through random walks, which seems quite interesting. I will also say that although I understand the very big picture concept, the details of those papers are well beyond what I could follow.
I also attached an interesting one on the hyperbox-hyperplane intersection problem.
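For intuition, the random-walk idea in those references (hit-and-run, in particular) is easy to sketch, though the actual algorithms and mixing analyses in the papers are far more careful. In Python, with the helper entirely my own: pick a random direction with zero component sum (so the sum constraint is preserved), find how far the current point can move either way before leaving the box, and jump to a uniform point on that chord. Repeated many times, the chain approaches a uniform distribution over the slice.

```python
import math
import random

def hit_and_run_step(x, lo, hi, rng=random):
    """One hit-and-run step on the set {lo <= x <= hi, sum(x) fixed}."""
    n = len(x)
    d = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(d) / n
    d = [di - m for di in d]  # project the direction onto sum(d) == 0
    if all(abs(di) < 1e-12 for di in d):
        return list(x)  # degenerate draw (probability ~0): stay put
    # Intersect the line x + t*d with the box to get the feasible chord.
    t_lo, t_hi = -math.inf, math.inf
    for xi, di, a, b in zip(x, d, lo, hi):
        if abs(di) < 1e-15:
            continue
        t1, t2 = (a - xi) / di, (b - xi) / di
        t_lo = max(t_lo, min(t1, t2))
        t_hi = min(t_hi, max(t1, t2))
    t = rng.uniform(t_lo, t_hi)
    return [xi + t * di for xi, di in zip(x, d)]

# Walk on the 4-d example box, sum fixed at 3.2, from a feasible start:
lo = [0.0, 0.25, 1.0, 0.5]
hi = [0.5, 1.0, 2.0, 1.5]
x = [0.2, 0.5, 1.5, 1.0]  # inside the box, and sums to 3.2
for _ in range(100):
    x = hit_and_run_step(x, lo, hi)
```

Unlike rejection or triangulation, each step here is cheap even in very high dimensions; the open question is how many steps are needed to mix well, which is what those papers analyze.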
I wanted to let you know, too, that in my suggested ideas for solving this problem I wasn't in any way suggesting that your approach wouldn't be the way to go as long as the dimensionality was not limiting. In generating some ideas, I very optimistically thought I could come up with a nice clean way to do this that would not suffer too much from the "curse of dimensionality". I am now quite humbled by the subtleties of the problem.
Thanks again for all of the time you have put into this. As I said, the topic somehow really grabbed me, and your explanations have been really enlightening.
Now that I know some better terms to search under, I found this in the MATLAB file exchange from Tim Benham. Haven't had a chance to try it but perhaps it might be interesting. Looks like it uses the random walk approach (actually has options for various specific methods) https://www.mathworks.com/matlabcentral/fileexchange/34208-uniform-distribution-over-a-convex-polytope
Jon - I'm glad you have found the ideas here to have been stimulating. There are often a variety of ways to solve any problem, with some schemes better for large or small problems. Sometimes that requires an understanding of widely differing areas of mathematics, seeing how you can bring tools from a variety of domains to bear on the problem at hand. And it seems you can almost always find some new way to tweak an algorithm, getting an extra boost.
If you do come up with a nice code, then don't forget to post it on the File Exchange. Spend some time, if you can, to test the properties of that scheme. Is it uniform, or perhaps just very close? Provably so, or is the algorithm a bit too ad hoc to admit any direct proof? How fast is it? Does it perform especially well in a high number of dimensions, or is there some sort of a rough limit on that?


Hi John, In case I have a chance to work on this some more, and for future reference, do you have any advice on quantitative tests that can be made to assess to what extent a collection of high-dimensional points can be considered to have come from a uniform distribution?
In other words, if a scheme isn't uniform by construction (as yours is), is there a way to quantitatively assess how well it is doing in terms of achieving uniform sampling?
I took a quick look, but couldn't find anything quickly accessible in terms of an easily implemented formula, or better yet, something already on the File Exchange. Thanks
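One cheap diagnostic I can imagine (my own sketch, not a formal test from the literature): project both a known-uniform reference sample (e.g. generated by slow rejection) and the candidate sample onto random directions, and take the worst two-sample Kolmogorov-Smirnov distance over those projections. Large values flag non-uniformity, though calibrating a pass/fail threshold would still take some care.

```python
import random

def ks_distance(a, b):
    """Max gap between the empirical CDFs of two sorted 1-d samples."""
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif b[j] < a[i]:
            j += 1
        else:  # advance past tied values on both sides
            v = a[i]
            while i < len(a) and a[i] == v:
                i += 1
            while j < len(b) and b[j] == v:
                j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def projection_ks(sample_a, sample_b, n_dirs=20, rng=random):
    """Worst KS distance between random 1-d projections of two point sets."""
    dim = len(sample_a[0])
    worst = 0.0
    for _ in range(n_dirs):
        d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        pa = sorted(sum(di * xi for di, xi in zip(d, p)) for p in sample_a)
        pb = sorted(sum(di * xi for di, xi in zip(d, p)) for p in sample_b)
        worst = max(worst, ks_distance(pa, pb))
    return worst
```

With equal sample sizes n, values near the usual two-sample KS scale of roughly 1/sqrt(n) suggest the candidate is close to uniform, while much larger values indicate a biased sampler.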

Asked: 18 Feb 2020
Answered: Jon, 10 Mar 2020
