Solving for the parameters of a distribution

8 visualizzazioni (ultimi 30 giorni)
If I know 2 percentiles, say the 1/40 and the 1/100, then how do I solve for the parameters of a distribution e.g. lognormal. Is there an in built function in matlab that does this?
e.g. say the 1/40 percentile is 217,800 and the 1/100 percentile is 1,250,000. What are the parameters of the lognormal distribution that fits through these points.
Thanks

Risposta accettata

the cyclist
the cyclist il 14 Feb 2023
Modificato: the cyclist il 14 Feb 2023
Using the formula for the CDF, from the Wikipedia page that @Torsten referenced:
syms mu sig
x1 = 217800;
x2 = 1250000;
eqn1 = erfc(-(log(x1)-mu)/(sig*sqrt(2)))/2 == 1/100;
eqn2 = erfc(-(log(x2)-mu)/(sig*sqrt(2)))/2 == 1/40;
sol = vpasolve(eqn1,eqn2,mu,sig); % solve function did not manage this symbolically, and reported vpasolve result
sol.mu
ans = 
23.385919484067596489312881405711
sol.sig
ans = 
4.7691005796632736971218010173524
% Verify
double(logncdf(x1,sol.mu,sol.sig))
ans = 0.0100
double(logncdf(x2,sol.mu,sol.sig))
ans = 0.0250
  1 Commento
the cyclist
the cyclist il 14 Feb 2023
Modificato: the cyclist il 14 Feb 2023
The 1/100 quantile must be smaller than the 1/40, by definition of the cumulative distribution function. So, I swapped the values.
Instead, maybe you meant the 99th and 97.5th percentile?

Accedi per commentare.

Più risposte (2)

John D'Errico
John D'Errico il 14 Feb 2023
Modificato: John D'Errico il 14 Feb 2023
It is rarely a good idea to try to infer distribution parameters from deep tail percentiles. Those points deep in the tails would be determined from only a few data points. As such, they are very poorly known. And that means your estimates of the distribution parameters will be highly variable, highly sensitive to noise.
Even worse,by the use of two quantiles very near to each other, you are effectively using only a very few points way into the fringes, and they are the same data points used for both quantiles.
Instead you will be far better off using a quantiles that would be closer to the mode of the distribution. For example, you might have chosen the 50 and 75% points.
Is it a great idea to use only two quantiles to try to infer the distribution parameters? Probably not. But if you are going to do something, you want to choose a scheme that is robust to noise.
As an example, I'll choose a random sampling of 1000 numbers from a lognormal distribution. Then compute several sets of quantiles, and see how well I can recover the original distribution parameters for the lognormal. Next, I'll do this operation multiple times, and we can see how robust is the operation, to infer distribution parameters from two quantiles, and see how well the computation performs, depending on the actual quantiles chosen. Surely this is a fair test.
The actual distribution parameters for the lognormal are irrelevant, since that does nothing to change such a test. As such, I'll choose a standard lognormal for my computations. A standard lognormal has distribution parameters of mu=0, sigma=1. Again, all that matters is how well I can predict those numbers.
For the computational engine, I used the code written by @the cyclist, which seems reasonable (though really slow, since it uses vpasolve). fsolve would be faster, but vpasolve will do entirely well for my purposes. First, does the code work?
X = lognrnd(0,1,[1,1000]);
P = [50 75];
X = prctile(X,P)
X = 1×2
0.9591 1.8413
[muest,sigest] = lognest(P,X)
muest = -0.0417
sigest = 0.9669
As you can see, this code seems to work. It reproduces the population parameters reasonably well.
Next, a simulation, to test the performance of the scheme, using various quantile sets. I'll choose two quantile sets, one deep in the tails, and the second using quantiles that will live where the distribution has significant mass.
Qset1 = [50 75];
Qset2 = [97.5 99]; % The set chosen by RP
Nsim = 1000; % Total number of sims to be performed. relatively small, but adequate
SimSize = 1000; % the number of sampels taken within each simulated run
muest = zeros(2,Nsim);
sigest = zeros(2,Nsim);
for i = 1:Nsim
Xsim = lognrnd(0,1,[1,1000]);
Xset1 = prctile(Xsim,Qset1);
Xset2 = prctile(Xsim,Qset2);
[muest(1,i),sigest(1,i)] = lognest(Qset1,Xset1);
[muest(2,i),sigest(2,i)] = lognest(Qset2,Xset2);
end
% Next, plot histograms of the population parameter estimates
hist(muest',100)
title('Histogram of estimates for mu')
hist(sigest',100)
title('Histogram of estimates for sigma')
In each histogram, you can see the two sets of estimates for the parameters mu and sigma. In blue are the estimates when the 50% and 75% quantiles were used to infer the lognormal parameters. In yellow, you see the estimates of those same two parameters, but this time using the 97.5% and the 99% quantiles from the data.
function [muest,sigest] = lognest(perc,x)
% Given two quantile amounts, and the locations for x of those measured
% quantiles from a sample, return the lognormal parameters mu, sigma
% perc is assumed to be scaled as a percentile, so in the interval
% [0,100].
syms m s real
eqn1 = erfc(-(log(x(1))-m)/(s*sqrt(2)))/2 == perc(1)/100;
eqn2 = erfc(-(log(x(2))-m)/(s*sqrt(2)))/2 == perc(2)/100;
[muest,sigest] = vpasolve(eqn1,eqn2,[m,s]);
muest = double(muest);
sigest = double(sigest);
end
You need ot understand that the scheme you want to use to infer the lognormal parameters using deep tail quantiles is a poor one. It is highly sensitive to random perturbations from only a few data points. As such, it is not at all robust. Worse, it will also be highly sensitive to deviations from a true lognormal distritribution, since it is deep in the tails that a lognormal distribution will be most different from some other distribution.
Finally, using only two quantiles to infer the population parameters is itself arguably a poor scheme, since you are not really using the entire set of data. A better method would be maximum likelihood estimation, where all of the data can be employed, instead of effectively only a few data points.
  1 Commento
the cyclist
the cyclist il 14 Feb 2023
Very nice analysis, @John D'Errico. For me, it brought back not-so-fond memories of when I tried to explain to finance execs why their estimates of 95th percentile were garbage, because of the small sample (typically less than N=100) used to estimate it.
FYI, I agree that fsolve was the better tool here. But I don't have the Optimization Toolbox, and didn't want to try to do the coding here instead of my local machine. :-)

Accedi per commentare.


Torsten
Torsten il 14 Feb 2023
The formula for the quantiles of the lognormal distribution as a function of p, mu and sigma is given here:
Thus 2 equations in 2 unknowns.

Prodotti


Release

R2022a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by