Does fitdist work for fitting a distribution to truncated data?

Does fitdist work for fitting a distribution to truncated data?
I ask because the mean of the fitted distribution and the mean of the truncated data look visually different (please see the following plot with the blue histogram and the red line).
Here is my example:
% create a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data = random(t,10000,1);
% fit the normal distribution to the "truncated data"
pd_fit = fitdist(data,'normal');
xgrid = linspace(0,100,1000)';
mypdf = pdf(pd_fit,xgrid);
% plot the "truncated histogram" and the fitting distribution
hold on
histogram(data,100,'Normalization','pdf')
line(xgrid,mypdf,'Linewidth',2,'color','red')
hold off
xlim([0 10])

Accepted Answer

Angelo Yeo on 19 Jun 2023
Of course it works for truncated data. You can see that the mean of the data and the "mu" of "pd_fit" are the same (for a normal fit, the estimate of mu is the sample mean).
% create a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data = random(t,10000,1);
% fit the normal distribution to the "truncated data"
pd_fit = fitdist(data,'normal');
xgrid = linspace(0,100,1000)';
mypdf = pdf(pd_fit,xgrid);
% plot the "truncated histogram" and the fitting distribution
hold on
histogram(data,100,'Normalization','pdf')
line(xgrid,mypdf,'Linewidth',2,'color','red')
hold off
xlim([0 10])
%%
mean(data)
ans = 3.7992
pd_fit.mu
ans = 3.7992
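The value 3.7992 can also be checked against theory. As a minimal sketch (the variable names `mu0`, `sigma0`, `a` and `trunc_mean` are only illustrative), the mean of a normal distribution truncated on the left at `a` is mu + sigma*phi(alpha)/(1 - Phi(alpha)), with alpha = (a - mu)/sigma. For mu = 3, sigma = 1 and a = 3 this is 3 + sqrt(2/pi), about 3.798, so the fitted "mu" reproduces the mean of the truncated sample and not the mu = 3 of the original distribution:

```matlab
% Theoretical mean of a normal distribution truncated on the left at a
mu0 = 3; sigma0 = 1;   % parameters of makedist('Normal','mu',3); sigma defaults to 1
a = 3;                 % left truncation point passed to truncate
alpha = (a - mu0)/sigma0;
% normcdf(...,'upper') returns 1 - Phi(alpha), the mass kept after truncation
trunc_mean = mu0 + sigma0*normpdf(alpha)/normcdf(alpha,0,1,'upper');
fprintf('theoretical mean of the truncated distribution: %.4f\n', trunc_mean)
```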
  2 Comments
Sim on 19 Jun 2023
Edited: Sim on 19 Jun 2023
Thanks a lot to both @Angelo Yeo and @Ayush Kashyap!!
Sorry, I have another doubt, since I got a bit confused (I can open another thread/question if you deem it appropriate).
I was thinking that someone who needs to infer the distribution that most likely fits their data would probably look for a distribution that fits the entire set of data nicely (and not just the truncated part), even when only the truncated part is available, as in this case.
Then, wouldn't it be something like in this figure? (Please see the blue line representing the fitted distribution, as if we had the entire dataset available.)
Since we only have the truncated part of the data and not the entire set, I am not sure that fitdist could reproduce something similar to the blue line in the above-mentioned figure (or whether it would even make sense!).
Below I compared the usage of fitdist for a "full dataset" and a "truncated dataset", but I was not able to reproduce the blue line of the above-mentioned figure from the truncated data:
% from a normal probability distribution, i.e. "makedist('Normal','mu',3)",
% create:
% (i) a "full dataset" and
% (ii) a set of "truncated data"
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data_full = random(pd,10000,1);
data_trunc = random(t,10000,1);
% fit the normal distribution to
% (i) the "full dataset"
% (ii) the set of "truncated data"
pd_fit_full = fitdist(data_full,'normal');
pd_fit_trunc = fitdist(data_trunc,'normal');
% plot
% (i.a) the "histogram of the full dataset" (from the "full dataset")
% (i.b) the density function corresponding to the distribution that fits the "full dataset"
% (ii.a) the "truncated histogram" (from the "truncated data")
% (ii.b) the density function corresponding to the distribution that fits the "truncated histogram"
xgrid = linspace(0,100,1000)';
hold on
histogram(data_full,100,'Normalization','pdf','facecolor','blue')
line(xgrid,pdf(pd_fit_full,xgrid),'Linewidth',2,'color','blue')
histogram(data_trunc,100,'Normalization','pdf','facecolor','red')
line(xgrid,pdf(pd_fit_trunc,xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 10])
Angelo Yeo on 20 Jun 2023
It's not possible to get the blue curve in the picture below using only the truncated data.
From what feature could the computer infer the parameters of the curve?


More Answers (1)

Ayush Kashyap on 19 Jun 2023
Indeed, `fitdist` can be used to fit a statistical distribution to truncated data.
It is valid to fit a distribution to truncated data if we believe that the underlying data follow a specific distribution, but values below or above certain limits are not observed.
In your example, you generate random data from a normal distribution truncated on the left at 3, and then fit a normal distribution to it. This is a valid approach if you believe that the underlying distribution of your data is normal, but that values below 3 are not observed.
The visual difference between the fitted distribution and the truncated data that you observe in the plot could have several causes:
  • The truncation has shifted the mean, so the mean of the truncated sample is a biased estimate of the mean of the underlying distribution.
  • The normal distribution may not be a good fit for your data, which could lead to differences in the mean and other parameters of the distribution.
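The second possibility can be checked directly. A minimal sketch, assuming the Statistics and Machine Learning Toolbox is available: `lillietest` tests the null hypothesis that a sample comes from a normal distribution with unspecified mean and variance. The seed and the variable names are only for illustration.

```matlab
rng(1)                                   % fixed seed, only for reproducibility
pd = makedist('Normal','mu',3);
t = truncate(pd,3,inf);
data = random(t,10000,1);
% Lilliefors test: h = 1 means normality is rejected at the 5% level.
% A warning about p being outside the tabulated range may be shown.
[h,p] = lillietest(data);
fprintf('h = %d, p = %.3g\n', h, p)
```

With a sample this large and a hard cutoff at 3, the test should reject normality, which agrees with the visual mismatch in the plot. `kstest` and `chi2gof` are other goodness-of-fit tests in the same toolbox.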
To address this issue:
  • You could consider using alternative distributions that account for truncation or have heavier tails.
  • You could also compare the fit of the normal distribution to alternative distributions using goodness-of-fit tests or other evaluation metrics.
  • Additionally, you could consider using methods that are specifically designed for fitting distributions to truncated data, such as the method of moments or maximum likelihood estimation with modified likelihoods that account for truncation.
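As a minimal sketch of the last suggestion, assuming the truncation point (3 here) is known and the untruncated data are normal, `mle` can maximize a likelihood built from the truncated normal density. The handle name `pdf_truncnorm`, the seed and the starting values are illustrative choices, not something taken from this thread.

```matlab
rng(1)                                   % fixed seed, only for reproducibility
xTrunc = 3;                              % known left truncation point (assumption)
data = random(truncate(makedist('Normal','mu',3),xTrunc,inf),10000,1);

% Density of a normal truncated on the left at xTrunc:
% the normal pdf divided by the probability mass above the truncation point
pdf_truncnorm = @(x,mu,sigma) ...
    normpdf(x,mu,sigma) ./ normcdf(xTrunc,mu,sigma,'upper');

start = [mean(data), std(data)];         % rough starting values
est = mle(data,'pdf',pdf_truncnorm,'Start',start,'LowerBound',[-Inf 0]);
mu_hat = est(1); sigma_hat = est(2);

% A distribution that accounts for truncation, built from the estimates
pd_under = makedist('Normal','mu',mu_hat,'sigma',sigma_hat);  % underlying normal
pd_trunc = truncate(pd_under,xTrunc,inf);                     % truncated version

% Overlay the fitted truncated density on the histogram of the data
xgrid = linspace(0,10,1000)';
hold on
histogram(data,100,'Normalization','pdf')
line(xgrid,pdf(pd_trunc,xgrid),'Linewidth',2,'color','red')
hold off
xlim([0 10])
```

The estimates should land near mu = 3 and sigma = 1, although they are noisier than a fit to untruncated data would be, because only the upper half of the distribution is observed. `pd_under` is the estimate of the distribution before truncation, and `pd_trunc` is the one to compare against the histogram.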
  1 Comment
Sim on 19 Jun 2023
Edited: Sim on 19 Jun 2023
Thanks a lot for your comment @Ayush Kashyap!
About your comments:
"you could consider using alternative distributions that account for truncation or have heavier tails."
Could you give an example of a distribution that accounts for truncation?
"You could also compare the fit of the normal distribution to alternative distributions using goodness-of-fit tests or other evaluation metrics."
Could you name a few goodness-of-fit tests or other evaluation metrics?
"Additionally, you could consider using methods that are specifically designed for fitting distributions to truncated data, such as the method of moments or maximum likelihood estimation with modified likelihoods that account for truncation."
Could you give an example of usage?
