Goodness of fit for a Zipf distribution

11 visualizzazioni (ultimi 30 giorni)
z8080
z8080 il 30 Mar 2023
Modificato: Torsten il 4 Apr 2023
I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.
From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:
x = [1:6]';
y = [80 40 20 10 5 2]';
plot(x,y,'or'), hold on % plot data points
[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');
plot(f,'k') % plot model curve
However, this returns
rho = 28.05 (-Inf, Inf)
and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.
My questions:
1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?
2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.
3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?
Many thanks for any help!
  2 Commenti
Torsten
Torsten il 30 Mar 2023
I don't understand why you use curve fitting when you have to do distribution fitting.
Read about the difference and the adequate functions:
z8080
z8080 il 31 Mar 2023
Thanks for pointing this out: indeed, seems I was oblivious about the difference between those two, and it's really distribution fitting that I need, since my x axis is merely rank.
The documentation you pointed to suggests to me it's fitdist that I should use, however that doesn't have the distribution I need (Zipf) among its possible distname values. Any more advice?

Accedi per commentare.

Risposta accettata

Torsten
Torsten il 31 Mar 2023
Modificato: Torsten il 2 Apr 2023
I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):
If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / zeta(smax);
by
fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
zipf = 1./x.^smax / sum(1./x.^(smax));
x = [1:6]';
y = [80 40 20 10 5 2]';
X = [];
for i=1:numel(x)
X = [X,x(i)*ones(1,y(i))];
end
hold on
histogram(X,'Normalization','pdf')
%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);
s = 1.5:0.001:3;
fs = fun(s);
[~,index] = max(fs);
smax = s(index)
smax = 2.1160
zipf = 1./x.^smax / zeta(smax);
%zipf = 1./x.^smax / sum(1./x.^(smax));
plot(x,zipf)
hold off
grid on
  11 Commenti
z8080
z8080 il 4 Apr 2023
Modificato: z8080 il 4 Apr 2023
Here's my attempt so far (for some reason code highlighting doesn't work):
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
% Define the Zipf distribution
alpha = 1.5; % Shape parameter, 1.5 is apparently a good all-round value to start with
N = sum(freq); % Total number of observations
k = 1:length(x); % Rank of each observation
zipf_dist = N ./ (k.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure;
bar(x, freq); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Frequency');
legend('Observed', 'Zipf');
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* N;
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 170591.8168
fprintf('p-value = %.4f\n', p_value);
p-value = 0.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is not from a Zipf distribution.
As a sanity check, I used a textbook-Zipf rank distribution, which unfortunately doesn't pass the statistical test. If that doesn't, nothing will. Clearly something is wrong, but I don't know what. Using the Kolmogorov-Smirnoff test, or the Anderson-Darling test with a custom-built (non-normal) distribution in place of the chi-squared test does not change this.
THoughts?
Torsten
Torsten il 4 Apr 2023
Modificato: Torsten il 4 Apr 2023
Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.
To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).
After this, zipf must be defined as
zipf = 1./x.^alpha / sum(1./x.^alpha);
% Define some empirical frequency distribution
x = 1:10;
freq = 100 ./ x; % textbook zipf!
epdf = freq/sum(freq);
% Define the Zipf distribution
alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with
zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution
% Plot our empirical frequency distribution alongside the Zipf distribution
figure(1);
bar(x, epdf); % or freq\N
hold on;
plot(x, zipf_dist, 'r--');
xlabel('Rank');
ylabel('Probability');
legend('Observed', 'Zipf');
hold off
fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));
alpha = 0.5:0.001:3;
falpha = arrayfun(@(alpha)fun(alpha),alpha);
figure(2)
plot(alpha,falpha)
[~,index] = max(falpha);
alphamax = alpha(index)
alphamax = 1
% Compute the goodness of fit using the chi-squared test
expected_freq = zipf_dist .* sum(freq);
chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);
dof = length(freq) - 1;
p_value = 1 - chi2cdf(chi_squared, dof);
% Display the results
fprintf('Chi-squared statistic = %.4f\n', chi_squared);
Chi-squared statistic = 0.0000
fprintf('p-value = %.4f\n', p_value);
p-value = 1.0000
if p_value < 0.05
fprintf('Conclusion: The data is not from a Zipf distribution.\n');
else
fprintf('Conclusion: The data is from a Zipf distribution.\n');
end
Conclusion: The data is from a Zipf distribution.

Accedi per commentare.

Più risposte (1)

Walter Roberson
Walter Roberson il 30 Mar 2023
The zeta() function is 0 for all even negative integers, so zeta(rho+1) is 0 for all odd negative integers smaller than -1. You divide by that, so your function oscillates infinitely when rho is negative.
  4 Commenti
z8080
z8080 il 31 Mar 2023
Modificato: z8080 il 31 Mar 2023
Thanks, but since the task I'm aiming for is fairly simple (see how well a Zipf distribution fits a bunch of numbers), surely there's a straightforward way to achieve that in Matlab? Edge cases such as the one you demonstrated suggest, if anything, that what I want cannot be achieved, which I doubt.
In sum, if restricting rho to positives doesn't get the job done, what does get it done?
Walter Roberson
Walter Roberson il 31 Mar 2023
One possibility is that the particular example numbers you posted is not well fit by a Zipf distribution.

Accedi per commentare.

Prodotti


Release

R2023a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by