# Goodness of fit for a Zipf distribution

11 visualizzazioni (ultimi 30 giorni)

Mostra commenti meno recenti

I have several vectors of numbers and would, for each one, like to fit a Zipf distribution, and estimate the goodness of fit relative to some standard benchmark.

From here, I got the formula for the PMF/PDF of the Zipf distribution, which I used as the third argument for fit in the following example:

x = [1:6]';

y = [80 40 20 10 5 2]';

plot(x,y,'or'), hold on % plot data points

[f, gof_info] = fit(x,y,'x.^(-(rho + 1)) ./ zeta(rho + 1)');

plot(f,'k') % plot model curve

However, this returns

rho = 28.05 (-Inf, Inf)

and therefore no model curve is plotted, and I'm not sure how to then continue with the goodness of fit test.

My questions:

1) Assuming the Zipf formula above is correct and rho is the only model parameter, why is this estimated with infinite confidence intervals?

2) This suggests using a Kolmogorov-Smirnov test (with bootstrapping to check significance) to compare the data to the theoretical Zipf's law distribution. However, the Matlab function kstest seems to refer to normal distributions only, with no option to specify another distribution type (e.g. Zipf). Others suggest other tests such as Anderson-Darling, but that too refers only to normal distributions.

3) Which of the goodness of fit parameters in gof_info output (SSE, R square etc.) needs to be passed to the significance test (Kolmogorov-Smirnov or Anderson-Darling)?

Many thanks for any help!

##### 2 Commenti

Torsten
il 30 Mar 2023

I don't understand why you use curve fitting when you have to do distribution fitting.

Read about the difference and the adequate functions:

### Risposta accettata

Torsten
il 31 Mar 2023

Modificato: Torsten
il 2 Apr 2023

I set up the maximum likelihood function for your data and came up with an estimated value of s = 2.116 for the distribution parameter s of the Zeta distribution with p_s(k) = 1/k^s / zeta(s) (k=1,2,3...). I don't know if this is the discrete probability density function we are talking about or if the distribution you have in mind has finite support and is the Zipf's distribution (i.e. p_s(k) = 1/k^s / sum_{i=1}^{i=N} 1/i^s (k=1,2,...,N)):

If you want the version with finite support (Zipf's distribution) (e.g. N = 6 in your case), replace

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

zipf = 1./x.^smax / zeta(smax);

by

fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

zipf = 1./x.^smax / sum(1./x.^(smax));

x = [1:6]';

y = [80 40 20 10 5 2]';

X = [];

for i=1:numel(x)

X = [X,x(i)*ones(1,y(i))];

end

hold on

histogram(X,'Normalization','pdf')

%fun = @(s)1./(1+1./2.^s+1./3.^s+1./4.^s+1./5.^s+1./6.^s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

fun = @(s)1./zeta(s).^157 .* 1./2.^(40*s) .* 1./3.^(20*s) .* 1./4.^(10*s) .* 1./5.^(5*s) .* 1./6.^(2*s);

s = 1.5:0.001:3;

fs = fun(s);

[~,index] = max(fs);

smax = s(index)

zipf = 1./x.^smax / zeta(smax);

%zipf = 1./x.^smax / sum(1./x.^(smax));

plot(x,zipf)

hold off

grid on

##### 11 Commenti

Torsten
il 4 Apr 2023

Modificato: Torsten
il 4 Apr 2023

Why is "freq" non-integer for x = 3,6,7,8,9 ? I thought it is the absolute frequency that x is chosen.

To compare with the zipf distribution, you must calculate the empirical pdf from your (x,freq) data. It's given as epdf = freq/sum(freq).

After this, zipf must be defined as

zipf = 1./x.^alpha / sum(1./x.^alpha);

% Define some empirical frequency distribution

x = 1:10;

freq = 100 ./ x; % textbook zipf!

epdf = freq/sum(freq);

% Define the Zipf distribution

alpha = 1.0; % Shape parameter, 1.5 is apparently a good all-round value to start with

zipf_dist = 1./x.^alpha ./ sum(1./x.^alpha); % Compute the Zipf distribution

% Plot our empirical frequency distribution alongside the Zipf distribution

figure(1);

bar(x, epdf); % or freq\N

hold on;

plot(x, zipf_dist, 'r--');

xlabel('Rank');

ylabel('Probability');

legend('Observed', 'Zipf');

hold off

fun = @(alpha)prod(1./x.^(alpha*freq))./((sum(1./x.^alpha))^(sum(freq)));

alpha = 0.5:0.001:3;

falpha = arrayfun(@(alpha)fun(alpha),alpha);

figure(2)

plot(alpha,falpha)

[~,index] = max(falpha);

alphamax = alpha(index)

% Compute the goodness of fit using the chi-squared test

expected_freq = zipf_dist .* sum(freq);

chi_squared = sum((freq - expected_freq).^2 ./ expected_freq);

dof = length(freq) - 1;

p_value = 1 - chi2cdf(chi_squared, dof);

% Display the results

fprintf('Chi-squared statistic = %.4f\n', chi_squared);

fprintf('p-value = %.4f\n', p_value);

if p_value < 0.05

fprintf('Conclusion: The data is not from a Zipf distribution.\n');

else

fprintf('Conclusion: The data is from a Zipf distribution.\n');

end

### Più risposte (1)

Walter Roberson
il 30 Mar 2023

##### 4 Commenti

Walter Roberson
il 31 Mar 2023

### Vedere anche

### Categorie

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!