Cannot use 'histogram' to compute entropy
I'd like to compute the entropy of various vectors. I was going to use something like:
X = randn(1,100);
h1 = histogram(X, 'Normalization', 'Probability');
probabilities = h1.Values;
entropy = -sum(probabilities .* log2(probabilities ))
The second command however gives the error:
Undefined function 'c:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m' for input arguments of type 'double'.
But surely that's exactly what the standard Matlab function 'histogram' expects?! Doing a
which histogram
indeed returns
C:\Program Files\MATLAB\R2019b\toolbox\matlab\specgraph\histogram.m
which is the newest file (by modified date) from several of that name that (sadly) exist in my Matlab folder. I believe this should be the standard Matlab function 'histogram'.
If on the other hand in the above example I use 'hist' instead of 'histogram', I get the scalar value for entropy that I expect. However, I know 'hist' is not recommended, not least because with it one cannot specify the normalization type.
So, my question is: is using 'hist' for computing probabilities ok, or should I try something else to be able to use 'histogram' instead?
13 Comments
Walter Roberson
on 9 Sep 2021
Please show the output of
dbtype histogram 1:5
Question: does your code just happen to assign a value to a variable named histogram at some point?
z8080
on 9 Sep 2021
Bjorn Gustavsson
on 9 Sep 2021
That you get a NaN in the second variant is most likely because one or more of your probabilities are zero.
z8080
on 9 Sep 2021
Walter Roberson
on 9 Sep 2021
Right, you have to filter out the items with count 0.
Bjorn Gustavsson
on 9 Sep 2021
Edited: Bjorn Gustavsson
on 9 Sep 2021
It is the zeros in the probabilities that lead to NaNs: you get terms of the form 0*log(0) in the entropy calculation, which evaluate to NaN. Note that there are no zeros in the probabilities vector in your second example.
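To make the origin of the NaN concrete, here is a minimal sketch using a made-up probability vector with one empty bin (the values in p are hypothetical, not taken from the question):

```matlab
% Toy probability vector with one empty bin (hypothetical values)
p = [0.5 0.5 0];
terms = p .* log2(p);    % log2(0) is -Inf, so the last term is 0*(-Inf) = NaN
H_naive = -sum(terms);   % the NaN propagates through sum
```

The usual convention in entropy calculations is to treat 0*log(0) as 0, since p*log(p) tends to 0 as p approaches 0 from above; floating-point arithmetic does not apply that convention on its own.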
Bjorn Gustavsson
on 9 Sep 2021
It should be simple enough to remove those zero-probability-bins:
probs = probabilities;
entropy = -sum(probs(probs(:)>0) .* log2(probs(probs(:)>0) ))
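If a plot is not needed, one possible variant is to get the probabilities from histcounts, which accepts the same 'Normalization' option but does not create a figure. This is only a sketch: the variable name H is my own choice (it avoids shadowing the entropy function from the Image Processing Toolbox), and whether histcounts sidesteps the path problem described in the question has not been established in this thread.

```matlab
X = randn(1,100);
p = histcounts(X, 'Normalization', 'probability');  % bin probabilities, no plot
p = p(p > 0);                                       % drop empty bins to avoid 0*log2(0)
H = -sum(p .* log2(p));                             % entropy in bits
```

Dropping the empty bins does not change the sum of the remaining probabilities, so p still sums to 1.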
z8080
on 9 Sep 2021
You have a finite sample of a distribution, and you are not specifying bin edges or the number of bins.
Under those circumstances, histogram() is documented as using the data to create bins of uniform width that represent the shape of the distribution. However, there is no documentation of the algorithm it uses to select the bin width (number of bins), and the relevant code is inside a .p file, so we cannot look at it.
So you let histogram choose uniform bins for your finite sample of data, using an unknown algorithm to select the bin widths, and some of the bins come up with zero counts.
syms N positive
p = 1/10;
thresh = 1/100;
n = solve((1-p)^N == thresh)
vpa(n)
This shows that for a bin with 10% probability, you would need more than 43 samples before the probability of that bin being empty drops below 1/100. So, with finite samples, empty bins happen.
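The same threshold can be checked without the Symbolic Math Toolbox by taking logarithms of (1-p)^N = thresh. A minimal sketch, where N_exact and N_min are names introduced here for illustration:

```matlab
p = 1/10;                            % probability that a sample lands in the bin
thresh = 1/100;                      % acceptable probability that the bin stays empty
N_exact = log(thresh) / log(1 - p);  % solves (1-p)^N == thresh, about 43.7
N_min = ceil(N_exact);               % smallest whole number of samples: 44
```

With 43 samples the chance of an empty bin is still slightly above 1/100; with 44 it falls just below.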
Walter Roberson
on 10 Sep 2021
Depending on your knowledge of the distribution, it might make sense to ask for the counts and take max(1,counts) to substitute a nominal hit for each empty bin, then calculate the probabilities from that as adjusted_counts ./ sum(adjusted_counts).
The fewer samples you have, the more that distorts the probabilities; the more samples you have, the less likely you are to need it.
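A rough sketch of that adjustment, assuming the raw counts come from histcounts with its default automatic binning (the variable names follow the description above; H is introduced here for the result):

```matlab
X = randn(1,100);
counts = histcounts(X);                           % raw counts per bin
adjusted_counts = max(1, counts);                 % give every empty bin one nominal hit
probs = adjusted_counts ./ sum(adjusted_counts);  % renormalize so the probabilities sum to 1
H = -sum(probs .* log2(probs));                   % entropy in bits, no NaN possible
```

Every bin now has a strictly positive probability, at the cost of the distortion mentioned above when the sample is small.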
But I do recommend figuring out the number of bins yourself somehow, or else you are going to continue to be at the mercy of its undocumented method of selecting the number of bins.
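One way to take control of the binning is to pass the number of bins explicitly. The sketch below uses histcounts, and the value nbins = 10 is an arbitrary placeholder rather than a recommendation:

```matlab
X = randn(1,100);
nbins = 10;                              % hypothetical choice, pick what suits your data
[counts, edges] = histcounts(X, nbins);  % exactly nbins uniform bins spanning the data
probs = counts ./ sum(counts);           % probabilities from the raw counts
```

A vector of bin edges can be passed in place of nbins when the bin boundaries themselves matter, for example to keep them identical across the vectors being compared. Empty bins can still occur with a fixed number of bins, so the zero-probability handling discussed earlier still applies.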
