How to best determine the probability of a distribution given an outlying observation?

Hi,
I have a classification problem. I have a set of data from a reference process (let's call that "known") and a set of data from a second process (let's call that "test").
Hypothesis 0 is that the test sample came from the same process as the "known" sample, and will therefore have the same distribution.
Hypothesis 1 is that the test sample came from a different process. However, here is the catch: for all but one sample, this process has an identical distribution to the "known". Just one sample will be "suspiciously" low.
I will add a picture to better explain:
In this case, the red histogram is the reference "known" distribution. The blue histogram is the questioned "test" distribution. In this case, I already know that the test came from a different process. It might not be completely clear due to the overlaying, but it can be seen that the distributions pretty well match, except for a single blue sample which is suspiciously low.
What I need now is to take each distribution and work out some method of returning a probability that the extremely low blue value would be observed given the distribution is the "known" distribution. I know how to calculate the probability of a particular single observation, but how do I properly balance this with the number of observations? Would just a KS test be appropriate? It strikes me as stats 101, but it's been a while, and I don't want to get this wrong.
Thanks in advance.

Accepted Answer

Ilya on 12 Sep 2012
Edited: Ilya on 12 Sep 2012
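If you know the reference distribution analytically, you can compute its cdf at the smallest observed value in the test sample. Suppose this cdf value is p. Under Hypothesis 0, each of the N observations in the test sample independently falls at or below that value with probability p, where N is the test sample size. The p-value for your test is then the probability that at least one of the N observations falls that low, which is one minus the binomial probability of observing no successes in N trials with success probability p. That is, it would be 1-(1-p)^N.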
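As a rough illustration of the formula above, here is a minimal sketch that assumes the reference distribution is a standard normal and borrows the numbers from the asker's example (a lowest value of -4 in a test sample of 100). The variable names are made up for illustration, and the cdf is written with `erfc` only to avoid a toolbox dependency; `normcdf` from the Statistics Toolbox should give the same value.

```matlab
% Assumption: the reference ("known") distribution is a standard normal.
% Toolbox-free standard normal cdf, equivalent to normcdf(z).
stdNormCdf = @(z) 0.5 * erfc(-z ./ sqrt(2));

xMin = -4;     % smallest observed value in the test sample (illustrative)
N    = 100;    % test sample size (illustrative)

p      = stdNormCdf(xMin);    % chance that a single draw is <= xMin
pValue = 1 - (1 - p)^N;       % chance that at least one of N draws is <= xMin

% Same quantity, written to avoid round-off when p is very small
pValueStable = -expm1(N * log1p(-p));

fprintf('p = %.3e, p-value = %.4f\n', p, pValue);
```

With these inputs p is roughly 3.2e-5 and the p-value comes out at about 0.0032, so a value that low would be quite unusual for 100 draws from the reference distribution. The `expm1`/`log1p` form is optional; it only matters when p is so small that 1-p rounds to 1.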
  1 Comment
Tim on 19 Sep 2012
Oh, so obvious now! Thank you. I was over-thinking it with the variance of the variance and all that jazz. My only excuses are lack of sleep and rusty stats - honestly, I avoid them when I can.


More Answers (1)

per isakson on 12 Sep 2012
See: FBD - "Find the Best Distribution" tool in the File Exchange
  1 Comment
Tim on 12 Sep 2012
Thanks for your answer, per, but I'm not sure that this is what I'm looking for. I'll try and clarify with a simple code example.
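KnownSet = randn(1000,1);         % reference sample from the "known" process
TestSet1 = randn(100,1);          % test sample drawn from the same distribution
TestSet2 = [randn(99,1); -4];     % same distribution plus one planted outlier at -4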
In this case, I know all three sets of data are mostly drawn from the same Gaussian distribution. However, TestSet2 has an outlier. The value -4 is very unlikely, and I'm hoping to use that single outlying value to provide a probability that each TestSet is purely from the same distribution as KnownSet. In this case, TestSet1 should have a high 'p-value', and TestSet2 should have a low 'p-value' and be rejected. I use the term p-value, but there might be something else.
FBD would help me determine the distribution of KnownSet (which I can assume is at least for the most part the same as that of the TestSets), but that is only the first step. How do I go from there to determining how likely/unlikely the set of observations is, given the distribution, and given the outlier?
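One way to go from a fitted distribution to a p-value is the 1-(1-p)^N formula from the accepted answer, with p taken as the fitted cdf at the smallest value of each test set. The sketch below assumes the reference process is Gaussian and estimates its mean and standard deviation from KnownSet; the helper names `normCdf` and `minPValue` are hypothetical, and the fixed seed is there only to make the run repeatable.

```matlab
rng(0);                           % fixed seed, for reproducibility only
KnownSet = randn(1000,1);
TestSet1 = randn(100,1);
TestSet2 = [randn(99,1); -4];

% Assumption: the reference process is Gaussian, with parameters
% estimated from KnownSet rather than known analytically.
mu    = mean(KnownSet);
sigma = std(KnownSet);

% Fitted normal cdf (toolbox-free; same as normcdf(x, mu, sigma))
normCdf = @(x) 0.5 * erfc(-(x - mu) ./ (sigma * sqrt(2)));

% p-value for the sample minimum: 1-(1-p)^N, written in a stable form
minPValue = @(x) -expm1(numel(x) * log1p(-normCdf(min(x))));

pValue1 = minPValue(TestSet1);
pValue2 = minPValue(TestSet2);
fprintf('TestSet1: p-value = %.4f\n', pValue1);
fprintf('TestSet2: p-value = %.4f\n', pValue2);
```

TestSet2 should come out well below a conventional 0.05 threshold because of the planted -4. TestSet1 is not guaranteed a high value: when the test set really comes from the reference distribution, this p-value is approximately uniform between 0 and 1, so it falls below 0.05 only about 5% of the time.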
