# `stats`::`ksGOFT`

The Kolmogorov-Smirnov goodness-of-fit test

MuPAD® notebooks will be removed in a future release. Use MATLAB® live scripts instead.

MATLAB live scripts support most MuPAD functionality, though there are some differences. For more information, see Convert MuPAD Notebooks to MATLAB Live Scripts.

## Syntax

```
stats::ksGOFT(x1, x2, …, CDF = f)
stats::ksGOFT([x1, x2, …], CDF = f)
stats::ksGOFT(s, <c>, CDF = f)
```

## Description

`stats::ksGOFT`([x1, x2, …], CDF = f) applies the Kolmogorov-Smirnov goodness-of-fit test for the null hypothesis: “x1, x2, … is an `f`-distributed sample”.

External statistical data stored in an ASCII file can be imported into a MuPAD® session via `import::readdata`. In particular, see Example 1 of the corresponding help page.

An error is raised if any of the data cannot be converted to a real floating-point number.

Let y1, …, yn be the input data x1, …, xn arranged in ascending order. `stats::ksGOFT` returns the list containing the following information:

1. `K1` is the Kolmogorov-Smirnov statistic K1 = sqrt(n) · max(i/n − f(y<sub>i</sub>), i = 1 … n).

2. `p1` is the observed significance level of the statistic `K1`.

3. `K2` is the Kolmogorov-Smirnov statistic K2 = sqrt(n) · max(f(y<sub>i</sub>) − (i − 1)/n, i = 1 … n).

4. `p2` is the observed significance level of the statistic `K2`.

For the Kolmogorov-Smirnov statistic K corresponding to K1 or K2, respectively, the observed significance levels `p1`, `p2` are computed by an asymptotic approximation of the exact probabilities Pr(K > K1) and Pr(K > K2).

For large n, these probabilities are approximated by Pr(K > t) ≈ e<sup>−2t²</sup> · (1 − 2t/(3 sqrt(n))).

Thus, the observed significance levels returned by `stats::ksGOFT` approximate the exact probabilities for large n. Roughly speaking, for n = 10, the 3 leading digits of `p1`, `p2` correspond to the exact probabilities. For n = 100, the 4 leading digits of `p1`, `p2` correspond to the exact probabilities. For n = 1000, the 6 leading digits of `p1`, `p2` correspond to the exact probabilities.
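The statistics and observed significance levels described above can be sketched in Python. This is a minimal illustration of the formulas, not the `stats` package implementation; the function name `ks_goft` and the returned pairs are hypothetical:

```python
import math

def ks_goft(data, cdf):
    """Sketch of the Kolmogorov-Smirnov goodness-of-fit computation.

    data: list of real numbers; cdf: the hypothesized cumulative
    distribution function, mapping reals to [0, 1].
    """
    # Sort the data: y1 <= y2 <= ... <= yn
    y = sorted(data)
    n = len(y)
    # K1 = sqrt(n) * max(i/n - f(y_i)), K2 = sqrt(n) * max(f(y_i) - (i-1)/n)
    k1 = math.sqrt(n) * max((i + 1) / n - cdf(v) for i, v in enumerate(y))
    k2 = math.sqrt(n) * max(cdf(v) - i / n for i, v in enumerate(y))

    def p_value(k):
        # Asymptotic approximation of Pr(K > k) for large n
        return math.exp(-2.0 * k * k) * (1.0 - 2.0 * k / (3.0 * math.sqrt(n)))

    return [("PValue1", p_value(k1)), ("StatValue1", k1),
            ("PValue2", p_value(k2)), ("StatValue2", k2)]
```

For data that match the hypothesized distribution well, both statistics are small and both p-values are close to 1; a poor fit drives at least one statistic up and the corresponding p-value toward 0.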

The observed significance level `PValue1 = p1` returned by `stats::ksGOFT` has to be interpreted in the following way:

Under the null hypothesis, the probability p1 = Pr(K > K1) should not be small. Specifically, p1 = Pr(K > K1) ≥ α should hold for a given significance level α (a typical value is α = 0.05). If this condition is violated, the hypothesis may be rejected at level α.

Thus, if the observed significance level `p1` = Pr(K > K1) satisfies p1 < α, the sample leading to the value `K1` of the statistic K represents an unlikely event, and the null hypothesis may be rejected at level α.

The corresponding interpretation holds for `PValue2 = p2`: if `p2` = Pr(K > K2) satisfies p2 < α, the null hypothesis may be rejected at level α.

Note that both observed significance levels `p1`, `p2` must be sufficiently large to make the data pass the test. The null hypothesis may be rejected at level α if any of the two values is smaller than α.
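This decision rule can be stated compactly in Python (a hedged illustration; `ks_verdict` is a hypothetical helper, not part of the `stats` package):

```python
def ks_verdict(p1, p2, alpha=0.05):
    """Reject the null hypothesis at level alpha if either observed
    significance level is smaller than alpha."""
    return "reject" if min(p1, p2) < alpha else "do not reject"
```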

If p1 and p2 are both close to 1, this should raise suspicion about the randomness of the data: they indicate a fit that is too good.

Distributions that are not provided by the `stats` package can be implemented easily by the user. A user-defined procedure f can implement any cumulative distribution function; `stats::ksGOFT` calls f(x) with real floating-point arguments from the data sample. The function f must return a numerical real value between 0 and 1. Cf. Example 3.

## Environment Interactions

The function is sensitive to the environment variable `DIGITS` which determines the numerical working precision.

## Examples

### Example 1

We create a sample of 1000 normally distributed random numbers:

```
r := stats::normalRandom(0, 1, Seed = 123):
data := [r() $ i = 1..1000]:
```

We test whether these data are indeed normally distributed with mean 0 and variance 1. We pass the corresponding cumulative distribution function `stats::normalCDF(0, 1)` to `stats::ksGOFT`:

```
stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
```


The result shows that the data can be accepted as a sample of normally distributed numbers: both observed significance levels `PValue1` and `PValue2` are not small.

Next, we inject some further data into the sample:

```
data := data . [frandom() $ i = 1..100]:
stats::ksGOFT(data, CDF = stats::normalCDF(0, 1))
```

Now, the data should not be accepted as a sample of normal deviates with mean 0 and variance 1, because the second observed significance level `PValue2` is very small.

`delete r, data:`

### Example 2

We create a sample consisting of one string column and two non-string columns:

```
s := stats::sample(
  [["1996", 1242, PI - 1/2], ["1997", 1353, PI + 0.3],
   ["1998", 1142, PI + 0.5], ["1999", 1201, PI - 1],
   ["2001", 1201, PI]])
```
```
"1996" 1242 PI - 1/2
"1997" 1353 PI + 0.3
"1998" 1142 PI + 0.5
"1999" 1201 PI - 1
"2001" 1201 PI
```

We consider the data in the third column. The mean and the variance of these data are computed:

`[m, v] := [stats::mean(s, 3), stats::variance(s, 3)]`

We check whether the data of the third column are normally distributed with the mean and variance computed above:

`stats::ksGOFT(s, 3, CDF = stats::normalCDF(m, v))`

Both observed significance levels `PValue1` and `PValue2` returned by the test are not small. There is no reason to reject the null hypothesis that the data are normally distributed.

`delete s, m, v:`

### Example 3

We demonstrate how user-defined distribution functions can be used. The following function represents the cumulative distribution function Pr(X ≤ x) = x² of a variable X supported on the interval [0, 1]. It will be called with floating-point arguments x and must return numerical values between 0 and 1:

```
f := proc(x)
begin
  if x <= 0 then return(0)
  elif x < 1 then return(x^2)
  else return(1)
  end_if
end_proc:
```

We test the hypothesis that the following data are f-distributed:

```
data := [sqrt(frandom()) $ k = 1..10^2]:
stats::ksGOFT(data, CDF = f)
```

At a given significance level of 0.1, say, the hypothesis should not be rejected: both observed significance levels `p1` and `p2` exceed 0.1.

`delete f, data:`

## Parameters

- `x1, x2, …` — The statistical data: real numerical values
- `f` — A procedure representing a cumulative distribution function. Typically, one of the distribution functions of the `stats` package, such as `stats::normalCDF(m, v)` etc.
- `s` — A sample of domain type `stats::sample`
- `c` — An integer representing a column index of the sample `s`. This column provides the data `x1`, `x2`, etc. There is no need to specify a column number `c` if the sample has only one column.

## Return Values

List of four equations `[PValue1 = p1, StatValue1 = K1, PValue2 = p2, StatValue2 = K2]` with floating-point values `p1`, `K1`, `p2`, `K2`. See the “Description” section for the interpretation of these values.

## References

D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, p. 48. Addison-Wesley (1998).