# datasample

Randomly sample from data, with or without replacement

## Syntax

``y = datasample(data,k)``
``y = datasample(data,k,dim)``
``y = datasample(___,Name,Value)``
``y = datasample(s,___)``
``[y,idx] = datasample(___)``

## Description

`y = datasample(data,k)` returns `k` observations sampled uniformly at random, with replacement, from the data in `data`.

`y = datasample(data,k,dim)` returns a sample taken along dimension `dim` of `data`.

`y = datasample(___,Name,Value)` returns a sample for any of the input arguments in the previous syntaxes, with additional options specified by one or more name-value pair arguments. For example, `'Replace',false` specifies sampling without replacement.

`y = datasample(s,___)` uses the random number stream `s` to generate random numbers. The option `s` can precede any of the input arguments in the previous syntaxes.

`[y,idx] = datasample(___)` also returns an index vector indicating which values `datasample` sampled from `data` using any of the input arguments in the previous syntaxes.

## Examples

Create the random number stream for reproducibility.

`s = RandStream('mlfg6331_64'); `

Draw five unique values from the integers `1` to `10`.

`y = datasample(s,1:10,5,'Replace',false)`
```
y = 1×5

     9     8     3     6     2
```

Create the random number stream for reproducibility.

`s = RandStream('mlfg6331_64');`

Generate `48` random characters from the sequence `ACGT` per specified probabilities.

`seq = datasample(s,'ACGT',48,'Weights',[0.15 0.35 0.35 0.15])`
```seq = 'GGCGGCGCAAGGCGCCGGACCTGGCTGCACGCCGTTCCCTGCTACTCG' ```

Set the random seed for reproducibility of the results.

`rng(10,'twister') `

Generate a matrix with 10 rows and 1000 columns.

`X = randn(10,1000);`

Create the random number stream for reproducibility within `datasample`.

`s = RandStream('mlfg6331_64');`

Randomly select five unique columns from `X`.

`Y = datasample(s,X,5,2,'Replace',false)`
```
Y = 10×5

    0.4317   -0.3327    0.9112   -2.3244    0.9559
    0.6977   -0.7422    0.4578   -1.3745   -0.8634
   -0.8543   -0.3105    0.9836   -0.6434   -0.4457
    0.1686    0.6609   -0.0553   -0.1202   -1.3699
   -1.7649   -1.1607   -0.3513   -1.5533    0.0597
   -0.3821    0.5696   -1.6264   -0.2104   -1.5486
   -1.6844    0.7148   -0.6876   -0.4447   -1.4615
   -0.4170    1.3696    1.1874   -0.9901    0.5875
   -0.2410    1.4703   -2.5003   -1.1321   -1.8451
    0.6212    1.4118   -0.4518    0.8697    0.8093
```

Resample observations from a dataset array to create a bootstrap replicate data set. See Bootstrap Resampling for more information about bootstrapping.

`load hospital`

Create a data set that has the same size as the `hospital` data set and contains random samples chosen with replacement from the `hospital` data set.

`y = datasample(hospital,size(hospital,1));`

Select samples from data based on indices of a sample chosen from another vector.

Generate two random vectors.

```x1 = randn(100,1); x2 = randn(100,1);```

Select a sample of `10` elements from vector `x1`, and return the indices of the sample in vector `idx`.

`[y1,idx] = datasample(x1,10);`

Select a sample of `10` elements from vector `x2` using the indices in vector `idx`.

`y2 = x2(idx);`

## Input Arguments

`data`: Input data from which to sample, specified as a vector, matrix, multidimensional array, table, or dataset array. By default, `datasample` samples from the first nonsingleton dimension of `data`. For example, if `data` is a matrix, then `datasample` samples from the rows. Change this behavior with the `dim` input argument.

Data Types: `single` | `double` | `logical` | `char` | `string` | `table`

`k`: Number of samples, specified as a positive integer.

Example: `datasample(data,100)` returns 100 observations sampled uniformly and at random from the data in `data`.

Data Types: `single` | `double`

`dim`: Dimension to sample, specified as a positive integer. For example, if `data` is a matrix and `dim` is `2`, `y` contains a selection of columns in `data`. If `data` is a table or dataset array and `dim` is `2`, `y` contains a selection of variables in `data`. Use `dim` to ensure sampling along a specific dimension regardless of whether `data` is a vector, matrix, or N-dimensional array.

Data Types: `single` | `double`
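As a minimal sketch of how `dim` changes the shape of the result, the following samples rows and then columns from a small matrix. The matrix `X` and the variable names are illustrative assumptions, not part of the original examples.

```matlab
% Illustrative 3-by-4 matrix (an assumption for this sketch).
X = reshape(1:12,3,4);

% dim = 1 samples rows: the result keeps all 4 columns.
yRows = datasample(X,2,1);

% dim = 2 samples columns: the result keeps all 3 rows.
yCols = datasample(X,2,2);

% For a row vector, the default dimension is the first nonsingleton one (dim 2).
v = 1:5;
yVec = datasample(v,3);

sizeRows = size(yRows);
sizeCols = size(yCols);
sizeVec  = size(yVec);
```

Here `sizeRows` is `[2 4]`, `sizeCols` is `[3 2]`, and `sizeVec` is `[1 3]`; the sampled values themselves vary from run to run.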

`s`: Random number stream, specified as the global stream or `RandStream`. For example, `s = RandStream('mlfg6331_64')` creates a random number stream that uses the multiplicative lagged Fibonacci generator algorithm. For details, see Creating and Controlling a Random Number Stream.

The `rng` function provides a simple way to control the global stream. For example, `rng(seed)` seeds the random number generator using the nonnegative integer seed. For details, see Managing the Global Stream Using RandStream.

### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'Replace',false,'Weights',ones(datasize,1)` samples without replacement and with probability proportional to the elements of `Weights`, where `datasize` is the size of the dimension being sampled.

Indicator for sampling with replacement, specified as the comma-separated pair consisting of `'Replace'` and either `true` or `false`.

Sample with replacement if `'Replace'` is `true`, or without replacement if `'Replace'` is `false`. If `'Replace'` is `false`, then `k` must not be larger than the size of the dimension being sampled. For example, if ```data = [1 3 Inf; 2 4 5]``` and `y = datasample(data,k,'Replace',false)`, then `k` cannot be larger than `2`.

Data Types: `logical`
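The limit on `k` can be checked with a short sketch that reuses the 2-by-3 matrix from the description above. The `try`/`catch` wrapper and variable names are illustrative assumptions; the exact error message is not shown because it is not documented here.

```matlab
% The 2-by-3 matrix from the description; rows are sampled by default.
data = [1 3 Inf; 2 4 5];

% k = 2 is the largest valid request without replacement.
y = datasample(data,2,'Replace',false);

% Both rows appear exactly once, in some random order.
isPermutation = isequal(sortrows(y),sortrows(data));

% Asking for 3 distinct rows from 2 is expected to raise an error.
try
    datasample(data,3,'Replace',false);
    errored = false;
catch
    errored = true;
end
```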

Sampling weights, specified as the comma-separated pair consisting of `'Weights'` and a vector of nonnegative numeric values. The vector is of size `datasize`, where `datasize` is the size of the dimension being sampled. The vector must have at least one positive value and cannot contain `NaN` values. The `datasample` function samples with probability proportional to the elements of `'Weights'`.

Example: ```'Weights',[0.1 0.5 0.35 0.46]```

Data Types: `single` | `double`
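The following sketch illustrates that the weights are relative rather than normalized probabilities, and that an element with zero weight is never drawn. The data and weight values are hypothetical.

```matlab
% Hypothetical data and weights; the weights do not need to sum to 1.
data = [10 20 30];
w = [0 1 3];

% Draw 1000 samples with replacement, proportional to w.
y = datasample(data,1000,'Weights',w);

% The zero-weight element is never selected.
neverTen = ~any(y == 10);

% The share of 30s should be close to 3/(1+3) = 0.75, up to sampling noise.
fracThirty = mean(y == 30);
```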

## Output Arguments

`y`: Sample, returned as a vector, matrix, multidimensional array, table, or dataset array.

• If `data` is a vector, then `y` is a vector containing `k` elements selected from `data`.

• If `data` is a matrix and `dim` = `1`, then `y` is a matrix containing `k` rows selected from `data`. Or, if `dim` = `2`, then `y` is a matrix containing `k` columns selected from `data`.

• If `data` is an N-dimensional array and you do not specify `dim`, then `y` is an N-dimensional array of samples taken along the first nonsingleton dimension of `data`. Or, if you specify a value for the `dim` input argument, `datasample` samples along the dimension `dim`.

• If `data` is a table and `dim` = `1`, then `y` is a table containing `k` rows selected from `data`. Or, if `dim` = `2`, then `y` is a table containing `k` variables selected from `data`.

• If `data` is a dataset array and `dim` = `1`, then `y` is a dataset array containing `k` rows selected from `data`. Or, if `dim` = `2`, then `y` is a dataset array containing `k` variables selected from `data`.

If the input `data` contains missing observations that are represented as `NaN` values, `datasample` samples from the entire input, including the `NaN` values. For example, `y = datasample([NaN 6 14],2)` can return `y = NaN 14`.
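If missing values should not appear in the sample, one option is to filter them out before calling `datasample`. This is a minimal sketch using `isnan`; the variable names are illustrative.

```matlab
% Input with a missing observation.
x = [NaN 6 14];

% Keep only the non-missing values, then sample from those.
valid = x(~isnan(x));
y = datasample(valid,2);

% The sample now contains no NaN values.
hasMissing = any(isnan(y));
```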

When the sample is taken with replacement (default), `y` can contain repeated observations from `data`. Set the `Replace` name-value pair argument to `false` to sample without replacement.

`idx`: Indices, returned as a vector indicating which elements `datasample` chooses from `data` to create `y`. For example:

• If `data` is a vector, then `y = data(idx)`.

• If `data` is a matrix and `dim` = `1`, then `y = data(idx,:)`.

• If `data` is a matrix and `dim` = `2`, then `y = data(:,idx)`.
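The indexing relationships listed above can be verified directly. This sketch uses `magic(4)` as a stand-in data matrix, which is an assumption made for illustration.

```matlab
% Stand-in data matrix for this sketch.
X = magic(4);

% Sample 3 rows and confirm that idx reproduces the sample.
[yRows,idxRows] = datasample(X,3);
sameRows = isequal(yRows,X(idxRows,:));

% Sample 2 columns (dim = 2) and confirm the column indexing form.
[yCols,idxCols] = datasample(X,2,2);
sameCols = isequal(yCols,X(:,idxCols));
```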

## Tips

• To sample random integers with replacement from a range, use `randi`.

• To sample random integers without replacement, use `randperm` or `datasample`.

• To randomly sample from data, with or without replacement, use `datasample`.

## Algorithms

`datasample` uses `randperm`, `rand`, or `randi` to generate random values. Therefore, `datasample` changes the state of the MATLAB® global random number generator. Control the random number generator using `rng`.
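Because `datasample` draws from the global stream when no stream `s` is passed, resetting the generator with `rng` before each call should reproduce the same sample. A minimal sketch, with an arbitrary seed value:

```matlab
% Seed the global generator, then sample.
rng(42,'twister');
y1 = datasample(1:100,5);

% Reset to the same seed and sample again.
rng(42,'twister');
y2 = datasample(1:100,5);

% The two samples match because the generator state was identical.
same = isequal(y1,y2);
```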

For selecting weighted samples without replacement, `datasample` uses the algorithm of Wong and Easton [1].
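To make the idea of weighted sampling without replacement concrete, here is a simple sequential sketch: draw one index in proportion to the remaining weights, zero out its weight, and repeat. It does not attempt to reproduce the Wong and Easton algorithm or the internals of `datasample`; the weights and names are illustrative assumptions.

```matlab
% Illustrative weights and sample size (assumptions for this sketch).
w = [0.1 0.5 0.35 0.46];
k = 3;

idx = zeros(1,k);
remaining = w;
for i = 1:k
    % Cumulative weights of the items that are still available.
    c = cumsum(remaining);
    % Pick the first item whose cumulative weight covers a uniform draw.
    r = rand*c(end);
    idx(i) = find(r <= c,1,'first');
    % Remove the chosen item from later draws.
    remaining(idx(i)) = 0;
end
```

Each pass recomputes the cumulative sum, so this naive version costs time proportional to the number of items for every draw.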

## Alternative Functionality

You can use `randi` or `randperm` to generate indices for random sampling with or without replacement, respectively. However, `datasample` can be more convenient to use because it samples directly from your data. `datasample` also allows weighted sampling.
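A minimal sketch of the index-based alternatives, assuming a hypothetical column vector `data` whose rows are the observations:

```matlab
% Hypothetical data: 20 observations in a column vector.
data = (1:20)'.^2;
n = size(data,1);
k = 5;

% With replacement: randi can repeat indices.
idxWith = randi(n,k,1);
yWith = data(idxWith,:);

% Without replacement: randperm(n,k) returns k distinct indices.
idxWithout = randperm(n,k);
yWithout = data(idxWithout,:);
```

With `datasample`, the equivalent calls are `datasample(data,k)` and `datasample(data,k,'Replace',false)`, which skip the explicit indexing step.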

## References

[1] Wong, C. K., and M. C. Easton. "An Efficient Method for Weighted Sampling Without Replacement." SIAM Journal on Computing 9(1), pp. 111–113, 1980.