# templateNaiveBayes

Naive Bayes classifier template

## Syntax

```
t = templateNaiveBayes()
t = templateNaiveBayes(Name,Value)
```

## Description

`t = templateNaiveBayes()` returns a naive Bayes template suitable for training error-correcting output code (ECOC) multiclass models. If you specify a default template, then the software uses default values for all input arguments during training. Specify `t` as a learner in `fitcecoc`.

`t = templateNaiveBayes(Name,Value)` returns a template with additional options specified by one or more name-value pair arguments. All properties of `t` are empty, except those you specify using `Name,Value` pair arguments. For example, you can specify distributions for the predictors.

If you display `t` in the Command Window, then all options appear empty (`[]`), except those that you specify using name-value pair arguments. During training, the software uses default values for empty options.

## Examples


Use `templateNaiveBayes` to specify a default naive Bayes template.

`t = templateNaiveBayes()`
```
t = 
Fit template for classification NaiveBayes.
    DistributionNames: [1x0 double]
               Kernel: []
              Support: []
                Width: []
              Version: 1
               Method: 'NaiveBayes'
                 Type: 'classification'
```

All properties of the template object are empty except for `Method` and `Type`. When you pass `t` to the training function, the software fills in the empty properties with their respective default values. For example, the software fills the `DistributionNames` property with a 1-by-`D` cell array of character vectors with `'normal'` in each cell, where `D` is the number of predictors. For details on other default values, see `fitcnb`.

`t` is a plan for a naive Bayes learner, and no computation occurs when you specify it. You can pass `t` to `fitcecoc` to specify naive Bayes binary learners for ECOC multiclass learning.

Create a nondefault naive Bayes template for use in `fitcecoc`.

`load fisheriris`

Create a template for naive Bayes binary classifiers, and specify kernel distributions for all predictors.

`t = templateNaiveBayes('DistributionNames','kernel')`
```
t = 
Fit template for classification NaiveBayes.
    DistributionNames: 'kernel'
               Kernel: []
              Support: []
                Width: []
              Version: 1
               Method: 'NaiveBayes'
                 Type: 'classification'
```

All properties of the template object are empty except for `DistributionNames`, `Method`, and `Type`. When you pass `t` to the training function, the software fills in the empty properties with their respective default values.

Specify `t` as a binary learner for an ECOC multiclass model.

`Mdl = fitcecoc(meas,species,'Learners',t);`

By default, the software trains `Mdl` using the one-versus-one coding design.

Display the in-sample (resubstitution) misclassification error.

`L = resubLoss(Mdl,'LossFun','classiferror')`
```
L = 0.0333
```
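The resubstitution error is an optimistic estimate of generalization error. As a sketch, assuming you want an out-of-sample estimate instead, you can cross-validate the trained ECOC model with `crossval` and `kfoldLoss` (10 folds by default):

```matlab
rng(1)                     % for reproducible fold assignment
CVMdl = crossval(Mdl);     % 10-fold cross-validated ECOC model
genErr = kfoldLoss(CVMdl)  % estimated generalization (classification) error
```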

## Input Arguments


### Name-Value Pair Arguments

Specify optional comma-separated pairs of `Name,Value` arguments. `Name` is the argument name and `Value` is the corresponding value. `Name` must appear inside quotes. You can specify several name and value pair arguments in any order as `Name1,Value1,...,NameN,ValueN`.

Example: `'DistributionNames','mn'` specifies to treat all predictors as token counts for a multinomial model.

Data distributions `fitcnb` uses to model the data, specified as the comma-separated pair consisting of `'DistributionNames'` and a character vector or string scalar, a string array, or a cell array of character vectors with values from this table.

| Value | Description |
| --- | --- |
| `'kernel'` | Kernel smoothing density estimate. |
| `'mn'` | Multinomial distribution. If you specify `'mn'`, then all features are components of a multinomial distribution. Therefore, you cannot include `'mn'` as an element of a string array or a cell array of character vectors. For details, see Algorithms. |
| `'mvmn'` | Multivariate multinomial distribution. For details, see Algorithms. |
| `'normal'` | Normal (Gaussian) distribution. |

If you specify a character vector or string scalar, then the software models all the features using that distribution. If you specify a 1-by-P string array or cell array of character vectors, then the software models feature j using the distribution in element j of the array.

By default, the software sets all predictors specified as categorical predictors (using the `CategoricalPredictors` name-value pair argument) to `'mvmn'`. Otherwise, the default distribution is `'normal'`.

You must specify that at least one predictor has distribution `'kernel'` to additionally specify `Kernel`, `Support`, or `Width`.

Example: `'DistributionNames','mn'`

Example: `'DistributionNames',{'kernel','normal','kernel'}`
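As a hedged illustration, a per-predictor specification for the four-predictor `fisheriris` data might look like the following; which predictors get `'kernel'` here is arbitrary, not prescriptive:

```matlab
load fisheriris   % meas: 150-by-4 numeric predictors; species: class labels

% Model predictors 1 and 3 with kernel densities, and 2 and 4 as normal.
t = templateNaiveBayes('DistributionNames', ...
    {'kernel','normal','kernel','normal'});

% Use the template for each binary learner in an ECOC model.
Mdl = fitcecoc(meas,species,'Learners',t);
```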

Kernel smoother type, specified as the comma-separated pair consisting of `'Kernel'` and a character vector or string scalar, a string array, or a cell array of character vectors.

This table summarizes the available kernel smoother types. Let I{u} denote the indicator function.

| Value | Kernel | Formula |
| --- | --- | --- |
| `'box'` | Box (uniform) | $f(x)=0.5\,I\{\lvert x\rvert\le 1\}$ |
| `'epanechnikov'` | Epanechnikov | $f(x)=0.75\left(1-x^{2}\right)I\{\lvert x\rvert\le 1\}$ |
| `'normal'` | Gaussian | $f(x)=\frac{1}{\sqrt{2\pi}}\exp\left(-0.5x^{2}\right)$ |
| `'triangle'` | Triangular | $f(x)=\left(1-\lvert x\rvert\right)I\{\lvert x\rvert\le 1\}$ |

If you specify a 1-by-P string array or cell array, with each element of the array containing any value in the table, then the software trains the classifier using the kernel smoother type in element j for feature j in `X`. The software ignores elements of `Kernel` not corresponding to a predictor whose distribution is `'kernel'`.

You must specify that at least one predictor has distribution `'kernel'` to additionally specify `Kernel`, `Support`, or `Width`.

Example: `'Kernel',{'epanechnikov','normal'}`
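For instance, a sketch pairing kernel predictors with different smoother types on `fisheriris`; the specific choices here are illustrative, and entries for predictors whose distribution is not `'kernel'` are ignored:

```matlab
load fisheriris

% Predictors 1 and 3 use kernel densities, so their Kernel entries apply.
% The Kernel entries for predictors 2 and 4 are placeholders: the software
% ignores them because those predictors do not have distribution 'kernel'.
t = templateNaiveBayes( ...
    'DistributionNames',{'kernel','normal','kernel','normal'}, ...
    'Kernel',{'epanechnikov','normal','box','normal'});

Mdl = fitcecoc(meas,species,'Learners',t);
```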

Kernel smoothing density support, specified as the comma-separated pair consisting of `'Support'` and `'positive'`, `'unbounded'`, a string array, a cell array, or a numeric row vector. The software applies the kernel smoothing density to the specified region.

This table summarizes the available options for setting the kernel smoothing density region.

| Value | Description |
| --- | --- |
| 1-by-2 numeric row vector | For example, `[L,U]`, where `L` and `U` are the finite lower and upper bounds, respectively, for the density support. |
| `'positive'` | The density support is all positive real values. |
| `'unbounded'` | The density support is all real values. |

If you specify a 1-by-P string array or cell array, with each element containing any value in the table (text values only for string array elements), then the software trains the classifier using the kernel support in element j for feature j in `X`. The software ignores elements of `Support` not corresponding to a predictor whose distribution is `'kernel'`.

You must specify that at least one predictor has distribution `'kernel'` to additionally specify `Kernel`, `Support`, or `Width`.

Example: `'Support',{[-10,20],'unbounded'}`

Data Types: `char` | `string` | `cell` | `double`
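As an illustrative sketch, you might restrict the density support per kernel predictor; the bounds shown are arbitrary, and entries for non-kernel predictors are ignored:

```matlab
% Predictor 1: kernel density restricted to positive values.
% Predictor 3: kernel density with finite support on [0, 10].
% Entries for predictors 2 and 4 (distribution 'normal') are ignored.
t = templateNaiveBayes( ...
    'DistributionNames',{'kernel','normal','kernel','normal'}, ...
    'Support',{'positive','unbounded',[0,10],'unbounded'});
```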

Kernel smoothing window width, specified as the comma-separated pair consisting of `'Width'` and a matrix of numeric values, numeric column vector, numeric row vector, or scalar.

Suppose there are K class levels and P predictors. This table summarizes the available options for setting the kernel smoothing window width.

| Value | Description |
| --- | --- |
| K-by-P matrix of numeric values | Element (k,j) specifies the width for predictor j in class k. |
| K-by-1 numeric column vector | Element k specifies the width for all predictors in class k. |
| 1-by-P numeric row vector | Element j specifies the width in all class levels for predictor j. |
| Scalar | Specifies the bandwidth for all features in all classes. |

By default, the software selects a default width automatically for each combination of predictor and class by using a value that is optimal for a Gaussian distribution. If you specify `Width` and it contains `NaN`s, then the software selects widths for the elements containing `NaN`s.

You must specify that at least one predictor has distribution `'kernel'` to additionally specify `Kernel`, `Support`, or `Width`.

Example: `'Width',[NaN NaN]`

Data Types: `double` | `struct`
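For example, a sketch that fixes the bandwidth for the first predictor and lets the software choose the rest (the value 0.5 is arbitrary, and four predictors are assumed):

```matlab
% 1-by-4 row vector: one width per predictor, applied in all classes.
% NaN entries tell the software to select those widths automatically.
t = templateNaiveBayes( ...
    'DistributionNames','kernel', ...
    'Width',[0.5 NaN NaN NaN]);
```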

## Output Arguments


Naive Bayes classification template suitable for training error-correcting output code (ECOC) multiclass models, returned as a template object. Pass `t` to `fitcecoc` to specify how to create the naive Bayes classifier for the ECOC model.

If you display `t` in the Command Window, then all unspecified options appear empty (`[]`). However, the software replaces empty options with their corresponding default values during training.

## More About

### Naive Bayes

Naive Bayes is a classification algorithm that applies density estimation to the data.

The algorithm leverages Bayes theorem, and (naively) assumes that the predictors are conditionally independent, given the class. Though the assumption is usually violated in practice, naive Bayes classifiers tend to yield posterior distributions that are robust to biased class density estimates, particularly where the posterior is 0.5 (the decision boundary) [1].

Naive Bayes classifiers assign observations to the most probable class (in other words, the maximum a posteriori decision rule). Explicitly, the algorithm:

1. Estimates the densities of the predictors within each class.

2. Models posterior probabilities according to Bayes rule. That is, for all k = 1,...,K,

$\hat{P}\left(Y=k\mid X_{1},\ldots,X_{P}\right)=\frac{\pi\left(Y=k\right)\prod_{j=1}^{P}P\left(X_{j}\mid Y=k\right)}{\sum_{k=1}^{K}\pi\left(Y=k\right)\prod_{j=1}^{P}P\left(X_{j}\mid Y=k\right)},$

where:

• Y is the random variable corresponding to the class index of an observation.

• X1,...,XP are the random predictors of an observation.

• $\pi \left(Y=k\right)$ is the prior probability that a class index is k.

3. Classifies an observation by estimating the posterior probability for each class, and then assigns the observation to the class yielding the maximum posterior probability.

If the predictors compose a multinomial distribution, then the posterior probability satisfies $\hat{P}\left(Y=k\mid X_{1},\ldots,X_{P}\right)\propto\pi\left(Y=k\right)P_{mn}\left(X_{1},\ldots,X_{P}\mid Y=k\right),$ where $P_{mn}\left(X_{1},\ldots,X_{P}\mid Y=k\right)$ is the probability mass function of a multinomial distribution.
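As a small worked instance of the decision rule (hypothetical numbers: two classes with equal priors and a single predictor):

```latex
% Suppose \pi(Y=1)=\pi(Y=2)=0.5, P(x \mid Y=1)=0.8, P(x \mid Y=2)=0.2.
\hat{P}(Y=1 \mid x)
  = \frac{0.5 \cdot 0.8}{0.5 \cdot 0.8 + 0.5 \cdot 0.2}
  = \frac{0.4}{0.5} = 0.8
% Since 0.8 > 0.2, the observation is assigned to class 1.
```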

## Algorithms

• If you specify `'DistributionNames','mn'` when training `Mdl` using `fitcnb`, then the software fits a multinomial distribution using the bag-of-tokens model. The software stores the probability that token `j` appears in class `k` in the property `DistributionParameters{k,j}`. Using additive smoothing [2], the estimated probability is

$P\left(\text{token }j\mid\text{class }k\right)=\frac{1+c_{j|k}}{P+c_{k}},$

where:

• $c_{j|k}=n_{k}\frac{\sum_{i:\,y_{i}\in\text{class }k}x_{ij}w_{i}}{\sum_{i:\,y_{i}\in\text{class }k}w_{i}},$ which is the weighted number of occurrences of token j in class k.

• nk is the number of observations in class k.

• wi is the weight for observation i. The software normalizes weights within a class such that they sum to the prior probability for that class.

• $c_{k}=\sum_{j=1}^{P}c_{j|k},$ which is the total weighted number of occurrences of all tokens in class k.

• If you specify `'DistributionNames','mvmn'` when training `Mdl` using `fitcnb`, then:

1. For each predictor, the software collects a list of the unique levels, stores the sorted list in `CategoricalLevels`, and considers each level a bin. Each predictor/class combination is a separate, independent multinomial random variable.

2. For predictor `j` in class k, the software counts instances of each categorical level using the list stored in `CategoricalLevels{j}`.

3. The software stores the probability that predictor `j`, in class `k`, has level L in the property `DistributionParameters{k,j}`, for all levels in `CategoricalLevels{j}`. Using additive smoothing [2], the estimated probability is

$P\left(\text{predictor }j=L\mid\text{class }k\right)=\frac{1+m_{j|k}\left(L\right)}{m_{j}+m_{k}},$

where:

• $m_{j|k}\left(L\right)=n_{k}\frac{\sum_{i:\,y_{i}\in\text{class }k}I\left\{x_{ij}=L\right\}w_{i}}{\sum_{i:\,y_{i}\in\text{class }k}w_{i}},$ which is the weighted number of observations for which predictor j equals L in class k.

• nk is the number of observations in class k.

• $I\left\{x_{ij}=L\right\}=1$ if xij = L, 0 otherwise.

• wi is the weight for observation i. The software normalizes weights within a class such that they sum to the prior probability for that class.

• mj is the number of distinct levels in predictor j.

• mk is the weighted number of observations in class k.

## References

[1] Hastie, T., R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition. NY: Springer, 2008.

[2] Manning, C. D., P. Raghavan, and M. Schütze. Introduction to Information Retrieval, NY: Cambridge University Press, 2008.