rougeEvaluationScore

Evaluate translation or summarization with ROUGE similarity score

Syntax

score = rougeEvaluationScore(candidate,references)

score = rougeEvaluationScore(candidate,references,Name,Value)

Description

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scoring algorithm evaluates the similarity between a candidate document and a collection of reference documents. Use the ROUGE score to evaluate the quality of document translation and summarization models.

score = rougeEvaluationScore(candidate,references) returns the ROUGE score between the specified candidate document and the reference documents. The function, by default, computes unigram overlaps between candidate and references. This is also known as the ROUGE-N metric with n-gram length 1. For more information, see ROUGE Score.

score = rougeEvaluationScore(candidate,references,Name,Value) specifies additional options using one or more name-value pairs.

Examples

Evaluate Similarity

Specify the candidate document as a tokenizedDocument object.

str = "the fast brown fox jumped over the lazy dog";
candidate = tokenizedDocument(str)

candidate = 
  tokenizedDocument:

   9 tokens: the fast brown fox jumped over the lazy dog

Specify the reference documents as a tokenizedDocument array.

str = [
    "the quick brown animal jumped over the lazy dog"
    "the quick brown fox jumped over the lazy dog"];
references = tokenizedDocument(str)

references = 
  2×1 tokenizedDocument:

    9 tokens: the quick brown animal jumped over the lazy dog
    9 tokens: the quick brown fox jumped over the lazy dog

Calculate the ROUGE score between the candidate document and the reference documents.

score = rougeEvaluationScore(candidate,references)

score = 
0.8889

Specify N-Gram Lengths

Specify the candidate document as a tokenizedDocument object.

str = "a simple summary document containing some words";
candidate = tokenizedDocument(str)

candidate = 
  tokenizedDocument:

   7 tokens: a simple summary document containing some words

Specify the reference documents as a tokenizedDocument array.

str = [
    "a simple document"
    "another document with some words"];
references = tokenizedDocument(str)

references = 
  2×1 tokenizedDocument:

    3 tokens: a simple document
    5 tokens: another document with some words

Calculate the ROUGE score between the candidate document and the reference documents using the default options.

score = rougeEvaluationScore(candidate,references)

score = 
1

The rougeEvaluationScore function, by default, compares unigram (single-token) overlaps between the candidate document and the reference documents. Because the ROUGE score is a recall-based measure, if one of the reference documents is made up entirely of unigrams that appear in the candidate document, the resulting ROUGE score is one. In this scenario, the output of the rougeEvaluationScore function is uninformative.
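The following sketch checks this by hand in base MATLAB. It assumes whitespace tokenization with strsplit, which may differ from what tokenizedDocument does in general, and it is only an illustration of unigram recall, not the function's implementation.

```matlab
% Assumption: whitespace tokenization stands in for tokenizedDocument.
candidate  = strsplit("a simple summary document containing some words");
reference1 = strsplit("a simple document");
reference2 = strsplit("another document with some words");

% Fraction of reference unigrams that also occur in the candidate.
% (No word repeats in these references, so a membership test is enough.)
recall1 = mean(ismember(reference1,candidate));
recall2 = mean(ismember(reference2,candidate));

% The best-matching reference determines the overall score.
score = max(recall1,recall2);
```

Every word of the first reference appears in the candidate, so recall1 is 1 and the maximum over the references is also 1, regardless of how poorly the second reference matches.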

For a more meaningful result, calculate the ROUGE score again using bigrams by setting the 'NgramLength' option to 2. The resulting score is less than one, since every reference document contains bigrams that do not appear in the candidate document.

score = rougeEvaluationScore(candidate,references,'NgramLength',2)

score = 
0.5000

Input Arguments

`candidate` — Candidate document
`tokenizedDocument` scalar | string array | cell array of character vectors

Candidate document, specified as a tokenizedDocument scalar, a string array, or a cell array of character vectors. If candidate is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.

`references` — Reference documents
`tokenizedDocument` array | string array | cell array of character vectors

Reference documents, specified as a tokenizedDocument array, a string array, or a cell array of character vectors. If references is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To evaluate against multiple reference documents, use a tokenizedDocument array.
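As a sketch of the non-tokenizedDocument input forms described above, the following passes single documents as row vectors of words. It assumes Text Analytics Toolbox is installed; no output is shown because it has not been verified here.

```matlab
% Each element is one word, and each input is a row vector (one document).
candidate = ["the" "fast" "brown" "fox" "jumped" "over" "the" "lazy" "dog"];
reference = ["the" "quick" "brown" "fox" "jumped" "over" "the" "lazy" "dog"];

% String array inputs.
score = rougeEvaluationScore(candidate,reference);

% Cell arrays of character vectors holding the same words.
scoreCell = rougeEvaluationScore(cellstr(candidate),cellstr(reference));
```

Because these forms represent only one document, a comparison against several references still requires a tokenizedDocument array.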

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: scores = rougeEvaluationScore(candidate,references,'ROUGEMethod','weighted-subsequences') specifies to use the weighted subsequences ROUGE method.

`ROUGEMethod` — ROUGE method
`'n-grams'` (default) | `'longest-common-subsequences'` | `'weighted-subsequences'` | `'skip-bigrams'` | `'skip-bigrams-and-unigrams'`

ROUGE method, specified as the comma-separated pair consisting of 'ROUGEMethod' and one of the following:

'n-grams' – Evaluate the ROUGE score using n-gram overlaps between the candidate document and the reference documents. This is also known as the ROUGE-N metric.
'longest-common-subsequences' – Evaluate the ROUGE score using Longest Common Subsequence (LCS) statistics. This is also known as the ROUGE-L metric.
'weighted-subsequences' – Evaluate the ROUGE score using weighted longest common subsequence statistics. This method favors consecutive LCSs. This is also known as the ROUGE-W metric.
'skip-bigrams' – Evaluate the ROUGE score using skip-bigram (any pair of words in sentence order) co-occurrence statistics. This is also known as the ROUGE-S metric.
'skip-bigrams-and-unigrams' – Evaluate the ROUGE score using skip-bigram and unigram co-occurrence statistics. This is also known as the ROUGE-SU metric.
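To compare the methods listed above on the same data, a loop such as the following can be used. This is a sketch that assumes Text Analytics Toolbox is installed; the variable names rougeMethods and scores are chosen for illustration, and no outputs are shown because they have not been verified here.

```matlab
candidate = tokenizedDocument("the fast brown fox jumped over the lazy dog");
references = tokenizedDocument([
    "the quick brown animal jumped over the lazy dog"
    "the quick brown fox jumped over the lazy dog"]);

% The five documented values of the 'ROUGEMethod' option.
rougeMethods = {'n-grams','longest-common-subsequences', ...
    'weighted-subsequences','skip-bigrams','skip-bigrams-and-unigrams'};

scores = zeros(1,numel(rougeMethods));
for i = 1:numel(rougeMethods)
    scores(i) = rougeEvaluationScore(candidate,references, ...
        'ROUGEMethod',rougeMethods{i});
end
```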

`NgramLength` — N-gram length
1 (default) | positive integer

N-gram length used for the 'n-grams' ROUGE method (ROUGE-N), specified as the comma-separated pair consisting of 'NgramLength' and a positive integer.

If the 'ROUGEMethod' option is not 'n-grams', then the 'NgramLength' option has no effect.

Tip

If the longest document in references has fewer than NgramLength words, then the resulting ROUGE score is NaN. If candidate has fewer than NgramLength words, then the resulting ROUGE score is zero. To ensure that rougeEvaluationScore returns nonzero scores for very short documents, set NgramLength to a positive integer smaller than the length of candidate and the length of the longest document in references.
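One way to follow this tip is to cap the requested n-gram length by the document lengths before calling the function. The sketch below assumes Text Analytics Toolbox is installed and uses its doclength function; the names desiredN and maxLen are hypothetical.

```matlab
candidate = tokenizedDocument("a simple summary document containing some words");
references = tokenizedDocument([
    "a simple document"
    "another document with some words"]);

desiredN = 6;   % hypothetical requested n-gram length

% Shorter of: candidate length, length of the longest reference.
maxLen = min(doclength(candidate),max(doclength(references)));

% Keep n strictly below that length, but never below 1.
n = max(1,min(desiredN,maxLen - 1));

score = rougeEvaluationScore(candidate,references,'NgramLength',n);
```

Here the candidate has 7 tokens and the longest reference has 5, so the requested length of 6 is reduced to 4.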

`SkipDistance` — Skip distance
4 (default) | positive integer

Skip distance used for the 'skip-bigrams' and 'skip-bigrams-and-unigrams' ROUGE methods (ROUGE-S and ROUGE-SU), specified as the comma-separated pair consisting of 'SkipDistance' and a positive integer.

If the 'ROUGEMethod' option is not 'skip-bigrams' or 'skip-bigrams-and-unigrams', then the 'SkipDistance' option has no effect.

Output Arguments

`score` — ROUGE score
scalar

ROUGE score, returned as a scalar value in the range [0,1] or NaN.

A ROUGE score close to zero indicates poor similarity between candidate and references. A ROUGE score close to one indicates strong similarity between candidate and references. If candidate is identical to one of the reference documents, then score is 1. If candidate and references are both empty documents, then the resulting ROUGE score is NaN.

Algorithms

ROUGE Score

The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scoring algorithm [1] calculates the similarity between a candidate document and a collection of reference documents. Use the ROUGE score to evaluate the quality of document translation and summarization models.

N-gram Co-Occurrence Statistics (ROUGE-N)

Given an n-gram length n, the ROUGE-N metric between a candidate document and a single reference document is given by

${ROUGE-N}_{single} (candidate, reference) = \frac{\sum_{r_{i} \in reference} \sum_{n-gram \in r_{i}} Count (n-gram, candidate)}{\sum_{r_{i} \in reference} numNgrams (r_{i})},$

where the elements $r_{i}$ are sentences in the reference document, $Count (n-gram, candidate)$ is the number of times the specified n-gram occurs in the candidate document, and $numNgrams(r_{i})$ is the number of n-grams in the specified reference sentence $r_{i}$.

For sets of multiple reference documents, the ROUGE-N metric is given by

$ROUGE-N(candidate, references) = \max_{k} {ROUGE-N}_{single} (candidate, references_{k}) .$

To use the ROUGE-N metric, set the 'ROUGEMethod' option to 'n-grams'.
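The ROUGE-N formulas can be sketched in base MATLAB as follows. This is an illustration under stated assumptions, not the function's implementation: words are split on whitespace with strsplit, each document is treated as a single sentence, and each match count is clipped at the number of times the n-gram occurs in the reference so that the score cannot exceed one.

```matlab
n = 2;   % n-gram length
docs = {
    strsplit("a simple summary document containing some words")   % candidate
    strsplit("a simple document")                                 % reference 1
    strsplit("another document with some words")};                % reference 2

% Build the n-grams of every document by joining n shifted copies of its words.
ngrams = cell(size(docs));
for d = 1:numel(docs)
    w = docs{d};
    m = numel(w);
    g = w(1:m-n+1);
    for t = 2:n
        g = g + " " + w(t:m-n+t);
    end
    ngrams{d} = g;
end

candNgrams = ngrams{1};
singleScores = zeros(numel(docs)-1,1);
for k = 2:numel(docs)
    refNgrams = ngrams{k};
    u = unique(refNgrams);
    matches = 0;
    for j = 1:numel(u)
        % Assumption: clip each count at its number of occurrences in the reference.
        matches = matches + min(sum(refNgrams == u(j)),sum(candNgrams == u(j)));
    end
    singleScores(k-1) = matches/numel(refNgrams);   % single-reference score
end

score = max(singleScores);   % maximum over the references
```

For these documents the single-reference scores are 1/2 and 1/4, so the sketch returns 0.5, which agrees with the bigram example above. With n = 1 and the documents from the first example, the same steps give 8/9, or approximately 0.8889.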

Longest Common Subsequence (ROUGE-L)

Given a sentence $d = [w_{1}, \dots, w_{m}]$ and a sentence $s = [s_{1}, \dots, s_{n}]$, where the elements $w_{i}$ and $s_{i}$ correspond to words, the subsequence $[w_{i_{1}}, \dots, w_{i_{k}}]$ is a common subsequence of d and s if $w_{i_{j}} \in \{s_{1}, \dots, s_{n}\}$ for $j = 1, \dots, k$ and $i_{1} < \dots < i_{k}$, where k is the length of the subsequence. The subsequence $[w_{i_{1}}, \dots, w_{i_{k}}]$ is a longest common subsequence (LCS) if the subsequence length k is maximal.

Given a candidate document and a single reference document, the union of the longest common subsequences is given by

$LCS_{\cup} (candidate, reference) = \bigcup_{r_{i} \in reference} \{ w \mid w \in LCS (candidate, r_{i}) \},$

where $LCS (candidate, r_{i})$ is the set of longest common subsequences in the candidate document and the sentence r_i from a reference document.

The ROUGE-L metric is an F-score measure. To calculate it, first calculate the recall and precision scores given by

$R_{lcs} (candidate, reference) = \frac{\sum_{r_{i} \in reference} | LCS_{\cup} (candidate, r_{i}) |}{numWords (reference)}$

$P_{lcs} (candidate, reference) = \frac{\sum_{r_{i} \in reference} | LCS_{\cup} (candidate, r_{i}) |}{numWords (candidate)} .$

Then, the ROUGE-L metric between a candidate document and a single reference document is given by the F-score measure

${ROUGE-L}_{single} (candidate, reference) = \frac{(1 + β^{2}) R_{lcs} (candidate, reference) P_{lcs} (candidate, reference)}{R_{lcs} (candidate, reference) + β^{2} P_{lcs} (candidate, reference)},$

where the parameter $β$ controls the relative importance of the precision and recall. Because the ROUGE score favors recall, $β$ is typically set to a high value.

For sets of multiple reference documents, the ROUGE-L metric is given by

$ROUGE-L(candidate, references) = \max_{k} {ROUGE-L}_{single} (candidate, references_{k}) .$

To use the ROUGE-L metric, set the 'ROUGEMethod' option to 'longest-common-subsequences'.
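For documents that consist of a single sentence, the size of the LCS union reduces to the length of one longest common subsequence, which a standard dynamic program can compute. The base MATLAB sketch below assumes whitespace tokenization and uses beta = 1 purely for illustration; the value of β that rougeEvaluationScore uses is not stated here.

```matlab
candidate = strsplit("a simple summary document containing some words");
reference = strsplit("a simple document");
beta = 1;   % illustrative value only (assumption)

m = numel(reference);
n = numel(candidate);

% L(i+1,j+1) holds the LCS length of reference(1:i) and candidate(1:j).
L = zeros(m+1,n+1);
for i = 1:m
    for j = 1:n
        if reference(i) == candidate(j)
            L(i+1,j+1) = L(i,j) + 1;
        else
            L(i+1,j+1) = max(L(i,j+1),L(i+1,j));
        end
    end
end
lcsLength = L(m+1,n+1);

recall = lcsLength/m;       % LCS length relative to the reference length
precision = lcsLength/n;    % LCS length relative to the candidate length
score = (1 + beta^2)*recall*precision/(recall + beta^2*precision);
```

Here the LCS is "a simple document" (length 3), so the recall is 1, the precision is 3/7, and the F-score with beta = 1 is 0.6. Larger values of beta move the score toward the recall.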

Weighted Longest Common Subsequence (ROUGE-W)

Given a weighting function f such that f has the property f(x+y)>f(x)+f(y) for any positive integers x and y, define $WLCS (candidate, reference)$ to be the length of the longest consecutive matches encountered in the candidate document and a single reference document scored by the weighting function f. For more information about calculating this value, see [1].

The ROUGE-W metric is an F-score measure, which requires the recall and precision scores given by

$R_{wlcs} (candidate, reference) = f^{-1} \left( \frac{WLCS (candidate, reference)}{f (numWords (reference))} \right)$

$P_{wlcs} (candidate, reference) = f^{-1} \left( \frac{WLCS (candidate, reference)}{f (numWords (candidate))} \right) .$

The ROUGE-W metric between a candidate document and a single reference document is given by the F-score measure

${ROUGE-W}_{single} (candidate, reference) = \frac{(1 + β^{2}) R_{wlcs} (candidate, reference) P_{wlcs} (candidate, reference)}{R_{wlcs} (candidate, reference) + β^{2} P_{wlcs} (candidate, reference)},$

where the parameter $β$ controls the relative importance of the precision and recall. Because the ROUGE score favors recall, $β$ is typically set to a high value.

For multiple reference documents, the ROUGE-W metric is given by

$ROUGE-W(candidate, references) = \max_{k} {ROUGE-W}_{single} (candidate, references_{k}) .$

To use the ROUGE-W metric, set the 'ROUGEMethod' option to 'weighted-subsequences'.
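The effect of the weighting function can be sketched in base MATLAB. The recurrence below is intended to follow the dynamic program outlined in [1], and the choice f(k) = k^2 is an assumption made for illustration; neither should be read as the function's actual implementation. The letter sequences are toy data.

```matlab
f    = @(k) k.^2;      % assumed weighting function, f(x+y) > f(x) + f(y)
finv = @(x) sqrt(x);   % inverse of f

X  = ["A" "B" "C" "D" "E" "F" "G"];   % reference
Y1 = ["A" "B" "C" "D" "H" "I" "K"];   % candidate with four consecutive matches
Y2 = ["A" "H" "B" "K" "C" "I" "D"];   % candidate with four scattered matches
candidates = {Y1,Y2};

recall = zeros(1,numel(candidates));
for c = 1:numel(candidates)
    Y = candidates{c};
    m = numel(X);
    n = numel(Y);
    wlcs   = zeros(m+1,n+1);   % weighted LCS score table
    runLen = zeros(m+1,n+1);   % length of the consecutive match ending at each cell
    for i = 1:m
        for j = 1:n
            if X(i) == Y(j)
                k = runLen(i,j);
                % Extending a run of length k earns f(k+1) - f(k).
                wlcs(i+1,j+1) = wlcs(i,j) + f(k+1) - f(k);
                runLen(i+1,j+1) = k + 1;
            else
                % A mismatch carries the best score forward and resets the run.
                wlcs(i+1,j+1) = max(wlcs(i,j+1),wlcs(i+1,j));
            end
        end
    end
    recall(c) = finv(wlcs(m+1,n+1)/f(m));
end
```

Both candidates share an ordinary LCS of length 4 with the reference, but the weighted scores are 16 and 4, giving recalls of 4/7 and 2/7. The candidate whose matches are consecutive is therefore favored.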

Skip-Bigram Co-Occurrence Statistics (ROUGE-S)

A skip-bigram is an ordered pair of words in a sentence allowing for arbitrary gaps between them. That is, given a sentence $c_{i} = [c_{i1}, \dots, c_{im}]$ from a candidate document, where the elements $c_{ij}$ correspond to the words in the sentence, the pair of words $[c_{ij_{1}}, c_{ij_{2}}]$ is a skip-bigram if $j_{1} < j_{2}$.

The ROUGE-S metric is an F-score measure. To calculate it, first calculate the recall and precision scores given by

$R_{skip2} (candidate, reference) = \frac{\sum_{r_{i} \in reference} \sum_{skip-bigram \in r_{i}} Count (skip-bigram, candidate)}{\sum_{r_{i} \in reference} numSkipBigrams (r_{i})}$

$P_{skip2} (candidate, reference) = \frac{\sum_{r_{i} \in reference} \sum_{skip-bigram \in r_{i}} Count (skip-bigram, candidate)}{\sum_{c_{i} \in candidate} numSkipBigrams (c_{i})},$

where the elements r_i and c_i are sentences in the reference document and candidate document, respectively, $Count (skip-bigram, candidate)$ is the number of times the specified skip-bigram occurs in the candidate document, and numSkipBigrams(s) is the number of skip-bigrams in the sentence s.

Then, the ROUGE-S metric between a candidate document and a single reference document is given by the F-score measure

${ROUGE-S}_{single} (candidate, reference) = \frac{(1 + β^{2}) R_{skip2} (candidate, reference) P_{skip2} (candidate, reference)}{R_{skip2} (candidate, reference) + β^{2} P_{skip2} (candidate, reference)},$

where the parameter $β$ controls the relative importance of the precision and recall.

For sets of multiple reference documents, the ROUGE-S metric is given by

$ROUGE-S(candidate, references) = \max_{k} {ROUGE-S}_{single} (candidate, references_{k}) .$

To use the ROUGE-S metric, set the 'ROUGEMethod' option to 'skip-bigrams'.
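The skip-bigram recall, precision, and F-score can be sketched in base MATLAB as follows. The sketch assumes whitespace tokenization, single-sentence documents, clipped match counts, and beta = 1 for illustration. It places no limit on the gap between the two words, so it does not model the 'SkipDistance' option.

```matlab
candidate = strsplit("a simple summary document containing some words");
reference = strsplit("a simple document");
beta = 1;   % illustrative value only (assumption)

% All index pairs (j1,j2) with j1 < j2 give the skip-bigrams of a sentence.
ci = nchoosek(1:numel(candidate),2);
ri = nchoosek(1:numel(reference),2);
candPairs = candidate(ci(:,1)) + " " + candidate(ci(:,2));
refPairs  = reference(ri(:,1)) + " " + reference(ri(:,2));

% Count reference skip-bigrams that occur in the candidate, clipping each
% count at its number of occurrences in the reference (assumption).
u = unique(refPairs);
matches = 0;
for j = 1:numel(u)
    matches = matches + min(sum(refPairs == u(j)),sum(candPairs == u(j)));
end

recall = matches/numel(refPairs);
precision = matches/numel(candPairs);
score = (1 + beta^2)*recall*precision/(recall + beta^2*precision);
```

The reference has 3 skip-bigrams, all of which occur among the 21 skip-bigrams of the candidate, so the recall is 1, the precision is 1/7, and the F-score with beta = 1 is 0.25.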

Skip-Bigram and Unigram Co-Occurrence Statistics (ROUGE-SU)

To also include unigram co-occurrence statistics in the ROUGE-S metric, introduce unigram counts into the recall and precision scores for ROUGE-S. This is equivalent to including start tokens in the candidate and reference documents, since

$\sum_{skip-bigram \in r_{i}} Count (skip-bigram, candidate) + \sum_{unigram \in r_{i}} Count (unigram, candidate) = \sum_{skip-bigram \in r_{i}^{+}} Count (skip-bigram, {candidate}^{+}),$

where Count(unigram,candidate) is the number of times the specified unigram appears in the candidate document, and $r_{i}^{+}$ and ${candidate}^{+}$ denote the reference sentence and the candidate document augmented with start tokens, respectively.

For sets of multiple reference documents, the ROUGE-SU metric is given by

$ROUGE-SU(candidate, references) = \max_{k} {ROUGE-S}_{single} ({candidate}^{+}, references_{k}^{+}),$

where $references_{k}^{+}$ is the k-th reference document with sentences augmented with start tokens.

To use the ROUGE-SU metric, set the 'ROUGEMethod' option to 'skip-bigrams-and-unigrams'.
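The start-token equivalence used in this section can be checked numerically in base MATLAB. The sketch assumes whitespace tokenization and single-sentence documents, and the marker "<s>" is a hypothetical start token chosen for illustration.

```matlab
candidate = strsplit("a simple summary document containing some words");
reference = strsplit("a simple document");
startToken = "<s>";   % hypothetical start-token marker

% Row 1: original sentences. Row 2: sentences augmented with a start token.
variants = {candidate,reference; [startToken candidate],[startToken reference]};

pairMatches = zeros(2,1);
for v = 1:2
    c = variants{v,1};
    r = variants{v,2};
    ci = nchoosek(1:numel(c),2);
    ri = nchoosek(1:numel(r),2);
    candPairs = c(ci(:,1)) + " " + c(ci(:,2));
    refPairs  = r(ri(:,1)) + " " + r(ri(:,2));
    % No skip-bigram repeats in these sentences, so a membership test is enough.
    pairMatches(v) = sum(ismember(refPairs,candPairs));
end

unigramMatches = sum(ismember(reference,candidate));
```

For these sentences, the 3 skip-bigram matches plus the 3 unigram matches on the original text equal the 6 skip-bigram matches on the augmented text, because each pair of the form "<s> word" stands in for one unigram.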

References

[1] Lin, Chin-Yew. "Rouge: A package for automatic evaluation of summaries." In Text Summarization Branches Out, pp. 74-81. 2004.

Version History

Introduced in R2020a
