
mmrScores

Document scoring with Maximal Marginal Relevance (MMR) algorithm

Since R2020a

Description


scores = mmrScores(documents,queries) scores documents according to their relevance to the query documents queries, avoiding redundancy, using the MMR algorithm. The score in scores(i,j) is the MMR score of documents(i) relative to queries(j).

scores = mmrScores(bag,queries) scores documents encoded by the bag-of-words or bag-of-n-grams model bag relative to queries. The score in scores(i,j) is the MMR score of the ith document in bag relative to queries(j).

scores = mmrScores(___,lambda) also specifies the trade-off between relevance and redundancy.

Examples


Create an array of input documents.

str = [
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"];
documents = tokenizedDocument(str)
documents = 
  4x1 tokenizedDocument:

    9 tokens: the quick brown fox jumped over the lazy dog
    8 tokens: the fast fox jumped over the lazy dog
    7 tokens: the dog sat there and did nothing
    6 tokens: the other animals sat there watching

Create an array of query documents.

str = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
queries = tokenizedDocument(str)
queries = 
  2x1 tokenizedDocument:

    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

Calculate MMR scores using the mmrScores function. The output is a sparse matrix.

scores = mmrScores(documents,queries);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")

Higher scores correspond to stronger relevance to the query documents.

Create an array of input documents.

str = [
    "the quick brown fox jumped over the lazy dog"
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"
    "the other animals sat there watching"];
documents = tokenizedDocument(str);

Create a bag-of-words model from the input documents.

bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:

          Counts: [6x17 double]
      Vocabulary: ["the"    "quick"    "brown"    "fox"    "jumped"    "over"    "lazy"    "dog"    "fast"    "sat"    "there"    "and"    "did"    "nothing"    "other"    "animals"    "watching"]
        NumWords: 17
    NumDocuments: 6

Create an array of query documents.

str = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
queries = tokenizedDocument(str)
queries = 
  2x1 tokenizedDocument:

    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

Calculate the MMR scores. The output is a sparse matrix.

scores = mmrScores(bag,queries);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores")

Now calculate the scores again, and set the lambda value to 0.01. When the lambda value is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores.

lambda = 0.01;
scores = mmrScores(bag,queries,lambda);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)

Finally, calculate the scores again and set the lambda value to 1. When the lambda value is 1, the scores measure relevance to the query only, so redundant documents are not penalized.

lambda = 1;
scores = mmrScores(bag,queries,lambda);

Visualize the MMR scores in a heat map.

figure
heatmap(scores);
xlabel("Query Document")
ylabel("Input Document")
title("MMR Scores, lambda = " + lambda)

Input Arguments


documents — Input documents

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.

bag — Input model

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

queries — Set of query documents

Set of query documents, specified as one of the following:

  • A tokenizedDocument array

  • A 1-by-N string array representing a single document, where each element is a word

  • A 1-by-N cell array of character vectors representing a single document, where each element is a word

To compute term frequency and inverse document frequency statistics, the function encodes queries using a bag-of-words model. The model it uses depends on the syntax you call it with. If your syntax specifies the input argument documents, then the function uses bagOfWords(documents). If your syntax specifies bag, then the function encodes queries using bag and then uses the resulting tf-idf matrix.
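As a rough illustration of that encoding step, here is a minimal Python sketch. It assumes the encoding simply counts occurrences of each vocabulary word and drops out-of-vocabulary tokens; it is not the toolbox implementation.

```python
def encode_counts(tokens, vocabulary):
    """Count how often each vocabulary word occurs in a tokenized document.
    Tokens outside the vocabulary are ignored."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = [0] * len(vocabulary)
    for token in tokens:
        if token in index:
            counts[index[token]] += 1
    return counts

vocab = ["the", "fox", "dog", "lazy"]
query_counts = encode_counts("a brown fox leaped over the lazy dog".split(), vocab)
```

Here query_counts is [1, 1, 1, 1]: each vocabulary word appears once in the query, and out-of-vocabulary tokens such as "leaped" are dropped.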

lambda — Trade-off between relevance and redundancy

Trade-off between relevance and redundancy, specified as a nonnegative scalar.

When lambda is close to 0, redundant documents yield lower scores and diverse (but less query-relevant) documents yield higher scores. When lambda is 1, the scores measure relevance to the query only, so redundant documents are not penalized.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
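To make the effect of lambda concrete, here is a minimal Python sketch of MMR-style scoring. It is an illustration of the algorithm, not the toolbox implementation: it assumes cosine similarity for both the relevance and redundancy terms, with redundancy taken as the maximum similarity to any other input document.

```python
import numpy as np

def mmr_scores(doc_vecs, query_vec, lam):
    """MMR-style score for each row of doc_vecs against query_vec:
    lam * sim(doc, query) - (1 - lam) * max over other docs of sim(doc, other)."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    n = doc_vecs.shape[0]
    relevance = np.array([cosine(d, query_vec) for d in doc_vecs])
    scores = np.empty(n)
    for i in range(n):
        # Redundancy: highest similarity to any *other* input document.
        redundancy = max(cosine(doc_vecs[i], doc_vecs[j])
                         for j in range(n) if j != i)
        scores[i] = lam * relevance[i] - (1 - lam) * redundancy
    return scores

# Two identical documents plus one distinct document.
docs = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
query = np.array([1.0, 0.0])

low = mmr_scores(docs, query, 0.01)   # lambda near 0: redundancy dominates
high = mmr_scores(docs, query, 1.0)   # lambda equal to 1: relevance only
```

With lambda near 0, the duplicated documents score below the distinct one; with lambda equal to 1, the query-relevant duplicates score highest.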

Output Arguments


scores — MMR scores

MMR scores, returned as an N1-by-N2 matrix, where scores(i,j) is the MMR score of documents(i) relative to the jth query document, and N1 and N2 are the number of input documents and query documents, respectively.

A document has a high MMR score if it is both relevant to the query and has minimal similarity relative to the other documents.
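The underlying criterion is the MMR formulation of Carbonell and Goldstein [1]. Sketched in that paper's notation (the toolbox's exact choice of similarity measures is not spelled out here), the score for input document $D_i$ against query $Q$ is:

$$\mathrm{MMR}(D_i) = \lambda\,\mathrm{Sim}_1(D_i, Q) \;-\; (1-\lambda)\,\max_{j \neq i} \mathrm{Sim}_2(D_i, D_j)$$

where $\mathrm{Sim}_1$ measures relevance to the query, $\mathrm{Sim}_2$ measures similarity between input documents, and $\lambda$ is the trade-off specified by lambda.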

References

[1] Carbonell, Jaime G., and Jade Goldstein. "The use of MMR, diversity-based reranking for reordering documents and producing summaries." In SIGIR, vol. 98, pp. 335-336. 1998.

Version History

Introduced in R2020a