Cosine Similarity using BERT

10 visualizzazioni (ultimi 30 giorni)
Nicholas Ang
Nicholas Ang il 30 Giu 2021
Commentato: Nicholas Ang il 30 Giu 2021
I am using BERT to calculate similarities in Question Answering. I have encoded my Question data using
data.Tokens = encode(mdl.Tokenizer,data.Questions) which returns me a cell array.
Next, I proceeded to encode new text to test the similiarity with the already encoded Questions in the database: testTokens = encode(mdl.Tokenizer,text)
However, I am imable to use the cosineSimilarity(data.Tokens,testTokens) and I receive an error that says:
Input must be a matrix, a tokenizedDocument array, a bagOfWords model, a bagOfNgrams model, a string array of words, or a cell array of character vectors.
Do I need padding here or reshape of my cell vectors?

Risposta accettata

Divyam Gupta
Divyam Gupta il 30 Giu 2021
Hi Nicholas, I notice that you're facing an issue while computing the cosine similarity using a text encoder. As per the documentation mentioned at https://www.mathworks.com/help/textanalytics/ref/cosinesimilarity.html#d123e8335 the cosineSimilarity function takes a matrix to compute the similarity between two documents.
Since the encoded vector sizes for each of the questions is different, constructing a matrix might be difficult. You can do a pairwise comparision between the data.Tokens and the testTokens to compute the similarities. This can be achieved by running a nested loop while simultaneously storing the similarity scores.
Hope this helps.

Più risposte (0)

Prodotti


Release

R2021a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by