Fastest way to find text keywords out of large amount of textual news sentences?
Song Decn
on 24 Jan 2021
Answered: Walter Roberson
on 8 Feb 2021
Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried
tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc
Both take over 0.003 sec to search for one keyword in one news line.
If I want to create a new database with over 20,000 keywords, then it would take
900000 * 20000 * 0.003 / 60 / 60 / 24
over 600 days to do this. :(
Does anyone have an idea how to do this within perhaps one or two days?
Thank you very much
6 Comments
Walter Roberson
on 28 Jan 2021
What do you want to do about substrings, plurals, upper/lowercase, and the other factors I asked about? For example, if the headline were "Elon visits Oak Hammock Marsh", is it acceptable that this would match "Mars"? And "Elon eats musk-melon"? And "Eucre trumps Bridge in recent poll"?
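To make the substring question concrete, here is a minimal sketch comparing a plain substring test with a whole-word regular expression. The variable names are illustrative only, and the optional-"s" pattern is just one possible way to treat plurals, not a recommendation from the thread.

```matlab
headline = "Elon visits Oak Hammock Marsh";

% Plain substring test: "Mars" is found inside "Marsh"
plainHit = contains(headline, "Mars");

% Whole-word test: \< and \> anchor the match to word boundaries
wordHit = ~isempty(regexp(headline, "\<Mars\>", 'once'));

% Case-insensitive whole-word test: "musk" in "musk-melon" still counts as a
% word, because the hyphen is not a word character
caseHit = ~isempty(regexp("Elon eats musk-melon", "\<Musk\>", 'once', 'ignorecase'));

% Plurals: "trumps" is not the whole word "Trump" unless the pattern allows it
strictHit = ~isempty(regexp("Eucre trumps Bridge in recent poll", "\<Trump\>", 'once', 'ignorecase'));
pluralHit = ~isempty(regexp("Eucre trumps Bridge in recent poll", "\<Trumps?\>", 'once', 'ignorecase'));
```

Here `plainHit`, `caseHit` and `pluralHit` come out true, while `wordHit` and `strictHit` come out false, so each of these choices changes which headlines are tagged.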
Accepted Answer
Walter Roberson
on 8 Feb 2021
You can do the search phase efficiently:
S = [ "Elon Musk is the richest man on the planet"
"Elon Musk is the poorest man on Mars"
"Trump is the president of US"
"Elon eats musk-melon"
"Eucre Trumps Bridge in recent poll"
"Trump is the not president of US"]
Tags = ["Musk" "Trump" "Mars"]
numTags = length(Tags);
pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>"
search_results = regexp(S, pattern, 'names')
However, the output is not quite what you want: it lists the tags that were matched in each cell (one cell per sentence), and it needs to be re-arranged to give information about where each tag was found.
% Collect the matched words of each sentence as a string array
tags_matched = cellfun(@(C) string({C.word}), search_results, 'uniform', 0).'
% Option 1: for each tag, find the sentences in which it was matched
TagWasFoundAt = cell(numTags,1);
for K = 1 : numTags
    TagWasFoundAt{K} = find(cellfun(@(C) ismember(Tags{K}, C), tags_matched));
end
[cellstr(Tags(:)), TagWasFoundAt]
% Option 2: build a logical matrix (rows = sentences, columns = tags)
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), search_results, 'uniform', 0));
TagWasFoundAt = arrayfun(@(COL) find(match_bits(:,COL)).', (1:numTags).', 'uniform', 0);
[cellstr(Tags(:)), TagWasFoundAt]
It is likely that there are other ways to do the matching from tags to entries.
The first of those two approaches is probably more efficient, but the match_bits array is useful if you want a single data structure that you can easily query to find out which articles contain a particular tag, or which tags a particular article contains. It is also good for boolean searches, for example finding articles that contain Musk or Mars but not Trump:
(match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)
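As a self-contained sketch of such a query, the block below rebuilds match_bits for the sample data and then looks tags up by name instead of by column number. To keep it short it uses the 'match' option of regexp rather than 'names'. The helper `col` and the variables `found`, `hits` and `tagsInSecond` are hypothetical names introduced here, not part of the code above.

```matlab
S = ["Elon Musk is the richest man on the planet"
     "Elon Musk is the poorest man on Mars"
     "Trump is the president of US"
     "Elon eats musk-melon"
     "Eucre Trumps Bridge in recent poll"
     "Trump is the not president of US"];
Tags = ["Musk" "Trump" "Mars"];

% Whole-word, case-sensitive search for any of the tags
pattern = "\<(" + strjoin(Tags, "|") + ")\>";
found = regexp(S, pattern, 'match');      % one cell per sentence

% Rows = sentences, columns = tags
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string(C)), found, ...
    'UniformOutput', false));

% Hypothetical helper: pick the column of a tag by its name
col = @(t) match_bits(:, Tags == t);

% Sentences that mention Musk or Mars but not Trump
hits = find((col("Musk") | col("Mars")) & ~col("Trump"));

% Tags that occur in the second sentence
tagsInSecond = Tags(match_bits(2, :));
```

For the six sample sentences, `hits` contains 1 and 2, and `tagsInSecond` is ["Musk" "Mars"]. Looking columns up by name avoids hard-coding column numbers, which would be hard to maintain with thousands of tags.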
There might be better ways of doing the matching.
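One possible alternative, offered only as a sketch and not as the method used above, is to split each sentence into words and look the words up with ismember, storing the result as a sparse logical matrix. This assumes every tag is a single word made of word characters, and it keeps the case-sensitive whole-word behaviour of the pattern above. Sparse storage matters at the size in the question: a full logical matrix of 900,000 by 20,000 would need about 18 GB. The names `match_sparse`, `rows` and `cols` are illustrative.

```matlab
S = ["Elon Musk is the richest man on the planet"
     "Elon Musk is the poorest man on Mars"
     "Trump is the president of US"
     "Elon eats musk-melon"
     "Eucre Trumps Bridge in recent poll"
     "Trump is the not president of US"];
Tags = ["Musk" "Trump" "Mars"];
numTags = numel(Tags);
numLines = numel(S);

rows = cell(numLines, 1);
cols = cell(numLines, 1);
for n = 1 : numLines
    % Split sentence n into runs of word characters
    words = regexp(S(n), '\w+', 'match');
    % idx holds the position in Tags of each word that is a tag
    [isTag, idx] = ismember(words, Tags);
    c = unique(idx(isTag));
    rows{n} = repmat(n, numel(c), 1);
    cols{n} = c(:);
end

% Sparse logical matrix: rows = sentences, columns = tags
match_sparse = sparse(vertcat(rows{:}), vertcat(cols{:}), true, numLines, numTags);

% Sentences that contain the tag "Mars"
marsLines = find(match_sparse(:, Tags == "Mars"));
```

On the sample data this yields the same matches as match_bits, and `marsLines` is 2. Whether it is actually faster than one large alternation pattern for 20,000 tags would have to be timed on the real data.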