Fastest way to find keywords in a large number of news sentences?

Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried:
tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc
Both take over 0.003 s to search for one keyword in one news line.
If I want to create a new database with over 20,000 keywords, then it would take
900000 * 20000 * 0.003 / 60 / 60 / 24
over 600 days to do this. :(
Does anyone have an idea how to do this within one or two days?
Thank you very much.
6 Comments
Walter Roberson on 28 Jan 2021
What do you want to do about substrings, plurals, upper/lowercase, and the other factors I asked about? For example, if the headline were "Elon visits Oak Hammock Marsh", is it acceptable that this would match "Mars"? And "Elon eats musk-melon"? And "Eucre trumps Bridge in recent poll"?
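
To make the substring and case questions concrete, here is a minimal sketch (not part of the original comment; the variable names are made up for illustration) contrasting contains, which matches substrings, with regexp patterns anchored by the \< and \> word-boundary operators:

```matlab
headline = "Elon visits Oak Hammock Marsh";

% contains does plain substring matching, so "Mars" is found inside "Marsh".
substringHit = contains(headline, "Mars");

% \< and \> only match at word boundaries, so "Marsh" does not count as "Mars".
wholeWordHit = ~isempty(regexp(headline, "\<Mars\>", 'once'));

% regexp is case-sensitive by default; regexpi ignores case. The hyphen is
% not a word character, so "musk" in "musk-melon" counts as a whole word.
caseHit          = ~isempty(regexpi("Elon eats musk-melon", "\<Musk\>", 'once'));
caseSensitiveHit = ~isempty(regexp("Elon eats musk-melon", "\<Musk\>", 'once'));
```

Here substringHit and caseHit come out true, while wholeWordHit and caseSensitiveHit come out false, so the answers to the questions above decide which variant is appropriate.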


Accepted Answer

Walter Roberson on 8 Feb 2021
You can do the search phase efficiently:
S = [ "Elon Musk is the richest man on the planet"
"Elon Musk is the poorest man on Mars"
"Trump is the president of US"
"Elon eats musk-melon"
"Eucre Trumps Bridge in recent poll"
"Trump is the not president of US"]
S = 6×1 string array
    "Elon Musk is the richest man on the planet"
    "Elon Musk is the poorest man on Mars"
    "Trump is the president of US"
    "Elon eats musk-melon"
    "Eucre Trumps Bridge in recent poll"
    "Trump is the not president of US"
Tags = ["Musk" "Trump" "Mars"]
Tags = 1×3 string array
"Musk" "Trump" "Mars"
numTags = length(Tags);
pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>"
pattern = "\<(?<word>(Musk|Trump|Mars))\>"
search_results = regexp(S, pattern, 'names')
search_results = 6x1 cell array
    {1×1 struct}
    {1×2 struct}
    {1×1 struct}
    {0×0 struct}
    {0×0 struct}
    {1×1 struct}
However, the output is not really what you want: it lists the tags that were matched in each entry, and it needs to be rearranged to give information about where each tag was found.
tags_matched = cellfun(@(C) string({C.word}), search_results, 'uniform', 0).'
tags_matched = 1x6 cell array
{["Musk"]} {1×2 string} {["Trump"]} {0×0 string} {0×0 string} {["Trump"]}
TagWasFoundAt = cell(numTags,1);
for K = 1 : numTags
    TagWasFoundAt{K} = find(cellfun(@(C) ismember(Tags{K}, C), tags_matched));
end
[cellstr(Tags(:)), TagWasFoundAt]
ans = 3x2 cell array
    {'Musk' }    {1×2 double}
    {'Trump'}    {1×2 double}
    {'Mars' }    {[ 2]}
%OR
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), search_results, 'uniform', 0));
TagWasFoundAt = arrayfun(@(COL) find(match_bits(:,COL)).', (1:numTags).', 'uniform', 0);
[cellstr(Tags(:)), TagWasFoundAt]
ans = 3x2 cell array
    {'Musk' }    {1×2 double}
    {'Trump'}    {1×2 double}
    {'Mars' }    {[ 2]}
It is likely that there are other ways to do the matching from tags to entries.
The first of those two is probably more efficient, but the match_bits array would be useful if you wanted a single data structure that you could easily query to find out which articles contain a particular tag, or which tags a particular article contains. The match_bits array is also good for boolean searches, such as finding articles that contain Musk or Mars but not Trump:
(match_bits(:,1) | match_bits(:,3)) & ~match_bits(:,2)
ans = 6x1 logical array
    1
    1
    0
    0
    0
    0
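
The same array also answers the two lookups mentioned above. The following is a sketch that rebuilds match_bits from the example data so that it runs on its own; the names tagsInArticle2, articlesWithTrump and articlesPerTag are illustrative and not part of the original answer:

```matlab
% Example data and pattern, as in the answer above.
S = [ "Elon Musk is the richest man on the planet"
      "Elon Musk is the poorest man on Mars"
      "Trump is the president of US"
      "Elon eats musk-melon"
      "Eucre Trumps Bridge in recent poll"
      "Trump is the not president of US" ];
Tags = ["Musk" "Trump" "Mars"];
pattern = "\<(?<word>(" + strjoin(Tags, "|") + "))\>";
search_results = regexp(S, pattern, 'names');

% One row per article, one column per tag.
match_bits = cell2mat(cellfun(@(C) ismember(Tags, string({C.word})), ...
    search_results, 'UniformOutput', false));

% Which tags does article 2 contain? (row lookup)
tagsInArticle2 = Tags(match_bits(2, :));

% Which articles contain "Trump"? (column lookup)
articlesWithTrump = find(match_bits(:, Tags == "Trump")).';

% How many articles mention each tag? (column sums)
articlesPerTag = sum(match_bits, 1);
```

For the six example headlines this gives the tags Musk and Mars for article 2, articles 3 and 6 for Trump, and counts of 2, 2 and 1 for Musk, Trump and Mars.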
There might be better ways of doing the matching.
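
One such alternative, offered as an untimed sketch rather than a recommendation: if every keyword is a single word made only of letters, digits or underscores (an assumption), each entry can be split into words once and the words compared against the tag list with ismember, instead of building one large alternation pattern. The names words and token_bits are illustrative:

```matlab
S = [ "Elon Musk is the richest man on the planet"
      "Elon Musk is the poorest man on Mars"
      "Trump is the president of US"
      "Elon eats musk-melon"
      "Eucre Trumps Bridge in recent poll"
      "Trump is the not president of US" ];
Tags = ["Musk" "Trump" "Mars"];

% Split every entry into runs of word characters; the result is a 6x1 cell
% array in which each cell holds the words of one entry as a string array.
words = regexp(S, '\w+', 'match');

% For each entry, mark which tags appear among its words (case-sensitive,
% whole words only). Rows are entries, columns are tags.
token_bits = cell2mat(cellfun(@(w) ismember(Tags, w), words, ...
    'UniformOutput', false));
```

On the example data token_bits has the same layout as match_bits (rows are entries, columns are tags). Whether this is faster than the regexp version for 900,000 lines and 20,000 keywords would have to be measured with tic/toc on the real data.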
