Fastest way to find keywords in a large number of news sentences?
Hello, I have a database containing over 900,000 lines of news, and I want to scan these lines of text for certain keywords. I tried
tic; strfind(newsDb.SingleNewline, kws{1}); toc
tic; contains(newsDb.SingleNewline, kws{1}); toc
Both take over 0.003 sec to search for one keyword in one news line.
If I want to create a new database with over 20,000 keywords, then it would take
900000 * 20000 * 0.003 / 60 / 60 / 24
over 600 days to do this. :(
Does anyone have an idea how to do this within perhaps one or two days?
Thank you very much
6 Comments
Ive J
on 24 Jan 2021
Can't you just load all the contents into memory? Then you don't need to compare line by line, which would be much faster. What's your database format?
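To illustrate the in-memory suggestion, here is a minimal sketch in which `contains` is called once per keyword on the whole column of news lines instead of once per line. The sample `news` and `kws` values are hypothetical stand-ins for `newsDb.SingleNewline` and the keyword list, the sketch assumes the lines fit in memory as a string array, and the `'IgnoreCase'` option is an added assumption about the desired matching. How much time this saves on the real data has not been measured here.

```matlab
% Hypothetical sample data standing in for newsDb.SingleNewline and kws
news = ["Oil prices rise again"; "Tech stocks fall"; "Oil and tech lead the rally"];
kws  = ["Oil", "tech"];

% hits(i,k) is true when news line i contains keyword k as a substring
hits = false(numel(news), numel(kws));
for k = 1:numel(kws)
    % One vectorized call scans every news line for keyword k
    hits(:, k) = contains(news, kws(k), 'IgnoreCase', true);
end
```

With 20,000 keywords the logical matrix `hits` would be 900,000 by 20,000, so in practice it may be necessary to store it as `sparse` or to process the keywords in batches.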
Walter Roberson
on 24 Jan 2021
You have not defined your desired output. Is it:
- for each different keyword, a list of the positions that the word occurs at, for each different line, for each news article?
- for each news article, a list of all of the keywords found in it?
- for each news article, a list per line of all of the keywords found on the line?
- for each keyword, a list of all of the news articles the keyword was found in?
Because if what you really want to know is which keywords were found in each news article, or if you just want to know which news articles match at least one keyword, then there are more efficient ways.
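The comment does not say which more efficient ways are meant. One possibility, shown here only as a sketch, is to split each article into words once and then test all keywords with a single set-membership call, so the cost per article no longer grows with one text scan per keyword. The sample data is hypothetical, and the tokenization (lowercase ASCII letters only, whole-word matches) is an assumption that would need adjusting for real news text.

```matlab
% Hypothetical sample data
news = ["Oil prices rise again"; "Tech stocks fall as oil climbs"; "Local bus strike ends"];
kws  = ["oil", "tech", "bus"];

matched = cell(numel(news), 1);
for i = 1:numel(news)
    % Split the line into lowercase alphabetic tokens, once per line
    tokens = regexp(lower(char(news(i))), '[a-z]+', 'match');
    words  = unique(string(tokens));
    % A single set-membership test covers every keyword at once
    matched{i} = kws(ismember(kws, words));
end
```

Here `matched{i}` holds the keywords found in article `i`. Because whole words are compared, "bus" would not match "busy" under this scheme.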
Question: what do you want to do about substrings, such as "bus" occurring inside "busy", or about the fact that the word "strudels" contains a rude word? What do you want to do about pluralizations, which may or may not be regular -- if the keyword is "cat" then should "cats" be matched? If "bus" is the keyword should "busses" be matched? If "mouse" is the keyword should "mice" be matched? If "moose" is the keyword, should "meese" be matched?
Song Decn
on 28 Jan 2021
Song Decn
on 28 Jan 2021
Walter Roberson
on 28 Jan 2021
What do you want to do about substrings, plurals, upper/lowercase and the other factors I asked about? For example, if the headline were "Elon visits Oak Hammock Marsh", is it acceptable that this would match "Mars"? And "Elon eats musk-melon"? And "Euchre trumps Bridge in recent poll"?
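The difference between substring and whole-word matching can be sketched with the headlines above. The keyword 'Musk' for the second headline is an assumption, since the comment does not name it. The sketch uses the `\<` and `\>` word anchors of MATLAB regular expressions and escapes the keyword with `regexptranslate` in case it contains special characters.

```matlab
headline = 'Elon visits Oak Hammock Marsh';
kw = 'Mars';

% Plain substring search: true, because "Marsh" contains "Mars"
substringHit = contains(headline, kw);

% \< and \> anchor the match at the start and end of a word
pattern = ['\<' regexptranslate('escape', kw) '\>'];
wholeWordHit = ~isempty(regexp(headline, pattern, 'once'));

% Assumed keyword 'Musk': the hyphen ends the word, so a case-insensitive
% whole-word search still matches "musk" inside "musk-melon"
hyphenHit = ~isempty(regexp('Elon eats musk-melon', '\<Musk\>', 'once', 'ignorecase'));
```

Whole-word matching removes the "Marsh" false positive, but the "musk-melon" case shows that word boundaries and case handling alone do not settle every question raised here.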
Song Decn
on 8 Feb 2021
Accepted Answer
More Answers (0)