
How to find the most used word in a text?

Armina Petrean on 3 Apr 2023
Edited: DGM on 3 Apr 2023
I have a Notepad file containing a literary text, and I need to find the most repeated word or words and how many times they appear in that text.
  1 Comment
the cyclist on 3 Apr 2023
Edited: the cyclist on 3 Apr 2023
FYI, this question was closed by another editor as a duplicate, but I don't think it was. This question is asking about repeated words, and the other was asking about repeated letters.


Answers (3)

the cyclist on 3 Apr 2023
Edited: the cyclist on 3 Apr 2023
I'm putting this answer here as possibly the "canonical" MATLAB answer, but I expect you do not have the Text Analytics Toolbox.
myTextFile = "sonnets.txt"; % Put your file name here
str = extractFileText(myTextFile);
T = wordCloudCounts(str);

DGM on 3 Apr 2023
Edited: DGM on 3 Apr 2023
Define "word". Once you have defined "word" and have implemented a means to split a block of text into said words, then the rest is basic.
I'm sure this can be improved a lot, but I was in a hurry.
bunchofwords = fileread('wordpile.txt')
bunchofwords =
'This is a text file. This file contains many words. It also contains a list: 1: Entry one (first) 2: Entry two (second) 3: This is the third entry in the list. Sometimes words need to be hyphen- ated in order to make them fit. I'm sure any reasonably-observant person would notice that not all hyphenation should be treated the same. I'm sure they'd also notice the problems with quotes and 'apostrophes'. '
% i assume the capitalization doesn't matter
bunchofwords = lower(bunchofwords);
% try to fix words that are hyphenated on linebreaks
% but not all hyphenation is done with U+002D
bunchofwords = regexprep(bunchofwords,'(?<=\w+)-(\r\n|\r|\n)+(?=\w+)','');
% split the file into blobs separated by whitespace
% this causes lots of problems
%words = regexp(bunchofwords,'\S+','match');
% instead, split the file into blobs of "word" type characters
% this still has problems, but it's a bit better
words = regexp(bunchofwords,'\w+','match');
% find unique words
[uwords,~,uwidx] = unique(words);
% get histogram counts and sort them
hc = histcounts(uwidx,'binmethod','integers');
[hc, hcidx] = sort(hc,'descend');
% sort unique word list by frequency
uwordssorted = uwords(hcidx);
% display the results as a table as a cursory effort toward readability
table(uwordssorted.',hc.')
ans = 54×2 table
        Var1        Var2
    ____________    ____
    {'the'     }     4
    {'entry'   }     3
    {'this'    }     3
    {'a'       }     2
    {'also'    }     2
    {'be'      }     2
    {'contains'}     2
    {'file'    }     2
    {'i'       }     2
    {'in'      }     2
    {'is'      }     2
    {'list'    }     2
    {'m'       }     2
    {'notice'  }     2
    {'sure'    }     2
    {'to'      }     2
Note that this still has plenty of problems with contractions.
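One possible way to reduce the contraction problem is to let a word continue across an apostrophe when word characters follow it. The following is only a minimal sketch under that assumption: the sample text is made up for illustration, and typographic apostrophes (such as U+2019) are not handled.

```matlab
% Illustrative sample text (not taken from wordpile.txt)
txt = lower('I''m sure they''d notice the ''apostrophes''.');

% Match a run of word characters, optionally followed by one or more
% groups of "apostrophe + word characters" (e.g. i'm, they'd).
% Quotes that merely surround a word are not followed by word
% characters on the inside, so they are left out of the match.
words = regexp(txt, '\w+(?:''\w+)*', 'match');

disp(words)
```

With this pattern, 'i'm' and 'they'd' stay whole, while the quotes around 'apostrophes' are dropped. Whether contractions should count as one word or two is still part of defining "word".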
  2 Comments
Image Analyst on 3 Apr 2023
Or simpler than
words = regexp(bunchofwords,'\w+','match');
is to use strsplit
words = strsplit(bunchofwords);
DGM on 3 Apr 2023
Edited: DGM on 3 Apr 2023
No, that would be similar to the first example, naively splitting on whitespace. This causes problems with any punctuation. Note the cases of 'file', 'list', and 'words'.
bunchofwords = fileread('wordpile.txt');
bunchofwords = lower(bunchofwords);
uwords = unique(strsplit(bunchofwords))
uwords = 1×34 cell array
Columns 1 through 17
{0×0 char} {'1:'} {'2:'} {'3:'} {'a'} {'also'} {'ated'} {'be'} {'contains'} {'entry'} {'file'} {'file.'} {'fit.'} {'hyphen-'} {'in'} {'is'} {'it'}
Columns 18 through 33
{'list.'} {'list:'} {'make'} {'many'} {'need'} {'one'} {'order'} {'sometimes'} {'text'} {'the'} {'them'} {'third'} {'this'} {'to'} {'two'} {'words'}
Column 34
{'words.'}
uwords = unique(regexp(bunchofwords,'\S+','match'))
uwords = 1×33 cell array
Columns 1 through 17
{'1:'} {'2:'} {'3:'} {'a'} {'also'} {'ated'} {'be'} {'contains'} {'entry'} {'file'} {'file.'} {'fit.'} {'hyphen-'} {'in'} {'is'} {'it'} {'list.'}
Columns 18 through 33
{'list:'} {'make'} {'many'} {'need'} {'one'} {'order'} {'sometimes'} {'text'} {'the'} {'them'} {'third'} {'this'} {'to'} {'two'} {'words'} {'words.'}
uwords = unique(regexp(bunchofwords,'\w+','match'))
uwords = 1×30 cell array
Columns 1 through 17
{'1'} {'2'} {'3'} {'a'} {'also'} {'ated'} {'be'} {'contains'} {'entry'} {'file'} {'fit'} {'hyphen'} {'in'} {'is'} {'it'} {'list'} {'make'}
Columns 18 through 30
{'many'} {'need'} {'one'} {'order'} {'sometimes'} {'text'} {'the'} {'them'} {'third'} {'this'} {'to'} {'two'} {'words'}
I'm sure there are better ways to handle splitting into words, but using \w+ was simple enough.
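For readers without wordpile.txt, the same difference can be reproduced on a short inline string. This is just a sketch; the sample sentence is invented for illustration and is not part of the original file.

```matlab
% Illustrative sample text (hypothetical, not from wordpile.txt)
s = lower('This file. This list: the list.');

% Splitting on whitespace keeps trailing punctuation attached,
% so 'list.' and 'list:' are counted as different "words".
byspace = unique(strsplit(s));

% Matching runs of word characters drops the punctuation,
% so both occurrences collapse to 'list'.
byword = unique(regexp(s, '\w+', 'match'));

disp(byspace)
disp(byword)
```

Here the whitespace split yields five distinct tokens ('file.', 'list.', 'list:', 'the', 'this'), while the \w+ match yields four ('file', 'list', 'the', 'this').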



Image Analyst on 3 Apr 2023
If you don't have the Text Analytics Toolbox (which @the cyclist's solution requires), then you can get a histogram like this:
str = 'abcddrd,ee,fghd,**^^###$s t q j'; % Whatever your character array is
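% Convert characters to their numeric character codes.
strAscii = double(str);
% Plot the histogram with one bin per integer code.
% histogram returns a graphics object; the bin counts are in its Values property.
h = histogram(strAscii, 'BinMethod', 'integers');
counts = h.Values;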
% Fancy up the plot.
grid on;
xlabel('ASCII value');
ylabel('Count');
title('Histogram of Characters')
  2 Comments
the cyclist on 3 Apr 2023
Unless I misunderstand, this solution finds the count of characters. This question (and my solution) is about finding words.
Image Analyst on 3 Apr 2023
I think your solution is more like what the OP wants. But maybe I'll leave mine up in case someone in the future stumbles across it and wants a histogram of characters.
By the way, if he doesn't have that toolbox, is there a solution for a histogram of complete words?
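For what it's worth, word counts can be obtained in base MATLAB without the toolbox. The following is a minimal sketch, assuming a "word" is any run of \w characters and that capitalization does not matter; the sample text is made up, and in practice txt would come from fileread.

```matlab
% Illustrative sample text; replace with txt = fileread('yourfile.txt')
txt = 'The cat and the dog saw the bird. The bird saw the cat.';

% Lowercase, then split into runs of word characters
words = regexp(lower(txt), '\w+', 'match');

% Map every word to the index of its unique entry, then count per index
[uwords, ~, idx] = unique(words);
counts = accumarray(idx(:), 1).';

% Keep every word that reaches the maximum count, so ties are reported
maxCount = max(counts);
topWords = uwords(counts == maxCount);

fprintf('Most used: %s (%d times)\n', strjoin(topWords, ', '), maxCount);
```

For this sample, the most used word is 'the' with 5 occurrences. Passing counts to bar, with uwords as the tick labels, would give a histogram of complete words.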
