Word frequencies for each label

Question

I have a dataset with two columns: text and data. The data column contains two labels, 0 and 1. I would like to calculate the frequency of each word for each label. For example, how many times does the word "damage" appear within class 1 and within class 0? How can I do this? Also, I am not sure whether I need to tokenize the text first. Could I use a for loop?

Here is a small image showing a similar result. I would like a table like it.

Answer 1

Karim on 7 Jul 2022

Edited: Karim on 7 Jul 2022

dati_classificati.xlsx

Edited so that the code works with the example data that was added later.

% read the file
data = readtable("dati_classificati.xlsx",'TextType','string');
% split each sentence into words, assuming a single space is used as the delimiter
cell_text = arrayfun(@(s) split(s,' '), data.text, 'UniformOutput', false);
% count the number of words in each sentence
numWords = cellfun(@numel, cell_text);
% expand the labels to match the number of words for each sentence
expandedLabels = repelem( data.label ,numWords);
% gather the words in 1 big string array
expandedWords = vertcat(cell_text{:});
% list a few words to count the frequency...
MyWords = ["strada" "il" "Via" "donne" "della"];
% allocate a table for the results
varTypes = ["string","double","double"]; % data type for each column
varNames = ["Words","Ones","Zeros"]; % variable name for each column
MyResult = table('Size',[numel(MyWords) 3],'VariableTypes',varTypes,'VariableNames',varNames);
MyResult.Words = MyWords(:);
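% count the labels for each word
% (contains matches substrings and is case sensitive, so "il" also counts
%  any word that merely includes "il"; use expandedWords == MyResult.Words(i)
%  instead to count whole words only)
for i = 1:numel(MyWords)
    currLabels = expandedLabels( contains(expandedWords,MyResult.Words(i)) );
    MyResult.Ones(i) = sum(currLabels==1);
    MyResult.Zeros(i) = sum(currLabels==0);
end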
% display the results
MyResult
MyResult = 5×3 table
     Words      Ones    Zeros
    ________    ____    _____

    "strada"     48       1  
    "il"         34      20  
    "Via"        53       0  
    "donne"       0       2  
    "della"       3      14  
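A possible variation, not part of the original answer: the same table can be built for every distinct word at once, counting whole words only. This is a minimal sketch that uses only base MATLAB functions (`split`, `unique`, `accumarray`). The toy sentences and labels are made-up stand-ins for the spreadsheet, and lowercasing the words is an assumption about how "Via" and "via" should be treated.

```matlab
% Toy data standing in for the spreadsheet (hypothetical values)
text  = ["via roma strada"; "il ponte della strada"; "donne in via"];
label = [1; 0; 1];

% Split each sentence on whitespace and tag every word with its sentence label
words     = arrayfun(@(s) split(strtrim(s)), text, 'UniformOutput', false);
numWords  = cellfun(@numel, words);
allWords  = lower(vertcat(words{:}));   % assumption: ignore letter case
allLabels = repelem(label, numWords);

% Map each word to its position in the sorted list of unique words
[uWords, ~, wordIdx] = unique(allWords);

% Accumulate one count per unique word, separately for each label
Ones  = accumarray(wordIdx, double(allLabels == 1), [numel(uWords) 1]);
Zeros = accumarray(wordIdx, double(allLabels == 0), [numel(uWords) 1]);

WordCounts = table(uWords, Ones, Zeros, 'VariableNames', ["Words","Ones","Zeros"]);
disp(WordCounts)
```

Because `unique` compares whole strings, "il" is no longer counted inside longer words, which differs from the `contains` loop above.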

Rachele Franceschini on 7 Jul 2022

I used your code and attached an image of the result. I also tried adding a preprocessing step to clean the data. But what I would like to get is how many times a word such as "ciao" appears within classes 1 and 0, and so on.

% first generate some random data..
MyWords = daticlassificati.text;
% now create a big list from the set of words
numItems = 1000; 
BigList = MyWords ( randi(numel(MyWords),1,numItems) )
% create a list with random labels 0 or 1
RandomLabel = daticlassificati.label
uWords = unique(BigList);
% allocate a table for the results
varTypes = ["string","double","double"]; % data type for each column
varNames = ["Words","Ones","Zeros"]; % variable name for each column
MyResult = table('Size',[numel(uWords) 3],'VariableTypes',varTypes,'VariableNames',varNames);
MyResult.Words = uWords(:);
% count the labels for each word
for i = 1:numel(uWords)
    currLabels = RandomLabel(contains(BigList,MyResult.Words(i)));
    MyResult.Ones(i) = sum(currLabels==1);
    MyResult.Zeros(i) = sum(currLabels==0);
end
% display the results
MyResult

Here is my code with a preprocessing step to clean the dataset:

% input file: Excel or text
filename = "dati_classificati.xlsx";
data = readtable(filename,'TextType','string');
% remove the rows of the table with empty text (as in "classify text data using deep learning")
idx = strlength(data.text) == 0;
data(idx,:) = [];
% extract all rows of the text column
textData = data.text;
% clean the data (remove punctuation etc.); preprocessText is not defined in this post
Train_pr = preprocessText(textData);
Train_bag = bagOfWords(Train_pr)
Train_bag = removeInfrequentWords(Train_bag,5);
[Train_bag,idx] = removeEmptyDocuments(Train_bag);
Train_bag
tbl_train = topkwords(Train_bag,2000);

Karim on 7 Jul 2022

I modified the original answer according to the file you provided; see the top of the answer. Note that I used the raw text and only included a few words, but this should show how the concept works.

Rachele Franceschini on 7 Jul 2022

Thank you very, very much! I also tried it with the preprocessing step and it works!
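For readers following the preprocessing route mentioned in this thread, here is one way the per-label counts could be read directly from a bag-of-words model. It is a sketch under assumptions, not the code the poster actually ran: it requires Text Analytics Toolbox, the toy sentences and labels are invented, and `lower` plus `erasePunctuation` stand in for the undefined `preprocessText` helper. The key point is that the labels must be filtered in step with the documents whenever rows are removed.

```matlab
% Toy data standing in for the spreadsheet (hypothetical values)
text  = ["Via Roma, strada chiusa."; "Il ponte della strada!"; ""; "Donne in via Roma"];
label = [1; 0; 1; 1];

% Drop empty rows from the text and the labels together
isEmpty = strlength(text) == 0;
text(isEmpty)  = [];
label(isEmpty) = [];

% Basic cleaning (assumed substitute for preprocessText)
documents = tokenizedDocument(lower(text));
documents = erasePunctuation(documents);

% Build the bag-of-words model; keep the labels aligned if documents are dropped
bag = bagOfWords(documents);
[bag, removedIdx] = removeEmptyDocuments(bag);
label(removedIdx) = [];

% bag.Counts is a documents-by-words matrix: sum the rows belonging to each label
Ones  = full(sum(bag.Counts(label == 1, :), 1))';
Zeros = full(sum(bag.Counts(label == 0, :), 1))';

WordCounts = table(bag.Vocabulary', Ones, Zeros, 'VariableNames', ["Words","Ones","Zeros"]);
disp(WordCounts)
```

If `removeInfrequentWords` is applied as in the earlier snippet, it should come before `removeEmptyDocuments`, so that documents emptied by that step are also removed from `label`.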
