Hybrid method for sentiment analysis: column number error

I am trying to perform a hybrid approach to sentiment analysis using both VADER sentiment scores and machine learning with an SVM. I concatenate the sentiment scores with the bag-of-words features, then predict the sentiment labels for the test set and evaluate the performance. However, my X data is one column off from what the test expects. The error I receive is:
"Error using classreg.learning.internal.numPredictorsCheck
X data must have 1662 column(s).
Error in classreg.learning.classif.CompactClassificationECOC/predict (line 335)
classreg.learning.internal.numPredictorsCheck(X,...
Error in assessment_hybrid (line 50)
YPred = predict(mdl, XTest);"
My code is:
% Load the movie review dataset
filename = "IMBD_reviews_first5000.csv";
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set
textDataTrain = dataTrain.review;
textDataTest = dataTest.review;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end

Accepted Answer

Drew on 12 Apr 2023
Edited: Drew on 12 Apr 2023
Updating this answer based on the comment below:
You have built one SVM classifier based on the bag-of-words features. It sounds like what you want to do is build another classifier that uses both the bag-of-words features and the VADER sentiment score feature. So, after you concatenate the sentiment scores with the bag-of-words features, you should train a new model on those combined features, and then evaluate that model.
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");
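The prediction and evaluation that follow should then use mdl2 instead of mdl; a minimal sketch, reusing the variables above:
% Predict sentiment labels for the test set with the combined-feature model
YPred = predict(mdl2, XTest);
% Evaluate performance as before
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);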
Original answer:
As the error message indicates, on line 50 of your code (shown below), the number of predictors in XTest does not match the number of predictors used to train the model.
YPred = predict(mdl, XTest);
After training the model, the lines below added one extra predictor, and hence the mismatch.
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
The number of predictors at model training time needs to match the number of predictors at model testing time.
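A quick way to confirm such a mismatch (a diagnostic sketch, assuming the variable names from your script) is to compare the feature counts just before calling predict:
% Number of predictors the model was trained on (1662, per the error message)
numel(mdl.PredictorNames)
% Number of columns predict will actually see (1663 after appending the VADER score column)
size(XTest, 2)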
2 Comments
oliver on 12 Apr 2023
Do you know how I would change the code to prevent adding one extra predictor?
Sanguk on 13 Apr 2023
Edited: Sanguk on 14 Apr 2023
I get this error:
"Error using horzcat
Dimensions of arrays being concatenated are not consistent.
Error in reun (line 44)
XTrain = [XTrain, sentimentScoresTrain];"
from running:
filename = "sentiment_irrelevantdrop_Chegg";
data = readtable(filename,'TextType','string');
data.sentiment = categorical(data.sentiment);
% Split dataset into training and test sets using holdout
cvp = cvpartition(data.sentiment, 'Holdout', 0.1);
dataTrain = data(cvp.training, :);
dataTest = data(cvp.test, :);
% Extract review text and sentiment labels from training and test set
textDataTrain = dataTrain.text;
textDataTest = dataTest.text;
YTrain = dataTrain.sentiment;
YTest = dataTest.sentiment;
% Preprocess training set
documents = preprocessText(textDataTrain);
% Create bag of words and remove infrequent words
bag = bagOfWords(documents);
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
% Encode training set using bag of words
XTrain = bag.Counts;
% Train SVM classifier
mdl = fitcecoc(XTrain, YTrain, "Learners", "linear");
% Preprocess test set
documentsTest = preprocessText(textDataTest);
documentsTrain = preprocessText(textDataTrain);
% Encode test set using bag of words
XTest = encode(bag, documentsTest);
% Compute sentiment scores for training and test sets using VADER
sentimentScoresTrain = vaderSentimentScores(documentsTrain);
sentimentScoresTest = vaderSentimentScores(documentsTest);
% Concatenate sentiment scores with bag of words features
XTrain = [XTrain, sentimentScoresTrain];
XTest = [XTest, sentimentScoresTest];
% Build new svm model using both bag-of-words and vader sentiment scores as
% features
mdl2 = fitcecoc(XTrain, YTrain, "Learners", "linear");
% Predict sentiment labels for test set
YPred = predict(mdl, XTest);
% Evaluate performance
accuracy = sum(YPred == YTest) / numel(YTest);
fprintf("Accuracy: %.2f%%\n", accuracy * 100);
confusion = confusionmat(YTest, YPred);
truePositive = confusion(1, 1);
falsePositive = confusion(2, 1);
trueNegative = confusion(2, 2);
falseNegative = confusion(1, 2);
% Compute precision, recall, and F-measure
precision = truePositive / (truePositive + falsePositive);
recall = truePositive / (truePositive + falseNegative);
fMeasure = 2 * precision * recall / (precision + recall);
% Compute accuracy
accuracy2 = (truePositive + trueNegative) / numel(YTest);
% Display results
disp(['True positive: ' num2str(truePositive)]);
disp(['False positive: ' num2str(falsePositive)]);
disp(['True negative: ' num2str(trueNegative)]);
disp(['False negative: ' num2str(falseNegative)]);
disp(['Precision: ' num2str(precision)]);
disp(['Recall: ' num2str(recall)]);
disp(['F-measure: ' num2str(fMeasure)]);
function documents = preprocessText(textData)
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = erasePunctuation(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);
end
By the way, thank you for your code, Oliver.
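A possible reason for the horzcat error (an assumption based on the code above, not confirmed in this thread) is that removeEmptyDocuments drops rows from bag.Counts and YTrain, while the VADER scores are computed on all of the re-preprocessed training documents, so the two arrays can end up with different numbers of rows. A sketch of how to check and realign, reusing the idx returned by removeEmptyDocuments:
% Compare row counts before concatenating
size(XTrain, 1)               % documents left after removeEmptyDocuments
size(sentimentScoresTrain, 1) % one score per preprocessed training document
% If they differ, drop the scores of the removed documents as well
sentimentScoresTrain(idx) = [];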


More Answers (0)
