Why do stop words still appear in the topics generated by LDA topic modeling, even though I removed them with the stop word removal function?
Hello, and good day to you.
I am doing topic modeling with Latent Dirichlet Allocation (LDA), which requires preprocessing (cleaning) the data first. I applied the following preprocessing steps, in this order:
1- Tokenize the text using tokenizedDocument.
2- Add part-of-speech details using addPartOfSpeechDetails.
3- Lemmatize the words using normalizeWords.
4- Erase punctuation using erasePunctuation.
5- Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
6- Remove words with 2 or fewer characters using removeShortWords.
7- Remove words with 15 or more characters using removeLongWords.
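The seven steps above can be sketched as a short script, in the same order as listed. This is only a minimal illustration, assuming the Text Analytics Toolbox is installed; the two sample sentences and the variable names (`textData`, `documents`, `leftover`) are hypothetical and not taken from the original data.

```matlab
% Hypothetical sample text; replace with your own string array.
textData = [
    "The quick brown fox jumps over the lazy dog and the cat."
    "An example of extraordinarily long words and of short ones."
    ];

% Steps 1-7, in the order described in the question.
documents = tokenizedDocument(textData);                 % 1- tokenize
documents = addPartOfSpeechDetails(documents);           % 2- part-of-speech tags
documents = normalizeWords(documents,'Style','lemma');   % 3- lemmatize
documents = erasePunctuation(documents);                 % 4- erase punctuation
documents = removeStopWords(documents);                  % 5- remove stop words
documents = removeShortWords(documents,2);               % 6- drop words with <= 2 characters
documents = removeLongWords(documents,15);               % 7- drop words with >= 15 characters

% Sanity check: list any default stop words left in the cleaned vocabulary.
leftover = documents.Vocabulary(ismember(lower(documents.Vocabulary), stopWords));
disp(leftover)
```

If `leftover` is empty, the default stop word list has been fully removed from the cleaned documents, so any stop words seen later would have to come from somewhere other than this variable.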
However, when the LDA model generates its topics (a topic in LDA being a collection of probably related words), one topic contains stop words, even though they were removed from the data in step 5. They should therefore not exist in the data modeled by LDA. Why are these stop words still there, showing up as one of the resulting topics, although they do not even exist in the Vocabulary of the model?
Please help!
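One way to narrow this down is to check directly whether any stop word is present in the fitted model's vocabulary or among its top topic words. The sketch below is not a diagnosis of the cause, only a check under stated assumptions: the toy corpus, the choice of 2 topics, and the names `stopList`, `vocabHits`, and `topicHits` are all hypothetical. If both results are empty for your own model, the stop words you see are likely coming from a different variable (for example, an uncleaned document set or an older model) than the one you inspected.

```matlab
% Hypothetical toy corpus; replace with your own cleaned documents.
textData = [
    "The engine of the car is overheating and the coolant is leaking."
    "The mechanic replaced the coolant pump and tested the engine."
    "The printer is jammed and the paper tray is broken."
    "A technician repaired the printer and replaced the paper tray."
    ];

% Same preprocessing order as in the question.
documents = tokenizedDocument(textData);
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');
documents = erasePunctuation(documents);
documents = removeStopWords(documents);
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Fit LDA on the CLEANED documents only.
bag = bagOfWords(documents);
bag = removeEmptyDocuments(bag);
rng('default')                      % reproducible fit
mdl = fitlda(bag,2,'Verbose',0);    % 2 topics is an arbitrary choice here

% Check 1: stop words in the model vocabulary.
stopList = stopWords;
vocabHits = mdl.Vocabulary(ismember(lower(mdl.Vocabulary), stopList));

% Check 2: stop words among the top words of each topic.
k = min(5, numel(mdl.Vocabulary));
topicHits = strings(0,1);
for t = 1:mdl.NumTopics
    tbl = topkwords(mdl,k,t);       % table with Word and Score columns
    words = tbl.Word;
    topicHits = [topicHits; words(ismember(lower(words), stopList))]; %#ok<AGROW>
end

disp(vocabHits)
disp(topicHits)
```

Note that `stopWords` returns only the toolbox's default list. Words that merely look like stop words but are not on that list would not be flagged here and would need to be removed separately, for example with `removeWords`.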
Answers (0)