Why do stop words still appear in the topics generated by LDA topic modeling, although I removed them with the stop word removal function?

Jack

25 Sep 2021

Updated 25 Sep 2021

Hello, and good day to you.

I am doing topic modeling with Latent Dirichlet Allocation (LDA), which requires preprocessing (cleaning) the data first. I applied the following preprocessing steps, in this order:

1- Tokenize the text using tokenizedDocument.

2- Add part-of-speech details using addPartOfSpeechDetails.

3- Lemmatize the words using normalizeWords.

4- Erase punctuation using erasePunctuation.

5- Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.

6- Remove words with 2 or fewer characters using removeShortWords.

7- Remove words with 15 or more characters using removeLongWords.
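
For reference, here is a minimal sketch of the seven steps above in the same order, using the Text Analytics Toolbox functions named in the list. The sample strings in textData are hypothetical placeholders, not my real data, and the 'Style','lemma' option is my assumption of how the lemmatization in step 3 is requested.

```matlab
% Hypothetical sample input; replace with the real text data.
textData = [
    "The quick brown fox jumps over the lazy dog."
    "Topic models summarize large collections of documents."];

% 1- Tokenize the text.
documents = tokenizedDocument(textData);

% 2- Add part-of-speech details (used by the lemmatizer).
documents = addPartOfSpeechDetails(documents);

% 3- Lemmatize the words.
documents = normalizeWords(documents,'Style','lemma');

% 4- Erase punctuation.
documents = erasePunctuation(documents);

% 5- Remove the default list of stop words.
documents = removeStopWords(documents);

% 6- Remove words with 2 or fewer characters.
documents = removeShortWords(documents,2);

% 7- Remove words with 15 or more characters.
documents = removeLongWords(documents,15);
```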

However, in the topics generated by the LDA model (where a topic is a collection of probably related words), one topic contains stop words, even though they were removed in step 5 and so should not be present in the data modeled by LDA. Why do these stop words still appear in one of the resulting topics, although they do not even exist in the Vocabulary of the model?
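
One way to check this could look like the sketch below: it compares the model Vocabulary and the top words of each topic against the list returned by stopWords. The sample strings, the choice of 2 topics, and the cutoff of 10 top words are hypothetical values for illustration only, and the preprocessing is shortened to the stop word step.

```matlab
% Hypothetical sample input; replace with the real text data.
textData = [
    "The quick brown fox jumps over the lazy dog."
    "Topic models summarize large collections of documents."
    "Stop words such as and, of, and the carry little meaning."];

% Shortened preprocessing: tokenize, erase punctuation, remove stop words.
documents = tokenizedDocument(textData);
documents = erasePunctuation(documents);
documents = removeStopWords(documents);

% Build the bag-of-words model and fit LDA (2 topics is an arbitrary choice).
bag = bagOfWords(documents);
mdl = fitlda(bag,2,'Verbose',0);

% Stop words that are still present in the model vocabulary, if any.
inVocabulary = intersect(mdl.Vocabulary,stopWords)

% Stop words among the top words of each topic, if any.
k = min(10,bag.NumWords);
for topicIdx = 1:mdl.NumTopics
    tbl = topkwords(mdl,k,topicIdx);
    inTopic = intersect(tbl.Word,stopWords)
end
```

If inVocabulary comes back empty but a topic still shows stop words, the words being displayed are probably not coming from this model's Vocabulary.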

Please help!