removing stopwords when embedding is tf-idf #760

avafor · 2022-10-04T18:40:42Z

Hi,
Following one of the examples you have provided, I was trying to do:

`vectorizer = TfidfVectorizer(min_df=5)
embeddings = vectorizer.fit_transform(docs)

# Train our topic model using TF-IDF vectors
topic_model = BERTopic(stop_words="english")`

However, I get an error saying that BERTopic does not accept a `stop_words` argument.
What I would like to do is remove the stop words before the clustering step. My intention is to get rid of the stop words not only in the topic representation but also in the clustering.

Thank you and Best Regards,
Avafor

MaartenGr (Maintainer) · 2022-10-05T07:07:01Z

If you want to remove stop words before the clustering, you have to clean your documents beforehand and then pass them to BERTopic. Note that `TfidfVectorizer` also supports a `stop_words` argument, which you can use to keep stop words out of the embeddings.
