
Getting Representative Documents for Topics: bertopic==0.9.2 #285

Closed
gsalfourn opened this issue Oct 13, 2021 · 9 comments

@gsalfourn

@MaartenGr,

In the link: https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_representative_docs you show how to extract representative documents for all topics or a single topic.

To extract the representative docs of all topics you suggest using representative_docs = topic_model.get_representative_docs()
and to get the representative docs of a single topic, to use representative_docs = topic_model.get_representative_docs(topic=12)

Getting the representative docs for a single topic works as you suggested. However, there appears to be a problem with getting the representative docs for all topics using the approach you suggested. When I call topic_model.get_representative_docs() with no arguments, it gives me an error message suggesting that I am missing an argument:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_49040/3786687066.py in <module>
----> 1 topic_model.get_representative_docs()
TypeError: get_representative_docs() missing 1 required positional argument: 'topic'

The interesting thing is that when I use the following three approaches:

>>> topic_model.get_representative_docs(topic_model)
>>> topic_model.representative_docs
>>> topic_model.representative_docs.items()

none of them give any error messages; they all give me an unordered dictionary of representative docs for all topics.

MaartenGr added a commit that referenced this issue Oct 13, 2021
@MaartenGr
Owner

I see, this happens because a = None default is missing on the topic parameter, which makes it a required argument. For now, you can get all representative documents like this: topic_model.get_representative_docs(topic=None).

However, there is a quick fix available here; I will most likely release it to PyPI today.
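For readers hitting the same TypeError before upgrading, the cause can be sketched in plain Python. The class below is a toy stand-in, not BERTopic's actual implementation: a parameter without a default is required, while adding =None restores the no-argument call.

```python
# Toy stand-in for the bug; not BERTopic's real code.
class BuggyModel:
    def get_representative_docs(self, topic):  # missing "=None" default
        docs = {0: ["doc a"], 1: ["doc b"]}
        return docs if topic is None else docs[topic]


class FixedModel:
    def get_representative_docs(self, topic=None):  # default restored
        docs = {0: ["doc a"], 1: ["doc b"]}
        return docs if topic is None else docs[topic]


try:
    BuggyModel().get_representative_docs()  # raises TypeError
except TypeError as e:
    print(e)

print(FixedModel().get_representative_docs())         # full dict of all topics
print(FixedModel().get_representative_docs(topic=1))  # docs for one topic
```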

MaartenGr mentioned this issue Oct 13, 2021
@MitraMitraMitra

MitraMitraMitra commented Oct 13, 2021

I was also confused about why topic_model.get_representative_docs(x) was returning a dictionary instead of a list. It turned out I was passing a string that looked like a number (I got x from the 'Topic' column of the dataframe returned by topic_model.get_topic_info()) instead of an integer.
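To avoid that pitfall, the 'Topic' values can be cast to int before being passed along. A minimal sketch, with a hypothetical dataframe standing in for the output of topic_model.get_topic_info():

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topic_info();
# the 'Topic' column arrives as strings here.
topic_info = pd.DataFrame({"Topic": ["-1", "0", "12"], "Count": [40, 25, 7]})

# Cast to int before calling get_representative_docs(topic=...)
topic_ids = topic_info["Topic"].astype(int).tolist()
print(topic_ids)  # [-1, 0, 12]
```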

MaartenGr added a commit that referenced this issue Oct 17, 2021
@MaartenGr
Owner

@gsalfourn @MitraMitraMitra A new version of BERTopic (v0.9.3) was released that fixes this issue, along with some other helpful changes. You can install it through pip install --upgrade bertopic. If you have any questions regarding this issue, the release, or anything else, please let me know!

@amrityap

amrityap commented Aug 1, 2022

Hey Maarten, I was running BERTopic on user reviews of an app. My goal is to perform sentiment analysis on the reviews per topic. I managed to get topics, but now I need to print the reviews per topic along with their sentiment label (1 or 0). topic_model.get_representative_docs() only prints the reviews with their topic. Is there a way to keep other columns, like the sentiment label and star rating, so I can perform sentiment analysis per topic?
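One way to approach this is to skip get_representative_docs() entirely and attach the topic assignments from fit_transform back onto the original dataframe, since fit_transform returns one topic per input document in the same order. A hedged sketch; the dataframe and the "sentiment"/"stars" columns are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical review data; "sentiment" and "stars" stand in for your columns.
reviews = pd.DataFrame({
    "review": ["great app", "keeps crashing", "love the update"],
    "sentiment": [1, 0, 1],
    "stars": [5, 1, 4],
})

# Stand-in for: topics, probs = topic_model.fit_transform(reviews["review"])
topics = [0, 1, 0]

# fit_transform returns one topic per input document, in order,
# so the list can be assigned straight back as a column.
reviews["topic"] = topics

# Any per-topic aggregation now keeps the extra columns available.
sentiment_per_topic = reviews.groupby("topic")["sentiment"].mean()
print(sentiment_per_topic)
```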

@mdcox

mdcox commented Aug 7, 2022

Hey @MaartenGr, I have been working with bertopic for a while and it is really awesome, I appreciate your dedication and help!

I am trying to get representative documents for all topics (minus the -1 topic). When I use topic_model.get_representative_docs() it only partially works: quite a few topics do not have a representative document. Is that expected behavior? I would expect at least one document per topic, so this seems a little odd. If so, is there a workaround?

Here is the code I am using to get the representative sentences:

def get_topic_rep_docs(bert_model):
    """
    Desc:
        - takes the BERTopic model and collects the model-assigned
        representative docs for each topic
    Inputs:
        - bert_model: the model, which provides the get_topics() method
    Returns:
        - all_topics_rep: dict with topic numbers as keys and arrays of each
        topic's representative docs as values
    """

    all_topics = bert_model.get_topics()
    all_topics_rep = {}

    for key in all_topics.keys():
        # Skip the -1 outlier topic
        if key == -1:
            continue
        all_topics_rep[key] = bert_model.get_representative_docs(key)

    return all_topics_rep

@MaartenGr
Owner

@mdcox It seems that your code is correct, so it is indeed strange that you are not getting the topics you are looking for. Having said that, it is difficult to say more without knowing a bit more about your setup. Which version of BERTopic are you using? Could you share your entire code for training BERTopic?

@mdcox

mdcox commented Aug 8, 2022

@MaartenGr thank you for the quick reply! We are currently using bertopic==0.9.4. Below is the training code that we are currently running. I am doing some deeper debugging also and will update if that leads anywhere!

import os
import nltk
import time
import torch
import pandas as pd
from bertopic import BERTopic
import model.bertvis as bv
# from sentence_transformers import SentenceTransformer
from model.preprocess import main_preprocess
import model.evaluate as eval

nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')


def get_file(data_path):

    """
    Desc:
        Loading data based on file path, this can be used for local use but
        is likely not relevant when pulling data from AWS
    Inputs:
        data_path (str): path to csv file that contains data to be used for
        model training
    Returns:
        dataframe: dataframe with data
    """

    cwd = os.getcwd()
    # Print the current working directory
    print(f"Current working directory: {cwd}")
    print("Trying to get: ", data_path)

    assert os.path.exists(data_path), \
        "File not found at specified path: " + str(data_path)
    print("Train data loaded successfully")
    names = ['comment', 'morale', 'date']

    return pd.read_csv(data_path, header=None, names=names)


def bert_model(docs, embedding_model, vectorizer_model,
               num_of_topics, topic_num_words, docs_per_topic,
               calc_probs=True):
    """
    bert_model generates the model object and returns the
    trained model on the documents

    Args:
        docs (arr of str): all documents that have been cleaned
        embedding_model (downloaded embedded model): method used to embed docs
        vectorizer_model: model used to vectorized documents
        num_of_topics (int): how many topics you want from the model
        topic_num_words (int): words per topic
        docs_per_topic (int): number of docs required for a cluster to be
                              considered a topic
        calc_probs (bool, optional):  to calc the probabilities from Bertopic.

    Returns:
        model (bertopic model): trained bertopic model
        topics (array): topics produced by the model
        probs (array): the probability each keyword is found in assigned topic
    """

    model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        calculate_probabilities=calc_probs,
        nr_topics=num_of_topics,
        top_n_words=topic_num_words,   # Default is 10.
        verbose=True,
        # nr_topics="auto",  # Interesting, but produces too many topics.
        min_topic_size=docs_per_topic,  # Min number of docs per topic
    )

    print("[  train  ] Fitting...")
    start = time.time()
    # Fit & Transform on supplied data
    topics, probs = model.fit_transform(docs)
    end = time.time()
    print("[  train  ] Fitted")

    print(f"[  train  ]     FIT_TRANSFORM TIME:     {end-start}")

    return model, topics, probs


def get_score_dict(model, topics, docs, doc_ids, topic_num_words):

    score_dict = {
        'topic_num_words': topic_num_words,
        'model': model,
        'topics': topics,
        'docs': docs,
        'doc_ids': doc_ids,
    }

    return score_dict


def train_main(data_df, args_dict):
    """
    Desc:
        Main function that runs all needed components to train model
        based on datafile.
    Inputs:
        data_df: dataframe containing s3 data
        args_dict: dictionary with parameters from cmd line args
    Returns:
        results_dict: dictionary containing the results
        idm_vis: json data for model visualization from pyLDAvis
        tot_vis: json data for topics over time visualization
    """
    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    # CUDA check
    device = torch.device("cuda:0"
                          if torch.cuda.is_available()
                          else "cpu")
    print(f"[  train  ] Device:                 {device}")

    num_of_topics = args_dict["num_of_topics"]
    topic_num_words = args_dict["num_of_keywords"]
    docs_per_topic = args_dict["docs_per_topic"]
    embedding_dict = {
        'sentence_embeded_model': True,
        'flair_embedding_model': False,
        'spacy_transformer': False,
        'USE': False
    }

    # Preprocessing Timing start
    start = time.time()
    print('[  train  ] Data Preprocessing...')
    embedding_model, vectorizer_model,\
        docs, doc_ids, timestamps = main_preprocess(data_df, embedding_dict)

    end = time.time()
    print(f"[  train  ]     PREPROCESS TIME:    {end-start}")

    print('[  train  ] Model Training...')
    model, topics, probs = bert_model(docs,
                                      embedding_model,
                                      vectorizer_model,
                                      num_of_topics,
                                      topic_num_words,
                                      docs_per_topic,
                                      calc_probs=True,
                                      )

    score_dict = get_score_dict(model, topics, docs, doc_ids, topic_num_words)
    result_dict = eval.score_main(score_dict, model)

    idm_vis, tot_vis = bv.vis_main(model, docs, probs,
                                   num_of_topics, topics, timestamps)

    return result_dict, idm_vis, tot_vis

@MaartenGr
Owner

@mdcox There have been some significant fixes since v0.9.4, and I believe correctly finding representative documents might be one of them. I would advise using the newest version, as those issues might already be fixed.

@mdcox

mdcox commented Aug 11, 2022

@mdcox There have been some significant fixes since v0.9.4, and I believe correctly finding representative documents might be one of them. I would advise using the newest version, as those issues might already be fixed.

Okay, good note! It turns out there was a small bug that I just found as well. Thank you very much for the help; I will definitely start using the updated version.
