
Getting Representative Documents for Topics: bertopic==0.9.2 #285

Closed
gsalfourn opened this issue Oct 13, 2021 · 9 comments

@gsalfourn

@MaartenGr,

In the link: https://maartengr.github.io/BERTopic/api/bertopic.html#bertopic._bertopic.BERTopic.get_representative_docs you show how to extract representative documents for all topics or a single topic.

To extract the representative docs of all topics you suggest using representative_docs = topic_model.get_representative_docs()
and to get the representative docs of a single topic, to use representative_docs = topic_model.get_representative_docs(topic=12)

Getting the representative docs for a single topic works as you suggested. However, there appears to be a problem with getting the representative docs for all topics using the approach you suggested. When I call topic_model.get_representative_docs() with no arguments, it gives me an error message suggesting that I am missing an argument:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_49040/3786687066.py in <module>
----> 1 topic_model.get_representative_docs()
TypeError: get_representative_docs() missing 1 required positional argument: 'topic'

The interesting thing is that when I use the following three approaches:

>>> topic_model.get_representative_docs(topic_model)
>>> topic_model.representative_docs
>>> topic_model.representative_docs.items()

none of them give any error messages; they all give me an unordered dictionary of representative docs for all topics.

MaartenGr added a commit that referenced this issue Oct 13, 2021
@MaartenGr
Owner

I see, this happens because a = None default is missing on the topic parameter, which makes it a required argument. For now, you can get all representative documents like this: topic_model.get_representative_docs(topic=None).

However, there is a quick fix available here; I will most likely release it to PyPI today.
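For readers hitting the same TypeError before upgrading, the cause can be sketched in plain Python. The class below is a toy stand-in, not BERTopic's actual implementation: a parameter without a default is required, while adding =None restores the no-argument call.

```python
# Toy stand-in for the bug; not BERTopic's real code.
class BuggyModel:
    def get_representative_docs(self, topic):  # missing "=None" default
        docs = {0: ["doc a"], 1: ["doc b"]}
        return docs if topic is None else docs[topic]


class FixedModel:
    def get_representative_docs(self, topic=None):  # default restored
        docs = {0: ["doc a"], 1: ["doc b"]}
        return docs if topic is None else docs[topic]


try:
    BuggyModel().get_representative_docs()  # raises TypeError
except TypeError as e:
    print(e)

print(FixedModel().get_representative_docs())         # full dict of all topics
print(FixedModel().get_representative_docs(topic=1))  # docs for one topic
```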

MaartenGr mentioned this issue Oct 13, 2021
@MitraMitraMitra

MitraMitraMitra commented Oct 13, 2021

I was also confused about why topic_model.get_representative_docs(x) was returning a dictionary instead of a list. It turned out I was passing a string that looked like a number (I got x from the 'Topic' column of the dataframe returned by topic_model.get_topic_info()) instead of an integer.
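To avoid that pitfall, the 'Topic' values can be cast to int before being passed along. A minimal sketch, with a hypothetical dataframe standing in for the output of topic_model.get_topic_info():

```python
import pandas as pd

# Hypothetical stand-in for topic_model.get_topic_info();
# the 'Topic' column arrives as strings here.
topic_info = pd.DataFrame({"Topic": ["-1", "0", "12"], "Count": [40, 25, 7]})

# Cast to int before calling get_representative_docs(topic=...)
topic_ids = topic_info["Topic"].astype(int).tolist()
print(topic_ids)  # [-1, 0, 12]
```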

MaartenGr added a commit that referenced this issue Oct 17, 2021
@MaartenGr
Owner

@gsalfourn @MitraMitraMitra A new version of BERTopic (v0.9.3) was released that fixes this issue, along with some other helpful changes. You can install it through pip install --upgrade bertopic. If you have any questions regarding this issue, the release, or anything else, please let me know!

@amrityap

amrityap commented Aug 1, 2022

Hey Maarten, I was running BERTopic on user reviews of an app. My goal is to perform sentiment analysis on the reviews per topic. I managed to get topics, but now I need to print the reviews per topic along with their sentiment label (1 or 0). topic_model.get_representative_docs() only prints the reviews with their topic. Is there a way to keep other columns, like the sentiment label and star rating, so I can perform sentiment analysis per topic?
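One way to approach this is to skip get_representative_docs() entirely and attach the topic assignments from fit_transform back onto the original dataframe, since fit_transform returns one topic per input document in the same order. A hedged sketch; the dataframe and the "sentiment"/"stars" columns are hypothetical stand-ins for your own data:

```python
import pandas as pd

# Hypothetical review data; "sentiment" and "stars" stand in for your columns.
reviews = pd.DataFrame({
    "review": ["great app", "keeps crashing", "love the update"],
    "sentiment": [1, 0, 1],
    "stars": [5, 1, 4],
})

# Stand-in for: topics, probs = topic_model.fit_transform(reviews["review"])
topics = [0, 1, 0]

# fit_transform returns one topic per input document, in order,
# so the list can be assigned straight back as a column.
reviews["topic"] = topics

# Any per-topic aggregation now keeps the extra columns available.
sentiment_per_topic = reviews.groupby("topic")["sentiment"].mean()
print(sentiment_per_topic)
```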

@mdcox

mdcox commented Aug 7, 2022

Hey @MaartenGr, I have been working with bertopic for a while and it is really awesome, I appreciate your dedication and help!

I am trying to get representative documents for all topics (minus the -1 topic). When I use topic_model.get_representative_docs() it only partially works: quite a few topics do not have a representative document. Is that expected behavior? I would expect at least one document per topic, so this seems a little odd. If so, is there a workaround?

Here is the code I am using to get the representative sentences:

def get_topic_rep_docs(bert_model):
    """
    Desc:
        - takes the BERTopic model and collects the model-assigned
        representative docs for each topic
    Inputs:
        - bert_model: the model, which provides the get_topics() method
    Returns:
        - all_topics_rep: dict with topic numbers as keys and arrays of each
        topic's representative docs as values
    """

    all_topics = bert_model.get_topics()
    all_topics_rep = {}

    for key in all_topics.keys():
        # Skip the -1 outlier topic
        if key == -1:
            continue
        all_topics_rep[key] = bert_model.get_representative_docs(key)

    return all_topics_rep

@MaartenGr
Owner

@mdcox It seems that your code is correct, so it is indeed strange that you are not getting the topics you are looking for. Having said that, it is difficult to say more without knowing a bit more about your setup. Which version of BERTopic are you using? Could you share your entire code for training BERTopic?

@mdcox

mdcox commented Aug 8, 2022

@MaartenGr thank you for the quick reply! We are currently using bertopic==0.9.4. Below is the training code that we are currently running. I am doing some deeper debugging also and will update if that leads anywhere!

import os
import nltk
import time
import torch
import pandas as pd
from bertopic import BERTopic
import model.bertvis as bv
# from sentence_transformers import SentenceTransformer
from model.preprocess import main_preprocess
import model.evaluate as eval

nltk.download("punkt")
nltk.download("wordnet")
nltk.download('omw-1.4')


def get_file(data_path):

    """
    Desc:
        Loading data based on file path, this can be used for local use but
        is likely not relevant when pulling data from AWS
    Inputs:
        data_path (str): path to csv file that contains data to be used for
        model training
    Returns:
        dataframe: dataframe with data
    """

    cwd = os.getcwd()
    # Print the current working directory
    print(f"Current working directory: {cwd}")
    print("Trying to get: ", data_path)

    assert os.path.exists(data_path), \
        "File not found at specified path: " + str(data_path)
    print("Train data loaded successfully")
    names = ['comment', 'morale', 'date']

    return pd.read_csv(data_path, header=None, names=names)


def bert_model(docs, embedding_model, vectorizer_model,
               num_of_topics, topic_num_words, docs_per_topic,
               calc_probs=True):
    """
    bert_model generates the model object and returns the
    trained model on the documents

    Args:
        docs (arr of str): all documents that have been cleaned
        embedding_model (downloaded embedded model): method used to embed docs
        vectorizer_model: model used to vectorized documents
        num_of_topics (int): how many topics you want from the model
        topic_num_words (int): words per topic
        docs_per_topic (int): number of docs required for a cluster to be
                              considered a topic
        calc_probs (bool, optional):  to calc the probabilities from Bertopic.

    Returns:
        model (bertopic model): trained bertopic model
        topics (array): topics produced by the model
        probs (array): the probability each keyword is found in assigned topic
    """

    model = BERTopic(
        embedding_model=embedding_model,
        vectorizer_model=vectorizer_model,
        calculate_probabilities=calc_probs,
        nr_topics=num_of_topics,
        top_n_words=topic_num_words,   # Default is 10.
        verbose=True,
        # nr_topics="auto",  # Interesting, but produces too many topics.
        min_topic_size=docs_per_topic,  # Min number of docs per topic
    )

    print("[  train  ] Fitting...")
    start = time.time()
    # Fit & Transform on supplied data
    topics, probs = model.fit_transform(docs)
    end = time.time()
    print("[  train  ] Fitted")

    print(f"[  train  ]     FIT_TRANSFORM TIME:     {end-start}")

    return model, topics, probs


def get_score_dict(model, topics, docs, doc_ids, topic_num_words):

    score_dict = {
        'topic_num_words': topic_num_words,
        'model': model,
        'topics': topics,
        'docs': docs,
        'doc_ids': doc_ids,
    }

    return score_dict


def train_main(data_df, args_dict):
    """
    Desc:
        Main function that runs all needed components to train model
        based on datafile.
    Inputs:
        data_df: dataframe containing s3 data
        args_dict: dictionary with parameters from cmd line args
    Returns:
        results_dict: dictionary containing the results
        idm_vis: json data for model visualization from pyLDAvis
        tot_vis: json data for topics over time visualization
    """
    os.environ["TOKENIZERS_PARALLELISM"] = "true"

    # CUDA check
    device = torch.device("cuda:0"
                          if torch.cuda.is_available()
                          else "cpu")
    print(f"[  train  ] Device:                 {device}")

    num_of_topics = args_dict["num_of_topics"]
    topic_num_words = args_dict["num_of_keywords"]
    docs_per_topic = args_dict["docs_per_topic"]
    embedding_dict = {
        'sentence_embeded_model': True,
        'flair_embedding_model': False,
        'spacy_transformer': False,
        'USE': False
    }

    # Preprocessing Timing start
    start = time.time()
    print('[  train  ] Data Preprocessing...')
    embedding_model, vectorizer_model,\
        docs, doc_ids, timestamps = main_preprocess(data_df, embedding_dict)

    end = time.time()
    print(f"[  train  ]     PREPROCESS TIME:    {end-start}")

    print('[  train  ] Model Training...')
    model, topics, probs = bert_model(docs,
                                      embedding_model,
                                      vectorizer_model,
                                      num_of_topics,
                                      topic_num_words,
                                      docs_per_topic,
                                      calc_probs=True,
                                      )

    score_dict = get_score_dict(model, topics, docs, doc_ids, topic_num_words)
    result_dict = eval.score_main(score_dict, model)

    idm_vis, tot_vis = bv.vis_main(model, docs, probs,
                                   num_of_topics, topics, timestamps)

    return result_dict, idm_vis, tot_vis

@MaartenGr
Owner

@mdcox There have been some significant fixes since v0.9.4, and I believe correctly finding representative documents might be one of them. I would advise using the newest version, as those issues might already be fixed.

@mdcox

mdcox commented Aug 11, 2022

@mdcox There have been some significant fixes since v0.9.4, and I believe correctly finding representative documents might be one of them. I would advise using the newest version, as those issues might already be fixed.

Okay, good note! It turns out there was a small bug that I just found as well. Thank you very much for the help; I will definitely start using the updated version.
