GPyM_TM

GPyM_TM is a Python package for topic modelling with either the Dirichlet multinomial mixture model (GSDMM) [1] or the Gamma Poisson mixture model (GPM) [2]. Each model is available within the package in its own class, GSDMM and GPM respectively. The package is also available on PyPI.

Preamble

The aim of topic modelling is to extract latent topics from large corpora. GSDMM [1] and GPM [2] assume that each document belongs to a single topic, which is a suitable assumption for some short texts. Given an initial number of topics, K, these algorithms cluster the documents and extract the topical structures present within the corpus. If K is set to a high value, the models also learn the number of clusters automatically.

[1] Yin, J. and Wang, J., 2014, August. A Dirichlet multinomial mixture model-based approach for short text clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 233-242).

[2] Mazarura, J., de Waal, A. and de Villiers, P., 2020. A Gamma-Poisson Mixture Topic Model for Short Text. Mathematical Problems in Engineering, 2020.

Further details about the GPM can be found in my thesis here.
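
To make the clustering idea concrete, the following is a minimal sketch of a collapsed Gibbs sampler for a Dirichlet multinomial mixture, written from the sampling formula in [1]. It is not the code used inside GPyM_TM: the function name gsdmm_sketch, its arguments and the toy documents are all hypothetical, and documents are assumed to be lists of tokens. It illustrates how each document receives exactly one topic and how, when K is larger than needed, some clusters may end up empty.

```python
import math
import random
from collections import Counter


def gsdmm_sketch(docs, K, alpha=0.1, beta=0.1, iters=15, seed=0):
    """Toy collapsed Gibbs sampler for a Dirichlet multinomial mixture.

    docs is a list of token lists. Returns one cluster label per document.
    Illustrative only; this is not the GPyM_TM implementation.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    D = len(docs)
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # total words per cluster
    n_w = [Counter() for _ in range(K)]        # per-cluster word counts
    z = []

    # Random initial assignment of every document to one cluster.
    for doc in docs:
        k = rng.randrange(K)
        z.append(k)
        m[k] += 1
        n[k] += len(doc)
        n_w[k].update(doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            counts = Counter(doc)
            # Remove document d from its current cluster.
            k_old = z[d]
            m[k_old] -= 1
            n[k_old] -= len(doc)
            n_w[k_old].subtract(counts)

            # Log of the conditional probability of each cluster given the rest.
            log_p = []
            for k in range(K):
                lp = math.log(m[k] + alpha) - math.log(D - 1 + K * alpha)
                for w, c in counts.items():
                    for j in range(c):
                        lp += math.log(n_w[k][w] + beta + j)
                for i in range(len(doc)):
                    lp -= math.log(n[k] + V * beta + i)
                log_p.append(lp)

            # Sample a new cluster in proportion to exp(log_p).
            top = max(log_p)
            weights = [math.exp(lp - top) for lp in log_p]
            k_new = rng.choices(range(K), weights=weights)[0]
            z[d] = k_new
            m[k_new] += 1
            n[k_new] += len(doc)
            n_w[k_new].update(counts)
    return z


# Hypothetical toy corpus: K is deliberately larger than needed.
toy_docs = [["cat", "dog", "pet"], ["dog", "pet", "vet"],
            ["stock", "bank", "loan"], ["bank", "loan", "rate"]]
labels = gsdmm_sketch(toy_docs, K=8)
print(len(set(labels)), "non-empty clusters out of 8")
```

The number of distinct labels returned plays the same role as the final number of selected topics reported by the package, although the package may determine it differently.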

Getting Started:

The package is available online for use within Python 3 environments.

It can be installed with a standard pip command:

pip install GPyM-TM

Prerequisites:

The package depends on the following modules (random, math and re ship with the Python standard library; the rest are third-party packages):

numpy
random
math
pandas
re
nltk
gensim
scipy

GSDMM

Function and class description:

The class is named GSDMM, while the function itself is named DMM.

The function takes up to 6 arguments: 2 are required and the remaining 4 are optional.

The required arguments are:

corpus - a text file that has been cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed.
nTopics - the number of topics.
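
The cleaning described for corpus could look roughly like the sketch below. The helper name clean_document and the sample sentences are hypothetical, and whether the package expects cleaned strings or lists of tokens is not stated here, so the tutorial remains the reference for the exact input format.

```python
import re


def clean_document(text):
    """Lowercase the text and keep only the letters a-z and single spaces."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)     # drop punctuation, digits and symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


# Hypothetical raw documents, for illustration only.
raw_docs = ["Topic modelling, in 2020!", "GPM & GSDMM: short-text models."]
corpus = [clean_document(doc) for doc in raw_docs]
print(corpus)
```

Note that this pattern also removes accented and other non-ASCII letters, which may not be desirable for every corpus.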

The optional arguments are:

alpha, beta - the distribution-specific parameters. (The default for both is 0.1.)
nTopWords - the number of top words per topic. (The default is 10.)
iters - the number of Gibbs sampler iterations. (The default is 15.)

Output:

The function provides several components of output, namely:

psi - topic x word matrix.
theta - document x topic matrix.
topics - the top words per topic.
assignments - the topic numbers of selected topics only, as well as the final topic assignments.
Final k - the final number of selected topics.
coherence - the coherence score, which is a performance measure.
selected_theta
selected_psi
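
As an illustration of how matrices shaped like psi and theta can be read, the sketch below derives top words and hard topic assignments from small hand-made matrices. The helper names, the vocabulary and all numbers are invented for the example; how the package itself returns and orders these outputs is not shown here.

```python
def top_words(psi, vocabulary, n_top=10):
    """Return the n_top highest-weight words for each row (topic) of psi."""
    topics = []
    for row in psi:
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        topics.append([vocabulary[j] for j in ranked[:n_top]])
    return topics


def hard_assignments(theta):
    """Return the index of the largest entry in each row (document) of theta."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in theta]


# Invented example: 2 topics, 4 words, 3 documents.
vocabulary = ["ball", "goal", "vote", "party"]
psi = [[0.5, 0.4, 0.05, 0.05],   # topic x word
       [0.1, 0.1, 0.3, 0.5]]
theta = [[0.9, 0.1],             # document x topic
         [0.2, 0.8],
         [0.7, 0.3]]

print(top_words(psi, vocabulary, n_top=2))
print(hard_assignments(theta))
```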

GPM

Function and class description:

The class is named GPM, while the function itself is named GPM.

The function takes up to 8 arguments: 2 are required and the remaining 6 are optional.

The required arguments are:

corpus - a text file that has been cleaned and loaded into Python. That is, all text should be lowercase, and all punctuation and numbers should have been removed.
nTopics - the number of topics.

The optional arguments are:

alpha, beta and gam - the distribution-specific parameters. (The defaults are alpha = 0.001, beta = 0.001 and gam = 0.1.)
nTopWords - the number of top words per topic. (The default is 10.)
iters - the number of Gibbs sampler iterations. (The default is 15.)
N - a parameter used to normalize the document lengths, which is required for the Poisson model.
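
As background on the "Gamma Poisson" part of the model's name: when a count is Poisson distributed and its rate has a Gamma prior, the marginal distribution of the count is negative binomial. The sketch below checks this numerically using only the standard library. The names shape and rate are illustrative and are not claimed to correspond to the package's alpha, beta or gam; nothing here reproduces the package's own computations.

```python
import math


def gamma_pdf(lam, shape, rate):
    """Density of a Gamma(shape, rate) distribution at lam > 0."""
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(lam)
                    - rate * lam - math.lgamma(shape))


def poisson_pmf(x, lam):
    """Probability of observing count x under Poisson(lam)."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def neg_binomial_pmf(x, shape, rate):
    """Closed-form marginal of x when lam ~ Gamma(shape, rate), x ~ Poisson(lam)."""
    p = rate / (rate + 1.0)
    return math.exp(math.lgamma(x + shape) - math.lgamma(shape)
                    - math.lgamma(x + 1)
                    + shape * math.log(p) + x * math.log(1.0 - p))


def marginal_by_integration(x, shape, rate, upper=40.0, steps=40000):
    """Integrate Poisson(x | lam) * Gamma(lam) over lam with the midpoint rule."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        lam = (i + 0.5) * h
        total += poisson_pmf(x, lam) * gamma_pdf(lam, shape, rate)
    return total * h


# Compare the closed form with the numerical integral for a few counts.
for x in range(5):
    closed = neg_binomial_pmf(x, shape=2.0, rate=1.5)
    numeric = marginal_by_integration(x, shape=2.0, rate=1.5)
    print(x, abs(closed - numeric))
```

The printed differences should be close to zero, limited only by the accuracy of the numerical integration.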

Output:

The function provides several components of output, namely:

psi - topic x word matrix.
theta - document x topic matrix.
topics - the top words per topic.
assignments - the topic numbers of selected topics only, as well as the final topic assignments.
Final k - the final number of selected topics.
coherence - the coherence score, which is a performance measure.
selected_theta
selected_psi

Example Usage:

A more comprehensive tutorial is also available.

Installation:

Run the following command in a terminal or command prompt:

pip install GPyM-TM

Implementation:

Import the package into the relevant Python script with the following:

from GPyM_TM import GSDMM
from GPyM_TM import GPM

Call the class:

Possible examples of calling the GSDMM function are as follows:

data_DMM = GSDMM.DMM(corpus, nTopics)

data_DMM = GSDMM.DMM(corpus, nTopics, alpha = 0.25, beta = 0.15, nTopWords = 12, iters = 5)

Possible examples of calling the GPM function are as follows:

data_GPM = GPM.GPM(corpus, nTopics)

data_GPM = GPM.GPM(corpus, nTopics, alpha = 0.002, beta = 0.03, gam = 0.06, nTopWords = 12, iters = 7, N = 8)

Results:

The output obtained for the Dirichlet multinomial mixture model appears as follows:

The output obtained for the Gamma Poisson mixture model appears as follows:

Built With:

Google Colab - Hosted notebook environment

Python - Programming language of choice

PyPI - Distribution

Authors:

Jocelyn Mazarura

Co-Authors:

I would like to extend a special thank you to my colleagues Alta de Waal and Ricardo Marques. None of this would have been possible without either of you.

Thank you!

License:

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments:

University of Pretoria
