This is a Python implementation of Probabilistic Latent Semantic Analysis (PLSA) using the EM algorithm.
It supports both English and Chinese text.
Run the following command from the command line:
python plsa.py [datasetFilePath] [stopwordsFilePath] [K] [maxIteration] [threshold] [topicWordsNum] [docTopicDisFilePath] [topicWordDisFilePath] [dictionaryFilePath] [topicsFilePath]
For example:
python plsa.py dataset.txt stopwords.dic 10 30 1.0 10 doctopic.txt topicword.txt dictionary.dic topics.txt
Or omit the parameters to use the default values specified in plsa.py:
python plsa.py
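The two invocation styles above (all ten parameters, or none) could be handled roughly as in the sketch below. This is not the code from plsa.py: `parse_params`, `PARAM_NAMES` and `DEFAULTS` are hypothetical names, the default values are simply copied from the example command and may differ from the real defaults, and the all-or-nothing rule is an assumption.

```python
# Hypothetical sketch of positional-parameter handling; not taken from plsa.py.
PARAM_NAMES = [
    "datasetFilePath", "stopwordsFilePath", "K", "maxIteration", "threshold",
    "topicWordsNum", "docTopicDisFilePath", "topicWordDisFilePath",
    "dictionaryFilePath", "topicsFilePath",
]

# Assumption: these values mirror the example command, not the actual defaults.
DEFAULTS = [
    "dataset.txt", "stopwords.dic", "10", "30", "1.0", "10",
    "doctopic.txt", "topicword.txt", "dictionary.dic", "topics.txt",
]


def parse_params(argv):
    """Map a sys.argv-style list to a dict of typed parameters."""
    args = argv[1:]
    if not args:
        # No parameters given: fall back to the defaults.
        args = DEFAULTS
    elif len(args) != len(PARAM_NAMES):
        usage = " ".join("[%s]" % name for name in PARAM_NAMES)
        raise SystemExit("usage: python plsa.py " + usage)
    params = dict(zip(PARAM_NAMES, args))
    # Counts are integers, the convergence threshold is a float.
    for name in ("K", "maxIteration", "topicWordsNum"):
        params[name] = int(params[name])
    params["threshold"] = float(params["threshold"])
    return params


# Example: equivalent to running "python plsa.py" with no parameters.
default_params = parse_params(["plsa.py"])
```

In a script, `parse_params(sys.argv)` would be called once at startup, after `import sys`.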
The parameters have the following meanings:
param | description |
---|---|
datasetFilePath | the file path of dataset |
stopwordsFilePath | the file path of stopwords |
K | the number of topics |
maxIteration | the maximum number of iterations of the EM algorithm |
threshold | the threshold used to judge convergence of the log-likelihood |
topicWordsNum | the number of top words to output for each topic |
docTopicDisFilePath | the file path to output document-topic distribution |
topicWordDisFilePath | the file path to output topic-word distribution |
dictionaryFilePath | the file path to output dictionary |
topicsFilePath | the file path to output top words of each topic |
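To show how `K`, `maxIteration`, `threshold` and `topicWordsNum` fit together, here is a minimal pure-Python sketch of PLSA trained with EM. It is an illustration under stated assumptions rather than the code in plsa.py: the function names are hypothetical, the input is a small dense count matrix, and convergence is assumed to mean that the log-likelihood changes by less than `threshold` between two iterations.

```python
import math
import random


def normalize(row):
    """Scale a list of non-negative numbers so it sums to 1."""
    total = sum(row)
    if total == 0:
        # Empty document or unused topic: fall back to a uniform distribution.
        return [1.0 / len(row)] * len(row)
    return [x / total for x in row]


def plsa(counts, K, max_iteration, threshold, seed=0):
    """counts[d][w] is the number of times word w occurs in document d."""
    rng = random.Random(seed)
    D, W = len(counts), len(counts[0])
    # Random positive initialization of P(z|d) and P(w|z).
    doc_topic = [normalize([rng.random() + 1e-3 for _ in range(K)]) for _ in range(D)]
    topic_word = [normalize([rng.random() + 1e-3 for _ in range(W)]) for _ in range(K)]
    history = []  # log-likelihood at the start of each iteration
    for _ in range(max_iteration):
        new_doc_topic = [[0.0] * K for _ in range(D)]
        new_topic_word = [[0.0] * W for _ in range(K)]
        log_likelihood = 0.0
        for d in range(D):
            for w in range(W):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [doc_topic[d][z] * topic_word[z][w] for z in range(K)]
                p_w_given_d = sum(joint)
                log_likelihood += n * math.log(p_w_given_d)
                for z in range(K):
                    expected = n * joint[z] / p_w_given_d
                    new_doc_topic[d][z] += expected
                    new_topic_word[z][w] += expected
        # M-step: renormalize the expected counts into distributions.
        doc_topic = [normalize(row) for row in new_doc_topic]
        topic_word = [normalize(row) for row in new_topic_word]
        history.append(log_likelihood)
        # Assumed stopping rule: small change in log-likelihood.
        if len(history) > 1 and abs(history[-1] - history[-2]) < threshold:
            break
    return doc_topic, topic_word, history


def top_words(topic_word, id_to_word, top_n):
    """Return the top_n most probable words of each topic."""
    result = []
    for row in topic_word:
        ranked = sorted(range(len(row)), key=lambda w: -row[w])
        result.append([id_to_word[w] for w in ranked[:top_n]])
    return result


# Toy example: 4 documents over a 4-word vocabulary.
counts = [[4, 3, 0, 0], [3, 4, 0, 0], [0, 0, 5, 2], [0, 0, 2, 5]]
doc_topic, topic_word, history = plsa(counts, K=2, max_iteration=50, threshold=1e-6)
topics = top_words(topic_word, ["a", "b", "c", "d"], top_n=2)
```

Because the log-likelihood is summed over all word occurrences, its scale grows with corpus size, so a larger corpus generally calls for a larger `threshold`.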
In the dataset file, each line represents a document.
In the stopwords file, each line represents a stopword.
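The two file formats could be read along the lines of the sketch below, which turns one-document-per-line text and a one-stopword-per-line list into a dictionary and per-document word counts. The names `tokenize` and `build_corpus` are hypothetical, and the regular-expression tokenizer is an assumption that only suits English; Chinese text would need a word segmenter, which is not shown here.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens; English only.
    return re.findall(r"[a-z]+", text.lower())


def build_corpus(doc_lines, stopword_lines):
    """Return (word -> id dictionary, list of per-document Counters of word ids)."""
    stopwords = {line.strip() for line in stopword_lines if line.strip()}
    dictionary = {}
    counts = []
    for line in doc_lines:
        doc_counts = Counter()
        for token in tokenize(line):
            if token in stopwords:
                continue
            # Assign the next free id the first time a word is seen.
            word_id = dictionary.setdefault(token, len(dictionary))
            doc_counts[word_id] += 1
        counts.append(doc_counts)
    return dictionary, counts


# In practice the lines would come from files, e.g. open(path, encoding="utf-8").
docs = ["The cat sat on the mat", "The dog chased the cat"]
stops = ["the", "on"]
dictionary, counts = build_corpus(docs, stops)
```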
The first dataset consists of 16 documents about One Piece from Wikipedia.
The top words of each topic were produced with the following parameters:
python plsa.py dataset1.txt stopwords.dic 10 20 1.0 10 doctopic.txt topicword.txt dictionary.dic topics.txt
The second dataset is 100 documents from the Associated Press.
The top words of each topic were produced with the following parameters:
python plsa.py dataset2.txt stopwords.dic 10 20 50.0 10 doctopic.txt topicword.txt dictionary.dic topics.txt
The third dataset consists of 50 documents from Sina.
The top words of each topic were produced with the following parameters:
python plsa.py dataset3.txt stopwords.dic 30 30 10.0 10 doctopic.txt topicword.txt dictionary.dic topics.txt
- ZhikaiZhang
- Email zhangzhikai@seu.edu.cn
- Blog http://zhikaizhang.cn
- Natural Language Processing: PLSA (自然语言处理之PLSA)