Stack Overflow Topic Clustering

For a summary see the Info Poster.

Stackexchange Data Crawling

https://api.stackexchange.com/docs

After a user authorizes the app, the API returns that user's account id (USER_ID) and an access token (ACCESS_TOKEN). Replace both placeholders with your own values.
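As a rough illustration of how the token is used, the sketch below builds a request URL for the public `/questions` endpoint of the Stack Exchange API. The helper name `questions_url`, the choice of endpoint, the sort order, and the `APP_KEY` placeholder are assumptions made for this example; the actual crawler in `src/webget.js` may request different routes or parameters. As far as the API documentation describes it, an access token is only accepted together with an application key, so the sketch adds both or neither.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(page, access_token=None, key=None, pagesize=100):
    """Build a URL for one page of Stack Overflow questions (hypothetical helper)."""
    params = {
        "site": "stackoverflow",
        "order": "desc",
        "sort": "creation",
        "page": page,
        "pagesize": pagesize,   # the API caps page size at 100
        "filter": "withbody",   # built-in filter that includes the question body
    }
    # Only send credentials when both parts are available.
    if access_token and key:
        params["access_token"] = access_token
        params["key"] = key
    return f"{API_ROOT}/questions?{urlencode(params)}"


# Placeholders, not real credentials.
print(questions_url(1, access_token="ACCESS_TOKEN", key="APP_KEY"))
```

The resulting URL can be fetched with any HTTP client; authenticated requests get a larger daily quota than anonymous ones.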

Install

sudo apt-get install nodejs npm python3 python3-pip
npm install
python3 -m pip install -U nltk
python3
>>> import nltk
>>> nltk.download('stopwords')
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.download('universal_tagset')
>>> nltk.download('wordnet')

Pipeline:

Run every script from the repository root, for example:

JS: node src/webget.js
PY: python3 src/lemmatizing.py

Module	Input	Output
1. webget.js	-	#.json[]
2. merge.js	#.json[]	merge.json, merge-meta.json
3. htmlclean.py	merge.json	htmlcleaned.json, htmlcleaned-meta.json
4. stemming.py	htmlcleaned.json	stemming.json, stemming-meta.json
5. lemmatizing.py	htmlcleaned.json	lemmatizing.json, lemmatizing-meta.json
6. tfidf.py	lemmatizing.json	itf.json, doc-vecs.json
7. ngram.py	ngrams.json	ngram-itf.json, doc-ngrams.json
8. classification		classify.ipynb
9. clustering		cluster.ipynb
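To make step 6 more concrete, here is a minimal TF-IDF sketch in plain Python. It is not taken from `src/tfidf.py`: the function name, the input shape (one token list per document) and the weighting (raw term count times `log(N / df)`) are assumptions chosen for illustration. In terms of the table, `idf` loosely corresponds to the term-level output and `vectors` to the per-document vectors.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (idf, vectors) for a list of token lists (illustrative only)."""
    n_docs = len(docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Inverse document frequency; a term present in every document gets 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    # One sparse vector (term -> weight) per document.
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({term: freq * idf[term] for term, freq in tf.items()})
    return idf, vectors


# Toy input standing in for lemmatized question texts.
docs = [
    ["python", "list", "sort"],
    ["python", "dict", "sort"],
    ["java", "list"],
]
idf, vectors = tfidf_vectors(docs)
print(vectors[2])
```

Rare terms such as `java` receive a higher weight than terms shared by several documents, which is what makes the vectors useful for the later classification and clustering notebooks.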

Stemmer Algorithms Differences:

The three major stemming algorithms in use today are Porter, Snowball (Porter2), and Lancaster (Paice-Husk). Listed in that order, they run from least to most aggressive.

Porter: The most commonly used stemmer and one of the gentlest. It has Java support, which is a plus. It is the most computationally intensive of the three, though not by a significant margin, and it is also the oldest of the three by a wide margin.
Snowball (Porter2): Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself says it is better than his original algorithm. It is slightly faster than Porter and has a fairly large community around it.
Lancaster: A very aggressive stemmer, sometimes to a fault. Porter and Snowball stems are usually fairly intuitive to a reader; Lancaster stems often are not, because many shorter words become totally obfuscated. It is the fastest of the three and shrinks the working vocabulary the most, but it is the wrong tool if you need finer distinctions between words.
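The differences are easiest to judge on your own vocabulary. The sketch below runs the three NLTK stemmers side by side; the helper `compare_stemmers` and the sample words are made up for this example and are not part of the repository. The stemmer classes themselves need no extra NLTK downloads.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# One instance per algorithm discussed above.
STEMMERS = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}


def compare_stemmers(words):
    """Map each word to the stem produced by every stemmer (hypothetical helper)."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in STEMMERS.items()}
        for word in words
    }


# Sample words chosen arbitrarily; swap in terms from your own corpus.
for word, stems in compare_stemmers(["running", "generously", "maximum"]).items():
    print(word, stems)
```

Comparing the printed stems for a few dozen domain terms is usually enough to see whether Lancaster merges words you would rather keep apart.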

Gulp Pipeline Commands

gulp download
gulp convert
python3 src/sentences.py
python3 src/terms.py
python3 src/stemming.py
python3 src/lemming.py
python3 src/ngram.py
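If you prefer a single entry point, the commands can be chained from a small Python script. This is only a sketch: it assumes the steps are meant to run in the listed order, each one once, from the repository root, and the names `PIPELINE` and `run_pipeline` are invented for this example.

```python
import shlex
import subprocess

# Commands copied from the list above, each one once.
PIPELINE = [
    "gulp download",
    "gulp convert",
    "python3 src/sentences.py",
    "python3 src/terms.py",
    "python3 src/stemming.py",
    "python3 src/lemming.py",
    "python3 src/ngram.py",
]


def run_pipeline(commands=PIPELINE, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        print(f"running: {command}")
        # check=True makes subprocess.run raise CalledProcessError on a non-zero exit.
        runner(shlex.split(command), check=True)
```

Calling `run_pipeline()` from the repository root starts the whole chain. The `runner` parameter exists so the function can be dry-run with a stand-in callable instead of launching real processes.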

Repository: glouwa/StackOverflowTopicClustering
