
Open Dataset

Links to awesome open datasets.

Some repos

https://github.com/src-d/awesome-machine-learning-on-source-code

Text

Reuters Corpora (RCV1, RCV2, TRC2)

http://trec.nist.gov/data/reuters/reuters.html

1 Billion Word Language Modeling Benchmark

https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

20 Newsgroups

https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

SemEval 2015 Twitter benchmark

Microsoft Paraphrase Corpus (MSRP)

Penn Treebank (PTB)

https://github.com/yoonkim/lstm-char-cnn/tree/master/data/ptb

Question Answering (QA)

WikiQA

QASent

Chinese Word Segmentation

SIGHAN 2005

http://sighan.cs.uchicago.edu/bakeoff2005/

Source Code

http://learnbigcode.github.io/datasets/

150k Python parsed ASTs

http://www.srl.inf.ethz.ch/py150

Java Variable and Method Naming Dataset and Embeddings

http://groups.inf.ed.ac.uk/cup/naturalize/

Code Fragment Similarity Dataset

http://check.useast.appfog.ctl.io/download

Method Naming Dataset

http://groups.inf.ed.ac.uk/cup/codeattention/

Parallel Django Dataset: line-by-line English annotation

http://ahclab.naist.jp/pseudogen

Stack Overflow

https://archive.org/details/stackexchange

Related Paper:

Summarizing Source Code using a Neural Attention Model

http://www.aclweb.org/anthology/P16-1195

Summarizing Source Code: Python, SQL, C#

https://github.com/sriniiyer/codenn