betterenvi/open-dataset

Open Dataset

Links to awesome open datasets.

Some repos

Text

Reuters Corpora (RCV1, RCV2, TRC2)

http://trec.nist.gov/data/reuters/reuters.html

1 Billion Word Language Modeling Benchmark

https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark

20 newsgroups

https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

SemEval 2015 Twitter benchmark
Microsoft Paraphrase Corpus (MSRP)
Penn Treebank (PTB)

https://github.com/yoonkim/lstm-char-cnn/tree/master/data/ptb

Question Answering (QA)

WikiQA
QASent

Chinese Word Segmentation

SIGHAN 2005

http://sighan.cs.uchicago.edu/bakeoff2005/

Source Code

http://learnbigcode.github.io/datasets/

150k Python parsed ASTs

http://www.srl.inf.ethz.ch/py150

Java Variable and Method Naming Dataset and Embeddings

http://groups.inf.ed.ac.uk/cup/naturalize/

Code Fragment Similarity Dataset

http://check.useast.appfog.ctl.io/download

Method Naming Dataset

http://groups.inf.ed.ac.uk/cup/codeattention/

Parallel Django Dataset: line-by-line English annotation

http://ahclab.naist.jp/pseudogen

Stack Overflow

https://archive.org/details/stackexchange

Related Paper:

Summarizing Source Code using a Neural Attention Model

http://www.aclweb.org/anthology/P16-1195

Summarizing Source Code: Python, SQL, C#

https://github.com/sriniiyer/codenn