Links to awesome open datasets:
http://trec.nist.gov/data/reuters/reuters.html (Reuters-21578 text categorization collection)
https://github.com/ciprian-chelba/1-billion-word-language-modeling-benchmark (One Billion Word language modeling benchmark)
https://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html (20 Newsgroups collection, UCI KDD archive)
https://github.com/yoonkim/lstm-char-cnn/tree/master/data/ptb (Penn Treebank data from Yoon Kim's lstm-char-cnn repo)
http://sighan.cs.uchicago.edu/bakeoff2005/ (SIGHAN 2005 bakeoff, Chinese word segmentation)
http://learnbigcode.github.io/datasets/ (Learn Big Code: curated source code datasets)
http://www.srl.inf.ethz.ch/py150 (Py150: parsed Python programs, ETH Zurich SRL)
http://groups.inf.ed.ac.uk/cup/naturalize/ (Naturalize project data, University of Edinburgh)
http://check.useast.appfog.ctl.io/download
http://groups.inf.ed.ac.uk/cup/codeattention/ (code attention dataset for source code summarization, University of Edinburgh)
http://ahclab.naist.jp/pseudogen (pseudogen: pseudo-code generation from source code, NAIST)
https://archive.org/details/stackexchange (Stack Exchange data dump)
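The Stack Exchange archive above ships as XML files (e.g. Posts.xml) whose records are flat `<row .../>` elements. A minimal sketch of stream-parsing such a file with the standard library, run here on a small in-memory sample; the attribute names (`Id`, `Title`, `ParentId`) follow the public dump schema but should be checked against the actual dump you download:

```python
# Sketch: stream <row> elements from a Stack Exchange-style XML dump
# without loading the whole file into memory. Attribute names are
# assumptions based on the public dump schema.
import xml.etree.ElementTree as ET
from io import StringIO

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Title="How do I sort a list?" Score="5" />
  <row Id="2" PostTypeId="2" ParentId="1" Score="3" />
</posts>
"""

def iter_rows(source):
    """Yield each <row> element's attributes as a dict."""
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            yield dict(elem.attrib)
            elem.clear()  # free the element so memory stays bounded

rows = list(iter_rows(StringIO(SAMPLE)))
print(len(rows))         # 2
print(rows[0]["Title"])  # How do I sort a list?
```

For the real dump, pass a file path (or a file object from the decompressed archive) to `iter_rows` instead of the `StringIO` sample.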
Related paper:
Summarizing Source Code using a Neural Attention Model (ACL 2016)
http://www.aclweb.org/anthology/P16-1195
Covers summarizing source code in Python, SQL, and C#.