eth_py150_open

A redistributable subset of the ETH Py150 corpus [https://www.sri.inf.ethz.ch/py150], introduced in the ICML 2020 paper 'Learning and Evaluating Contextual Embedding of Source Code' [https://proceedings.icml.cc/static/paper_files/icml/2020/5401-Paper.pdf].

This release provides only a manifest that selects a subset of the files in the original ETH Py150 corpus. The manifest was composed by tracking down the license of each included GitHub repository and selecting only the files from repositories with one of the following licenses:

'apache-2.0',
'lgpl-2.1',
'epl-1.0',
'isc',
'bsd-3-clause',
'bsd-2-clause',
'mit',
'gpl-2.0',
'cc0-1.0',
'lgpl-3.0',
'mpl-2.0',
'unlicense',
'gpl-3.0'

We have excluded files from repositories that are no longer publicly available on GitHub, that have licenses not in the list above, that mix licenses, or that apply their license incorrectly.

We provide three JSON-formatted manifest files, one for each split (dev, eval, and train). The dev and train splits together correspond to the train split of the original ETH Py150 corpus, and form a 90-10 train/dev split by file. If no validation split is required, users may combine the dev and train splits into a single train split.

Each manifest is a list of file specifications with the following fields:

filepath: string; the full path (within GitHub) of a file retained from the original ETH Py150 Dataset.
license: string (one of 'apache-2.0', 'lgpl-2.1', 'epl-1.0', 'isc', 'bsd-3-clause', 'bsd-2-clause', 'mit', 'gpl-2.0', 'cc0-1.0', 'lgpl-3.0', 'mpl-2.0', 'unlicense', 'gpl-3.0'); the license under which the file was released on GitHub.
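As a rough illustration of the structure described above, the following sketch loads one manifest, checks that every entry carries one of the listed licenses, and tallies entries by license. It assumes each manifest is a JSON array of objects with `filepath` and `license` keys; the helper names `load_manifest` and `count_by_license` are hypothetical and not part of this release.

```python
import json
from collections import Counter

# The license identifiers listed in this README.
ALLOWED_LICENSES = {
    "apache-2.0", "lgpl-2.1", "epl-1.0", "isc", "bsd-3-clause",
    "bsd-2-clause", "mit", "gpl-2.0", "cc0-1.0", "lgpl-3.0",
    "mpl-2.0", "unlicense", "gpl-3.0",
}


def load_manifest(path):
    """Load a manifest (assumed: a JSON list of {"filepath", "license"} objects)."""
    with open(path, encoding="utf-8") as f:
        entries = json.load(f)
    for entry in entries:
        # Fail early if an entry does not match the documented fields.
        if not isinstance(entry.get("filepath"), str):
            raise ValueError(f"entry without a string filepath: {entry!r}")
        if entry.get("license") not in ALLOWED_LICENSES:
            raise ValueError(f"unexpected license in entry: {entry!r}")
    return entries


def count_by_license(entries):
    """Return a Counter mapping license identifier to number of files."""
    return Counter(entry["license"] for entry in entries)
```

Calling `count_by_license(load_manifest(path))` on a downloaded manifest gives a quick view of how the retained files are distributed across licenses, which may matter if a downstream use has to exclude copyleft-licensed files.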

To use a manifest, download the ETH Py150 corpus from its original location, then retain only the files whose paths are listed in the corresponding manifest we release.
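The filtering step can be sketched as follows. This is a minimal example under stated assumptions, not an official tool: it assumes the downloaded corpus has been extracted under a single root directory, and that each manifest `filepath` is relative to that root. The function name `select_files` and its arguments are hypothetical.

```python
import json
from pathlib import Path


def select_files(corpus_root, manifest_path):
    """Split manifest entries into files found under corpus_root and files missing.

    Assumes each manifest entry's "filepath" is relative to corpus_root.
    """
    with open(manifest_path, encoding="utf-8") as f:
        entries = json.load(f)

    root = Path(corpus_root)
    present, missing = [], []
    for entry in entries:
        candidate = root / entry["filepath"]
        if candidate.is_file():
            present.append(candidate)
        else:
            missing.append(candidate)
    return present, missing
```

Only the paths in `present` would be retained for training or evaluation. A non-empty `missing` list suggests that the extracted directory layout does not match the assumption above, so the root directory may need adjusting.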

The sizes of the three splits are as follows:

| Dataset split | Number of source files |
| --- | --- |
| `dev` | 8302 |
| `train` | 74749 |
| `eval` | 41457 |
