Getting started datasets #717
Comments
Hello, I also think it would be nice to provide an API for downloading the corpora. Is anyone working on this?
Some nice references for how to implement this:
Resolved in #1705.
@menshikh-iv which of the datasets from @macks22 are included? #1705 is rather terse on details. We especially want all the practical, domain-specific corpora: USPTO patents, the Quora QA duplicates dataset, the PubMed corpus… Having a clean, uniform interface for downloading and opening these corpora from Python is already super useful.
The tutorial dataset is rather small and the Wikipedia dataset is rather large. It would be nice to provide some datasets that fall somewhere in between, to help people get started.
We could also provide a simple interface for downloading these datasets, similar to the dataset helpers in scikit-learn. This would simplify the tutorials.
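To make the idea concrete, here is a minimal sketch of what such a helper might look like, loosely modelled on scikit-learn's `fetch_*` functions (for example `fetch_20newsgroups`), which cache downloads in a `data_home` directory. Everything named here is an assumption for illustration only: `fetch_dataset`, `iter_documents`, the `DATASETS` registry, the placeholder URL, and the default cache folder are hypothetical and are not an existing API of this project.

```python
import os
import shutil
import urllib.request

# Hypothetical registry mapping dataset names to download URLs.
# The entry below is a placeholder, not a real endpoint.
DATASETS = {
    "example-medium-corpus": "https://example.org/corpora/example-medium-corpus.txt",
}


def fetch_dataset(name, data_home=None, registry=DATASETS):
    """Download a named dataset once and return the path of the cached copy."""
    if name not in registry:
        raise KeyError(f"unknown dataset: {name!r}")

    # Assumed default cache location; callers can override it via data_home.
    if data_home is None:
        data_home = os.path.join(os.path.expanduser("~"), "example_data")
    os.makedirs(data_home, exist_ok=True)

    url = registry[name]
    target = os.path.join(data_home, os.path.basename(url))

    # Only download when there is no cached copy yet.
    if not os.path.exists(target):
        partial = target + ".part"
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename after a complete download so an interrupted transfer
        # never leaves a truncated file under the final name.
        os.replace(partial, target)
    return target


def iter_documents(path):
    """Yield one whitespace-tokenized document per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            tokens = line.split()
            if tokens:
                yield tokens
```

A tutorial could then start with something like `docs = iter_documents(fetch_dataset("example-medium-corpus"))` instead of several paragraphs of manual download instructions. Streaming documents one at a time keeps memory use flat for the larger corpora.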