Getting started datasets #717

Closed
bartjkdp opened this issue Jun 2, 2016 · 5 comments
Labels
difficulty medium (Medium issue: required good gensim understanding & python skills); feature (Issue described a new feature); wishlist (Feature request)

Comments

@bartjkdp

bartjkdp commented Jun 2, 2016

The tutorial dataset is rather small and the Wikipedia dataset is rather large. It would be nice to provide some datasets that are somewhere in between to help people get started.

We could also provide a convenient interface for downloading these datasets, similar to the dataset loaders in scikit-learn. This would simplify the tutorials.
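
To make the idea concrete, here is a rough sketch of what such a download helper might look like. Everything in it is hypothetical and for illustration only: the `DATASETS` registry, the `example-corpus` entry and its placeholder URL, the `fetch_dataset` name, and the cache directory are assumptions, not an existing gensim or scikit-learn API.

```python
import os
import shutil
import urllib.request

# Hypothetical registry mapping a dataset name to its download URL.
# The single entry below is a placeholder, not a real dataset.
DATASETS = {
    "example-corpus": "https://example.org/example-corpus.txt",
}


def default_data_home():
    """Return the (assumed) default cache directory in the user's home."""
    return os.path.join(os.path.expanduser("~"), "example_data")


def fetch_dataset(name, data_home=None, opener=urllib.request.urlopen):
    """Download dataset `name` once and return the path of the local copy.

    `opener` is any callable that takes a URL and returns a readable
    binary stream usable as a context manager; it is a parameter so the
    function can be exercised without network access.
    """
    if name not in DATASETS:
        raise KeyError("unknown dataset: %r" % name)
    url = DATASETS[name]
    data_home = data_home or default_data_home()
    os.makedirs(data_home, exist_ok=True)
    target = os.path.join(data_home, os.path.basename(url))
    if not os.path.exists(target):
        # Write to a temporary name first so that an interrupted download
        # is never mistaken for a complete file on the next call.
        partial = target + ".part"
        with opener(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(partial, target)
    return target
```

A tutorial could then start with something like `path = fetch_dataset("example-corpus")` and reuse the cached file on later runs.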

@tmylk tmylk added the wishlist Feature request label Jun 29, 2016
@tmylk tmylk added feature Issue described a new feature difficulty medium Medium issue: required good gensim understanding & python skills labels Oct 6, 2016
@robotcator
Contributor

Hello, I also think it would be nice to provide an API for downloading corpora. Is anyone working on this?
Since I often use the text8.zip dataset in my experiments, I wrote a script to download it from the Internet: https://gist.github.com/robotcator/1fb0cdc1437515f5662d33368554f4c8
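
Once the archive is on disk, opening it is the other half of the job. The following is only a sketch (it is not the script from the gist) of how a zipped plain-text corpus could be streamed as fixed-size token chunks without unpacking it. The function name `iter_zipped_tokens`, the member name passed by the caller, and the default chunk length of 10000 tokens are assumptions made for illustration.

```python
import zipfile


def iter_zipped_tokens(zip_path, member, chunk_tokens=10000, read_size=65536):
    """Yield lists of at most `chunk_tokens` whitespace-separated tokens.

    The archive member is read in blocks of `read_size` bytes, so a corpus
    stored as one very long line never has to fit in memory at once.
    """
    tokens = []
    leftover = b""
    with zipfile.ZipFile(zip_path) as archive, archive.open(member) as stream:
        while True:
            block = stream.read(read_size)
            if not block:
                break
            data = leftover + block
            pieces = data.split()
            # If the block ended mid-word, hold the last piece back so it
            # can be joined with the start of the next block.
            leftover = b"" if data[-1:].isspace() else pieces.pop()
            tokens.extend(piece.decode("utf-8") for piece in pieces)
            while len(tokens) >= chunk_tokens:
                yield tokens[:chunk_tokens]
                tokens = tokens[chunk_tokens:]
    if leftover:
        tokens.append(leftover.decode("utf-8"))
    if tokens:
        yield tokens
```

Usage would be along the lines of `for sentence in iter_zipped_tokens("text8.zip", "text8"): ...`, assuming the archive contains a single member named `text8`.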

@macks22
Contributor

macks22 commented Jun 8, 2017

Some nice references for how to implement:

@menshikh-iv
Contributor

Resolved in #1705.

@piskvorky
Owner

piskvorky commented Nov 14, 2017

@menshikh-iv which of the datasets from @macks22 are there? #1705 is rather terse on information.

We especially want all the practical, domain-specific corpora: USPTO patents, the Quora QA duplicates dataset, the PubMed corpus…

Having a clean, uniform interface for downloading and opening these corpora from Python is already super useful.
