Getting started datasets #717

bartjkdp · 2016-06-02T20:06:35Z

The tutorial dataset is rather small and the Wikipedia dataset is rather large. It would be nice to provide some datasets that fall somewhere in between, to help people get started.

We could also provide a simple interface for downloading these datasets, similar to the dataset helpers in scikit-learn. This would simplify the tutorials.
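To make the idea concrete, here is a minimal sketch of what such a download helper could look like. The function name `fetch_dataset`, the `data_home` parameter, and the default cache directory are all hypothetical (loosely modelled on the scikit-learn convention) and are not an existing gensim API.

```python
import os
import urllib.request
from pathlib import Path


def fetch_dataset(name, url, data_home=None):
    """Download `url` once and return the local path of the cached file.

    `name`, `data_home` and the default cache directory are illustrative
    assumptions, not part of any existing library.
    """
    root = Path(data_home or os.path.join(Path.home(), "example_data"))
    root.mkdir(parents=True, exist_ok=True)
    target = root / name
    if not target.exists():
        # Download under a temporary name first, so an interrupted transfer
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        urllib.request.urlretrieve(url, str(partial))
        os.replace(partial, target)
    return target
```

A tutorial could then call `fetch_dataset("some_corpus.zip", url)` once at the top and work with the returned path; repeated runs reuse the cached copy instead of downloading again.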

robotcator · 2016-11-01T11:49:40Z

Hello, I also think it would be good to provide an API for downloading corpora. Is anyone working on this?
Since I often use the text8.zip dataset in my experiments, I wrote a script to download it from the Internet: https://gist.github.com/robotcator/1fb0cdc1437515f5662d33368554f4c8
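Once the archive is on disk, the corpus can be streamed without unpacking it. The following is only a sketch, not the linked script: it assumes the archive holds a single member named `text8` whose tokens are separated by single spaces, and the function name and the `max_words` and `chunk_size` parameters are made up for illustration.

```python
import zipfile


def iter_text8_sentences(zip_path, member="text8", max_words=1000, chunk_size=1 << 16):
    """Yield lists of at most `max_words` tokens from a zipped, space-separated corpus."""
    with zipfile.ZipFile(zip_path) as archive, archive.open(member) as stream:
        words, rest = [], b""
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # Split on spaces; the last piece may be a partial word,
            # so carry it over to the next chunk.
            pieces = (rest + chunk).split(b" ")
            rest = pieces.pop()
            words.extend(p.decode("utf-8") for p in pieces if p)
            while len(words) >= max_words:
                yield words[:max_words]
                words = words[max_words:]
        # Flush the carried-over word and any remaining tokens.
        if rest:
            words.append(rest.decode("utf-8"))
        if words:
            yield words
```

Reading in fixed-size chunks keeps memory use bounded even though the whole corpus is a single line of text.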

macks22 · 2017-06-05T15:53:10Z

Some good candidate datasets are:

macks22 · 2017-06-08T14:53:33Z

Some nice references for how to implement:

menshikh-iv · 2017-11-14T08:33:20Z

Resolved in #1705.

piskvorky · 2017-11-14T12:26:08Z

@menshikh-iv which of the datasets from @macks22 are there? #1705 is rather terse on information.

We especially want all the practical, domain-specific corpora: USPTO patents, the Quora QA duplicates dataset, the PubMed corpus…

Having a clean, uniform interface for downloading and opening these corpora from Python is already super useful.

piskvorky assigned tmylk Jun 2, 2016

tmylk added the wishlist label (Feature request) Jun 29, 2016

tmylk added the feature (Issue described a new feature) and difficulty medium (Medium issue: required good gensim understanding & python skills) labels Oct 6, 2016

macks22 mentioned this issue Jun 4, 2017

Add TextDirectoryCorpus that yields one doc per file recursively read from directory #1387

Closed

macks22 mentioned this issue Jun 28, 2017

Data/Model storage #1453

Closed

menshikh-iv closed this as completed Nov 14, 2017

menshikh-iv unassigned tmylk Nov 14, 2017

menshikh-iv mentioned this issue Nov 14, 2017

Add more datasets/models to gensim-data #1717

Open

10 tasks
