
Use a bigger dataset to train the classifier #3745

Closed

smola opened this issue Jul 31, 2017 · 7 comments

smola (Contributor) commented Jul 31, 2017

The classifier that is built from samples in the samples directory can be improved a lot by training on a bigger dataset. However, it is sometimes difficult to get enough samples that are released under a liberal license.

I would like to propose building a separate project for the dataset, where each file retains its own license and authorship attribution and the dataset itself is released under a database license such as the ODbL 1.0.

As far as I know, this would work because the dataset project wouldn't constitute a derived work, at least under GPLv3 (see the aggregate clause: "A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an 'aggregate' if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.")

There would still be the question of whether the model built from this dataset (samples.json) would constitute a derived work of all its parts. I guess that, in order to settle this, we would need some legal advice. Maybe samples.json would also need to be distributed separately, but I think this is a detail that could be worked out if you think the proposal is worth it.

diegocrzt commented

This is a really interesting idea, and I wanted to ask: what exactly are the limitations or problems with including code that is already open source? Or why couldn't we just use sources like Rosetta Code? (Disclaimer: I am not an expert in licensing, as you may see.)

lildude (Member) commented Aug 1, 2017

This is a really interesting idea, and I wanted to ask: what exactly are the limitations or problems with including code that is already open source?

Two-fold: licensing and performance. See #2117 for more details.

smola (Contributor, Author) commented Aug 20, 2018

I'm trying this with a generated dataset in a separate repository: https://github.com/smola/language-dataset
So far, testing with leave-one-out cross-validation is giving very good results.
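
As an illustration of the evaluation approach only (not code from either repository), a minimal leave-one-out cross-validation sketch could look like the following, using a toy scikit-learn pipeline in place of the real classifier:

```python
# Minimal leave-one-out cross-validation sketch for a language classifier.
# The toy snippets and the scikit-learn pipeline are illustrative assumptions;
# this is not the classifier used by Linguist or by smola/language-dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy dataset: file contents paired with their language labels.
texts = [
    "def add(a, b):\n    return a + b\n",
    "function add(a, b) { return a + b; }\n",
    "puts 1 + 2\n",
    "console.log(1 + 2);\n",
    "print(1 + 2)\n",
    "p 1 + 2\n",
]
labels = ["Python", "JavaScript", "Ruby", "JavaScript", "Python", "Ruby"]

# Character n-gram features feeding a naive Bayes classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)

# Each sample is held out once and predicted by a model trained on the rest.
scores = cross_val_score(clf, texts, labels, cv=LeaveOneOut())
print("leave-one-out accuracy: %.3f" % scores.mean())
```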

Licensing is tricky, but it could be solved; see this repo for an example: https://github.com/smola/language-dataset
The key points would be:

  • Use a separate repository.
  • License the dataset with a database license such as the ODbL: https://opendatacommons.org/licenses/odbl/
  • Ensure that each of the included files has a license that allows redistribution.
  • Add an index file with a link to the original source and the license for each sample (see the sketch after this list).
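
For illustration only, one entry in such an index might look like the sketch below. The file name (index.json) and the field names are assumptions for this example, not the actual schema used by smola/language-dataset.

```python
# Hypothetical index entry for one sample; the schema shown here is an
# assumption for illustration, not the format used by smola/language-dataset.
import json

index = [
    {
        "path": "samples/Python/fizzbuzz.py",          # where the sample lives in the dataset repo
        "language": "Python",                           # label used for training
        "source": "https://example.com/some/original",  # hypothetical link back to the original file
        "license": "MIT",                               # license of the original; must allow redistribution
    },
]

# Write the index next to the samples so each file's origin and license travel with it.
with open("index.json", "w") as fh:
    json.dump(index, fh, indent=2)
```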

smola closed this as completed Aug 20, 2018
smola reopened this Aug 20, 2018
pchaigno (Contributor) commented

So far, testing with leave-one-out cross-validation is giving very good results.

Very interested in the numbers when you have them. In particular, I'd love to see how many samples we need to add for a given improvement in accuracy (to gauge how much effort we'd have to put into performance).

smola (Contributor, Author) commented Aug 21, 2018

This will still take some time. I have to get the dataset cleaned up to ensure that all samples are assigned to the right language. I'll post the benchmark once the dataset is accurate enough.

stale bot commented Nov 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale bot added the Stale label Nov 6, 2018
stale bot commented Nov 20, 2018

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

stale bot closed this as completed Nov 20, 2018
github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024