
Use a bigger dataset to train the classifier #3745

Closed

smola opened this issue Jul 31, 2017 · 7 comments

smola (Contributor) commented Jul 31, 2017

The classifier that is built from samples in the samples directory can be improved a lot by training on a bigger dataset. However, it is sometimes difficult to get enough samples that are released under a liberal license.

I would like to propose building a separate project for the dataset, where each file retains its own license and authorship attribution and the dataset itself is released under a database license such as the ODbL 1.0.

As far as I know, this would work because the dataset project wouldn't constitute a derived work, at least under GPLv3 (see the aggregate clause: "A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an 'aggregate' if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.")

There would still be the question of whether the model built from this dataset (samples.json) would constitute a derived work of all its parts. I guess that, in order to settle this, we would need some legal advice. Maybe samples.json would also need to be distributed separately, but I think this is a detail that could be worked out if you think the proposal is worth it.

diegocrzt commented

This is a really interesting idea, and I wanted to ask: what exactly are the limitations or problems with including code that is already open source? Or why couldn't we just use sources like Rosetta Code? (Disclaimer: I am not an expert in licensing, as you may see.)

lildude (Member) commented Aug 1, 2017

This is a really interesting idea, and I wanted to ask: what exactly are the limitations or problems with including code that is already open source?

Two-fold: licensing and performance. See #2117 for more details.

smola (Contributor, Author) commented Aug 20, 2018

I'm trying this with a generated dataset in a separate repository: https://github.com/smola/language-dataset
So far, testing with leave-one-out cross-validation is giving very good results.
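
As an illustration of the evaluation approach only (not code from either repository), a minimal leave-one-out cross-validation sketch could look like the following, using a toy scikit-learn pipeline in place of the real classifier:

```python
# Minimal leave-one-out cross-validation sketch for a language classifier.
# The toy snippets and the scikit-learn pipeline are illustrative assumptions;
# this is not the classifier used by Linguist or by smola/language-dataset.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy dataset: file contents paired with their language labels.
texts = [
    "def add(a, b):\n    return a + b\n",
    "function add(a, b) { return a + b; }\n",
    "puts 1 + 2\n",
    "console.log(1 + 2);\n",
    "print(1 + 2)\n",
    "p 1 + 2\n",
]
labels = ["Python", "JavaScript", "Ruby", "JavaScript", "Python", "Ruby"]

# Character n-gram features feeding a naive Bayes classifier.
clf = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3)),
    MultinomialNB(),
)

# Each sample is held out once and predicted by a model trained on the rest.
scores = cross_val_score(clf, texts, labels, cv=LeaveOneOut())
print("leave-one-out accuracy: %.3f" % scores.mean())
```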

Licensing is tricky, but it could be solved; see this repo for an example: https://github.com/smola/language-dataset
The key points would be:

  • Use a separate repository.
  • License the dataset with a database license such as the ODbL: https://opendatacommons.org/licenses/odbl/
  • Ensure that each of the included files has a license that allows redistribution.
  • Add an index file with a link to the original source and the license for each sample (see the sketch after this list).
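
For illustration only, one entry in such an index might look like the sketch below. The file name (index.json) and the field names are assumptions for this example, not the actual schema used by smola/language-dataset.

```python
# Hypothetical index entry for one sample; the schema shown here is an
# assumption for illustration, not the format used by smola/language-dataset.
import json

index = [
    {
        "path": "samples/Python/fizzbuzz.py",          # where the sample lives in the dataset repo
        "language": "Python",                           # label used for training
        "source": "https://example.com/some/original",  # hypothetical link back to the original file
        "license": "MIT",                               # license of the original; must allow redistribution
    },
]

# Write the index next to the samples so each file's origin and license travel with it.
with open("index.json", "w") as fh:
    json.dump(index, fh, indent=2)
```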

smola closed this as completed Aug 20, 2018
smola reopened this Aug 20, 2018
pchaigno (Contributor) commented

So far, testing with leave-one-out cross-validation is giving very good results.

Very interested in the numbers when you have them. In particular, I'd love to see how many samples we need to add for a given improvement in accuracy (to gauge how much effort we'd have to put into performance).

smola (Contributor, Author) commented Aug 21, 2018

This will still take some time. I have to get the dataset cleaned up to ensure that all samples are assigned to the right language. I'll post the benchmark once the dataset is accurate enough.

stale bot commented Nov 6, 2018

This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.

stale bot added the Stale label Nov 6, 2018
stale bot commented Nov 20, 2018

This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.

stale bot closed this as completed Nov 20, 2018
github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024