Use a bigger dataset to train the classifier #3745
Comments
This is a really interesting idea, and I wanted to ask: what exactly are the limitations or problems with including code that is already open sourced? Why couldn't we just use sources like Rosetta Code? (Disclaimer: I am not an expert in licensing, as you may see.)
Two-fold: licenses and performance. See #2117 for more details.
I'm trying this with a generated dataset in a separate repository: https://github.com/smola/language-dataset. Licensing is tricky, but it could be solved; that repository shows one approach.
Very interested in the numbers when you have them. In particular, I'd love to see how many samples we need to add for a given improvement in accuracy (to gauge how much effort we'd have to spend on performance).
This will still take some time. I have to get the dataset cleaned up to ensure that all samples are assigned to the right language. I'll post the benchmark once the dataset is accurate enough.
This issue has been automatically marked as stale because it has not had activity in a long time. If this issue is still relevant and should remain open, please reply with a short explanation (e.g. "I have checked the code and this issue is still relevant because ___."). Thank you for your contributions.
This issue has been automatically closed because it has not had activity in a long time. Please feel free to reopen it or create a new issue.
The classifier that is built from samples in the `samples` directory can be improved a lot by training on a bigger dataset. However, it is sometimes difficult to get enough samples that are released under a liberal license.

I would like to propose building a separate project for the dataset, where each file retains its own license and authorship attribution, and the dataset itself is released under a database license such as the ODbL 1.0.

As far as I know, this would work because the dataset project wouldn't constitute a derived work, at least under GPLv3 (see:

> A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an “aggregate” if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate.

)

There would still be the question of whether the model built from this dataset (`samples.json`) would constitute a derived work of all its parts. I guess that, in order to settle this, we would need some legal advice. Maybe `samples.json` would need to be distributed separately, but I think this is a detail that could be worked out if you think the proposal is worth it.
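For readers unfamiliar with how a sample-trained classifier of this kind works, here is a minimal sketch in Ruby. It is **not** Linguist's actual implementation; the class name, tokenizer regex, and toy samples are all invented for illustration. It shows a token-based naive Bayes classifier, the general technique: every training sample added for a language shifts its token frequencies, which is why a bigger, well-labeled dataset can improve accuracy.

```ruby
# Illustrative sketch only -- not Linguist's real classifier.
# A naive Bayes language classifier over source-code tokens,
# with add-one (Laplace) smoothing for unseen tokens.
class TinyClassifier
  def initialize
    @token_counts  = Hash.new { |h, k| h[k] = Hash.new(0) } # lang => token => count
    @lang_totals   = Hash.new(0)                            # lang => total token count
    @sample_counts = Hash.new(0)                            # lang => number of samples
  end

  # Record token frequencies for one training sample.
  def train(language, source)
    tokenize(source).each do |tok|
      @token_counts[language][tok] += 1
      @lang_totals[language] += 1
    end
    @sample_counts[language] += 1
  end

  # Return the language with the highest log-probability score.
  def classify(source)
    tokens = tokenize(source)
    total_samples = @sample_counts.values.sum.to_f
    scores = @token_counts.keys.map do |lang|
      vocab = @token_counts[lang].size + 1
      # log prior ...
      score = Math.log(@sample_counts[lang] / total_samples)
      # ... plus smoothed log likelihood of each token
      tokens.each do |tok|
        score += Math.log((@token_counts[lang][tok] + 1.0) / (@lang_totals[lang] + vocab))
      end
      [lang, score]
    end
    scores.max_by { |_, s| s }.first
  end

  private

  # Crude tokenizer: identifiers plus a few punctuation characters.
  def tokenize(source)
    source.scan(/[A-Za-z_][A-Za-z0-9_]*|[{}()\[\];#<>=]/)
  end
end

clf = TinyClassifier.new
clf.train("Ruby",   "def hello\n  puts 'hi'\nend")
clf.train("Ruby",   "class Foo\n  attr_reader :bar\nend")
clf.train("Python", "def hello():\n    print('hi')")
clf.train("Python", "import os\nclass Foo:\n    pass")

puts clf.classify("require 'json'\ndef parse(x)\n  x\nend") # => Ruby
```

With only two toy samples per language the decision hinges on a handful of tokens (here, `end` tips it toward Ruby); adding more samples per language makes the per-token estimates less noisy, which is the motivation for the bigger dataset proposed above.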