corpus data in tests-tatcorpus #34

Open
jonorthwash opened this issue Feb 1, 2019 · 11 comments

@jonorthwash
Member

It's not clear to me that the corpus data in tests-tatcorpus/ should stay there. Things I worry about:

  • Licensing of the data: where is it from, and is the license compatible with the rest of the repository?
  • Size: the data is pretty big, whereas most of the rest of the repo is not.
  • Relevance: this repo is for the analyser (and tests to maintain it, offshoots, experiments, etc.), not large corpus data.

@mansayk, could you make an argument to justify keeping the tests in the repository?

@jonorthwash jonorthwash added the question Further information is requested label Feb 1, 2019
@mansayk
Member

mansayk commented Feb 1, 2019

Hi!

The tests-tatcorpus directory contains just a list of word forms that I collected from the Corpus of Written Tatar. They are not taken from any dictionary.

This list can be used to:

  1. see the effect of code changes;
  2. collect words unknown to the analyser and add them to the .lexc file.

I'd like to keep it there so that Ilnar can use it too. If you think it is better to remove it from the repository, I will do so immediately.

Thank you!
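
Both uses listed above can be sketched in a few lines. This is a minimal sketch, not the project's actual tooling: it assumes the analyser output follows the usual Apertium stream format (`^surface/analysis$`, with unknown words shown as `^surface/*surface$`), and the function names and sample strings are invented for illustration.

```python
import re
from typing import Dict, List, Tuple

# Assumption: output is in Apertium stream format, ^surface/analysis1/analysis2$,
# and an unknown word is emitted as ^surface/*surface$.
# Escaped characters inside lexical units are not handled in this sketch.
LEXICAL_UNIT = re.compile(r"\^([^/^$]+)/([^$]*)\$")


def parse_analyses(output: str) -> Dict[str, List[str]]:
    """Map each surface form to the list of analyses it received."""
    return {m.group(1): m.group(2).split("/") for m in LEXICAL_UNIT.finditer(output)}


def unknown_forms(analyses: Dict[str, List[str]]) -> List[str]:
    """Return the surface forms whose every reading is marked unknown ('*')."""
    return sorted(
        form
        for form, readings in analyses.items()
        if all(reading.startswith("*") for reading in readings)
    )


def changed_forms(
    before: Dict[str, List[str]], after: Dict[str, List[str]]
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Return forms present in both runs whose analyses differ."""
    return {
        form: (before[form], after[form])
        for form in before
        if form in after and before[form] != after[form]
    }


# Invented sample data, not real analyser output.
old_run = "^алма/алма<n><nom>$ ^китапка/*китапка$"
new_run = "^алма/алма<n><nom>$ ^китапка/китап<n><dat>$ ^фыв/*фыв$"

before = parse_analyses(old_run)
after = parse_analyses(new_run)
print(unknown_forms(after))          # candidates for the .lexc file
print(changed_forms(before, after))  # effect of a code change
```

Keeping the two runs as plain dictionaries means the comparison needs only the analyser output before and after a change, not the corpus itself.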

@jonorthwash
Member Author

@IlnarSelimcan, @ftyers, what do you two think? I think using something like this for regression testing is good, but I still have the licensing concern (maybe less than originally) and the size concern.

@mansayk
Member

mansayk commented Feb 1, 2019

The size can be halved, because not all of those files are necessary: one of them is just a backup, and another can be generated.

@TinoDidriksen
Member

This repo is already 360 MiB in size. It's not enough that you delete a file - it's still part of the cloned data. Anything you add is part of the repo's history forever. Those big files should be removed and purged from history with a rewrite.

Of the 145 repos I track, it's in the top 15 size-wise.
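
To see why deleting a file does not shrink a clone, it helps to list the largest blobs reachable from any commit in the history. The sketch below parses the output of `git rev-list --objects --all` piped into `git cat-file --batch-check`; the function names and the sample paths are hypothetical, and this is not necessarily how the repository was actually inspected.

```python
import subprocess
from typing import List, Tuple

# Format string for `git cat-file --batch-check`: type, hash, size, then the
# path that `git rev-list --objects` printed after the hash.
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def largest_blobs(batch_check_output: str, top: int = 10) -> List[Tuple[int, str]]:
    """Return the `top` largest blobs as (size_in_bytes, path), biggest first."""
    blobs = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 3)
        # Commits and trees are skipped; only file contents (blobs) count here.
        if len(parts) == 4 and parts[0] == "blob":
            blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)[:top]


def history_blobs(repo: str) -> str:
    """Run git to describe every object reachable from any ref in `repo`."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, encoding="utf-8",
    ).stdout
    return subprocess.run(
        ["git", "-C", repo, "cat-file", f"--batch-check={BATCH_FORMAT}"],
        input=objects, check=True, capture_output=True, encoding="utf-8",
    ).stdout


# Hypothetical sample in the same shape as the git output.
sample = (
    "commit aaa 250 \n"
    "blob bbb 1000 tests-tatcorpus/big.txt\n"
    "blob ccc 20 README\n"
    "tree ddd 64 tests-tatcorpus\n"
)
print(largest_blobs(sample, top=1))
```

In a real checkout, `largest_blobs(history_blobs("."))` would include files that were deleted in later commits, because they are still reachable from older ones.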

@mansayk
Member

mansayk commented Feb 1, 2019

OK, I understand. I will remove those files right now; please help me purge them from the history.

@mansayk
Member

mansayk commented Feb 1, 2019

I removed the files, but I don't know how to purge them from the repo's history. @TinoDidriksen, could you please help me with that?

@jonorthwash
Member Author

jonorthwash commented Feb 1, 2019

@mansayk, which files are you planning on keeping / didn't remove?

@TinoDidriksen
Member

Repository trimmed - now down to 54 MiB, which is manageable. Everyone will have to re-clone from scratch. I've taken a backup of the repo before doing the trim, just in case.

@mansayk
Member

mansayk commented Feb 2, 2019

@TinoDidriksen, thank you so much for your help!

@jonorthwash, I will keep those test files locally and use them periodically. If I find any regressions, I will create issues and add some new rules to the existing tests, OK? If you have a better idea, please let me know. Thank you.

@IlnarSelimcan
Member

I think I've found a better solution for this in 6dbcb19. It seems to work, but improvements are welcome.

@IlnarSelimcan IlnarSelimcan reopened this Feb 18, 2019
@IlnarSelimcan
Member

IlnarSelimcan commented Feb 18, 2019

One particular thing that should be done is to split the frequency list into many smaller lists and pass them through tat-morph in parallel (using GNU Parallel or a similar tool).
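
A rough sketch of that idea without GNU Parallel, using only the Python standard library. It assumes the frequency list has one entry per line and that the analyser reads standard input and writes standard output; `split_lines`, `run_chunk`, and `analyse_in_parallel` are hypothetical names, and the command is passed in rather than hard-coded because the exact tat-morph invocation is not given in this thread.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


def split_lines(lines: Sequence[str], parts: int) -> List[List[str]]:
    """Split `lines` into at most `parts` contiguous chunks of near-equal size."""
    parts = max(1, min(parts, len(lines) or 1))
    size, extra = divmod(len(lines), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(lines[start:end]))
        start = end
    return chunks


def run_chunk(command: Sequence[str], chunk: Sequence[str]) -> str:
    """Feed one chunk to the command on stdin and return its stdout."""
    text = "".join(line + "\n" for line in chunk)
    result = subprocess.run(
        list(command), input=text, capture_output=True, encoding="utf-8", check=True
    )
    return result.stdout


def analyse_in_parallel(command: Sequence[str], lines: Sequence[str], jobs: int = 4) -> str:
    """Run the command on each chunk concurrently; output keeps the input order."""
    chunks = split_lines(lines, jobs)
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return "".join(pool.map(lambda chunk: run_chunk(command, chunk), chunks))


# Stand-in command for demonstration: upper-cases its input.
demo_command = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
print(analyse_in_parallel(demo_command, ["a", "b", "c", "d", "e"], jobs=2), end="")
```

For the real analyser, the command might look like `["apertium", "-d", ".", "tat-morph"]`; that invocation is an assumption and should be checked against the repository's modes.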

IlnarSelimcan added a commit that referenced this issue Feb 18, 2019