Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adds mtedx valid and test data #136

Merged
merged 4 commits into from
Mar 5, 2021
Merged

Conversation

esalesky
Copy link
Contributor

Adds the valid and test data for Multilingual TEDx

@martinpopel
Copy link
Collaborator

  • Most other sacrebleu testsets contain a version or year in their name. Is mtedx and Multilingual TEDx so unique name? Isn't SLR100 a better name?
  • Don't you want link http://openslr.org/100 from the description field?
  • Don't you want link https://arxiv.org/abs/2102.01757 BibTeX from the citation field?
  • Is the provided githubusercontent.com link a permanent URL? GitHub is for versioning, but sacrebleu users expect to get exactly the same result with thesame sacrebleu command (and the same version of sacrebleu). Have you considered making a GitHub release or at least to tag that commit? I saw you uploaded the data also to https://osf.io/, but I am not sure if ofs.io provides a permanent URL.

As for the failing Travis-CI check, you can ignore it for now, it is obviously not related to this PR (I created a new issue #137 for it).

@esalesky
Copy link
Contributor Author

  • we are using mTEDx as the short identifier for the corpus in code and the paper; this is what I'd (hopefully) expect people to look for here
  • the github urls should be permanent; the github repo was made solely to provide permanent links for the test sets for sacrebleu. I did look into other sources (e.g. osf.io) for permanent links for just the evaluation set text files but osf for example masks the download urls for the underlying files. I added a v1.0 tag to the commit to be safe; we do not anticipate multiple releases for the evaluation data at this time though
  • I added the openslr link to the description
  • I added the arxiv citation

Copy link
Collaborator

@martinpopel martinpopel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing this test set.

@martinpopel
Copy link
Collaborator

I cannot merge this via GitHub because of Travis failing (#137). I could perhaps merge it via command line.
@mjpost, @ozancaglayan: should we drop support for Python3.5 (as a side effect we could use f-strings).

@ozancaglayan
Copy link
Collaborator

ozancaglayan commented Feb 15, 2021 via email

@mjpost
Copy link
Owner

mjpost commented Feb 15, 2021

I'd love to drop 3.5, but think we should wait for the 2.0 release. We could do that soon, though—ideally we'd add things like bootstrap resampling, #124, and maybe #125. I'd also like to output a JSON-formatted doc string, instead of the very verbose, hard-to-process one sacrebleu prints now.

@ozancaglayan
Copy link
Collaborator

Yes sure, #125 kind of stalled, would like to go back to it when I have some time.

@ozancaglayan
Copy link
Collaborator

Since we've fixed the 3.5 issue temporarily in master, this can now be merged after being rebased if you like.

@ozancaglayan ozancaglayan merged commit b7adb84 into mjpost:master Mar 5, 2021
@ozancaglayan
Copy link
Collaborator

I just gave this a try and apparently the md5 hashes of the underlying tarballs have changed for both test and valid. @esalesky

$ sacrebleu -t mtedx/test -l fr-en
sacreBLEU: Downloading https://raw.githubusercontent.com/esalesky/mtedx-eval/main/test.tar.gz to /home/ozan/.sacrebleu/mtedx/test/test.tar.gz
sacreBLEU: Fatal: MD5 sum of downloaded file was incorrect (got fa4cb1548c210ec424d7d6bc9a3675a7, expected ac79c5a5ab1b615eecd6dfbc163c0588).
sacreBLEU: Please manually delete "/home/ozan/.sacrebleu/mtedx/test/test.tar.gz" and rerun the command.
sacreBLEU: If the problem persists, the tarball may have changed, in which case, please contact the SacreBLEU maintainer.

@esalesky
Copy link
Contributor Author

esalesky commented Mar 5, 2021

Hi -- the hashes of the tar.gz files didn't change, but, were incorrect in the dataset.py file. I'm not sure now how that happened/escaped testing, but, below are the correct md5 hashes. When I clean my downloads and try it this works. I've updated the commit in my fork, whatever is best/easiest to update them works for me.

test:
fa4cb1548c210ec424d7d6bc9a3675a7
valid:
40618171614c50e6cbb5e5bbceee0635

@ozancaglayan
Copy link
Collaborator

Okay thanks! I'll now create a PR

@esalesky
Copy link
Contributor Author

esalesky commented Mar 5, 2021

Awesome, thank you! And thank you for catching that!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants