Remove sentences containing foreign-language proper nouns / last names #3786
Conversation
Are there already recordings of these sentences? Have you cross-checked with the data in the latest release of Common Voice?
I may be wrong (we need to check with @mozgzh), but I think that if you remove sentences that already have recordings, then we will lose access to those recordings. I doubt that all 687k are recorded, so it might be better just to remove the ones that don't yet have recordings; that's what I mean by cross-checking with the latest release. Btw, I definitely agree that it is a good idea to clean this stuff up. The use case for having a model that can do ASR on spoken Wikipedia is pretty limited.
I see. I'll check for that. Still, even if some already have recordings associated:
Regarding WER: Common Voice does not train models, it only releases data, so we have no way of knowing -- aside from reports from the community -- whether the WER would go up or not. So an actionable step would be to remove the ones without recordings, and then look for a solution for not presenting the ones that already have recordings to new users. You should also probably get in touch with other members of the French community in Common Voice. There is a Matrix channel for them.
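The cross-check discussed above (remove only candidates without recordings, preserve the rest) can be sketched as a small script. This is a minimal sketch, not the PR's actual tooling: it assumes the Common Voice release TSVs (e.g. `validated.tsv`, `other.tsv`) expose a `sentence` column; check the file names and columns against the actual release you download.

```python
import csv

def recorded_sentences(*tsv_paths):
    """Collect the set of sentences that already have at least one clip.

    Assumes Common Voice release TSVs with a 'sentence' column
    (adjust paths/column names to the actual release).
    """
    recorded = set()
    for path in tsv_paths:
        with open(path, newline="", encoding="utf-8") as fh:
            for row in csv.DictReader(fh, delimiter="\t"):
                recorded.add(row["sentence"].strip())
    return recorded

def split_candidates(candidates, recorded):
    """Partition removal candidates: sentences with recordings must be
    kept, the rest are safe to delete."""
    keep = [s for s in candidates if s in recorded]
    remove = [s for s in candidates if s not in recorded]
    return keep, remove
```

The `remove` list can then be written one sentence per line, which matches the format requested for the Sentence Collector cleanup.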
I will let @ftyers review this; however, I have one comment: removing the sentences from the sentence-collector.txt file won't have any long-lasting effect. It will just get re-exported with the next automatic export. Therefore, could you revert the changes to that file?
Once this PR is ready to go, I would love to have a txt file with one sentence per line to remove from Sentence Collector. Then I can remove those from the DB.
Regarding usage of the 687k sentences: 144,899 recordings of 135,537 unique sentences (most of them having only one recording)
Ok, so the majority can be removed. Those 135,537 should be kept.
Updated the patch: rebased + various improvements + not removing sentences that already have a recording + not altering sentence-collector.txt. Statistics summary:
¹ As an example: "Cette espèce est endémique du Queensland en Australie" ("This species is endemic to Queensland in Australia") has 629 occurrences (and there are many others)
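Highly repeated templated sentences like the one above can be spotted by counting exact occurrences. This is a hypothetical illustration of that counting step, not the PR's actual filtering logic:

```python
from collections import Counter

def frequent_duplicates(sentences, threshold=2):
    """Return {sentence: count} for sentences occurring at least
    `threshold` times, after trimming surrounding whitespace.

    Illustrative only: real cleanup would also consider
    near-duplicates produced by templated text.
    """
    counts = Counter(s.strip() for s in sentences)
    return {s: n for s, n in counts.items() if n >= threshold}
```

Running this over the 687k-sentence corpus would surface the "endémique du Queensland" family and similar template-generated clusters for manual review.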
Hi. I'm nobody but a random passerby, but I wanted to help and to highlight that this proposition from @drzraf (#3786 (comment)) seems to answer this change request, as written in "not altering sentence-collector.txt but created issue3785_proper-nouns-deletion.txt instead". To be validated by the right people 😸
Average line-removal rate per text origin:
Specifically, that file can therefore now be removed from the PR.
Done. But then it should have been the same for
That was meant to be fixed sentences that had previously been removed from the Sentence Collector. However, that was probably only the fix for the encoding and not an actual review of the sentences themselves. So basically, to fix an encoding issue, we deleted sentences in Sentence Collector and then used a PR to re-add fixed versions of them, as they had already been approved before. In your case, that file was a log of sentences to delete, therefore not "fixed sentences". Does that make sense? In any case, if that file contains buggy sentences, then that needs to be dealt with separately.
Could we please get this one handled? |
He did, and we agreed a cleanup would really benefit our next model. I trust the CV team to choose which mods are acceptable and which are not, but I'm positive WER can only go down if we remove mistaken data. I cannot spend the energy required to build a comparison model just to show the impact of the current mods on WER, so I would really appreciate it if some (or all) of those mods made it into the next CV release so that I can leverage them in my next training session. 🥰
ping? Could this be merged? |
Why wasn't this merged, and why has it been stuck seemingly forever? This PR was discussed and explained at great length (on the forum and here), reviewed, then edited accordingly and made as good as possible, and it's likely to decrease the WER. Could the person who takes responsibility for merging/rejecting these changes (or letting them lag) either merge or comment? Thank you.
@moz-rotimib? (Seemingly the last human to have committed to this repo in recent months.)
Hi @drzraf, can you confirm that this does not remove any sentences that have recordings in v13? Then I think we can go ahead and merge it. @MichaelKohler thoughts? Apologies for the long wait!
- The sentences to be removed from the sentence collector have been attached to PR common-voice#3786 instead of being actually removed from `sentence-collector.txt`
- Sentences having at least one existing recording (as of `cv-13.0-2023-03-09`) have been preserved.
Number of recordings per sentence:
Average removal per text origin:
Force-pushed. Attached here is the new version of the sentences to be removed manually from the Sentence Collector.
Dears, I strongly believe that due consideration of contributors' PRs is key to the success of an OSS project, and I'm worried about the project's future when I see that even such a PR requires so much effort to get in (6 months and 22 comments/pings/reminders/...!)
Hello @drzraf! The length of time you've had to wait isn't acceptable. I'm Jess, the technology community manager who has recently started on the Common Voice project, largely to help make sure that we're not leaving our technical contributors and dataset users to wait for ages like we've done with you. You have my apologies, and my task for April is to get through the backlog of PRs, issues and support queries here on GitHub and across 3 other platforms. While things may still feel a little slow while I'm addressing the backlog (and learning how everything works!), handling PRs, issues and support queries should feel much easier in the near future. If you do find that things are stuck, there's also a designated person 👋 around part time to help unstick things. I do apologize again for the experience you've had. It's my goal to make sure that you (and everyone else!) don't have to deal with long waits like this in the future. If you have any additional feedback, you can chat to me here, at jessicar@mozillafoundation.org, or on Discourse or Matrix.
The issue is about responsibility: is there someone who feels responsible for the FR sentence dataset's quality and who has merge permission? I'm asking because there is a lot to do to continue improving this dataset, but that won't be possible without clear workflows and clearly identified reviewer(s)/merger(s).
To spare this issue from getting longer with a lot of "meta-discussion" and "pings", could you please, @jessicarose, tackle this via direct contact with @MichaelKohler and/or any person(s) responsible for reviewing and merging community PRs in this project, and then let us know here the final decision regarding this and future contributions? Thank you
Thanks so much for waiting; we wanted to get a few behind-the-scenes issues with the Sentence Collector (common-voice/sentence-collector#668) ironed out before accepting this PR, to make sure that there wouldn't be any recurring challenges with target sentences being re-imported. Who holds responsibility for different project areas is a very fair question to ask! At the moment, I'm going to be the primary point of contact with merge permissions for PRs across the different MCV repos, including FR datasets (though Gina is involved in this domain area). This may evolve as more team members onboard, but right now I'll merge this with my thanks and start chipping away at the backlog to prevent similar blockages.
@upstream: This PR implies a good share of CPU for text-processing, and rebasing (forced by the weekly Automatic Sentence Collector Export) takes eons to proceed.