
Remove sentences containing foreign-language proper nouns / last names #3786

Merged: 1 commit, Apr 12, 2023

Conversation

drzraf
Contributor

drzraf commented Sep 7, 2022

Type of Pull Request

@upstream: This PR involved a good share of CPU for text processing, and rebasing (forced by the weekly Automatic Sentence Collector Export) takes ages.

@drzraf requested a review from a team as a code owner September 7, 2022 14:11
@drzraf requested review from data-sync-user and removed request for a team September 7, 2022 14:11
@ftyers
Collaborator

ftyers commented Sep 7, 2022

Are there already recordings of these sentences? Have you cross-checked with the data in the latest release of Common Voice?

@drzraf
Contributor Author

drzraf commented Sep 7, 2022

  • Sure, there are already one or more recordings of many of these ~687k sentences.
  • What do you mean by "cross-checked with the data in the latest release of Common Voice"? Cross-checked what?

@ftyers
Collaborator

ftyers commented Sep 7, 2022

I may be wrong (we need to check with @mozgzh), but I think that if you remove sentences that already have recordings, we will lose access to those recordings. I doubt that all 687k are recorded, so it might be better to remove only the ones that don't yet have recordings; that's what I meant by cross-checking with the last release. By the way, I definitely agree that it's a good idea to clean this stuff up. The use case for a model that can do ASR on spoken Wikipedia is pretty limited.

@drzraf
Contributor Author

drzraf commented Sep 7, 2022

I see. I'll check for that (a cross-check sketch is below). Still, even if some of them already have recordings associated:

  • Offering them to new contributors is not good anyway.
  • If the WER goes up, that would confirm that removing these sentences is appropriate even if recordings already exist.
  • If it goes down (which would be really strange), then removing only the unrecorded ones could indeed be a good and safe first step.
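
A minimal sketch of what that cross-check could look like, assuming the standard Common Voice release layout where validated.tsv has one row per clip and a "sentence" column; all file paths here are placeholders:

    # Hedged sketch: split removal candidates by whether they already have
    # a recording in the latest Common Voice release. Paths are placeholders.
    import csv

    def recorded_sentences(validated_tsv):
        # Every sentence that appears in the release's validated.tsv.
        with open(validated_tsv, newline="", encoding="utf-8") as f:
            return {row["sentence"] for row in csv.DictReader(f, delimiter="\t")}

    recorded = recorded_sentences("cv-corpus/fr/validated.tsv")
    with open("candidates_for_removal.txt", encoding="utf-8") as f:
        candidates = {line.strip() for line in f if line.strip()}

    already_recorded = candidates & recorded   # must be preserved
    safe_to_remove = candidates - recorded     # no clip depends on these
    print(f"{len(already_recorded)} recorded, {len(safe_to_remove)} safe to remove")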

@ftyers
Collaborator

ftyers commented Sep 7, 2022

Regarding WER: Common Voice does not train models, it only releases data, so we have no way of knowing, aside from reports from the community, whether the WER would go up or not.

So an actionable step would be to remove the ones without recordings, and then look for a way to avoid presenting the already-recorded ones to new users. You should also probably get in touch with other members of the French community in Common Voice. There is a Matrix channel for them.

Member

@MichaelKohler left a comment


I will let @ftyers review this, but I have one comment: removing the sentences from the sentence-collector.txt file won't have any long-lasting effect; it will just get re-exported with the next automatic export. Therefore, could you revert the changes to that file?

Once this PR is ready to go, I would love to have a txt file with one sentence per line to remove from Sentence Collector. Then I can remove those from the DB.

@drzraf force-pushed the fix-3785 branch 2 times, most recently from 769bd77 to 84e9cdf on September 7, 2022 18:53
@drzraf
Contributor Author

drzraf commented Sep 7, 2022

Regarding usage of the 687k sentences: 144.899 recordings of 135.537 unique sentences (most of them having only one recording). Distribution of recordings per sentence (a derivation sketch follows the histogram):

    1.0000 -     1.0000 [129436]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.0000 -     2.0000 [  3327]: ∎
    2.0000 -     3.0000 [  2352]: ∎
    3.0000 -     4.0000 [   395]: 
    4.0000 -     5.0000 [     9]: 
    5.0000 -     6.0000 [     6]: 
    6.0000 -     7.0000 [     6]: 
    7.0000 -     8.0000 [     4]: 
    8.0000 -     9.0000 [     2]: 
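
A distribution like this can be derived from the release metadata. A rough sketch, assuming validated.tsv has one row per clip and a "sentence" column (the path is a placeholder):

    # Rough sketch: recordings-per-sentence distribution from a CV release.
    import csv
    from collections import Counter

    with open("cv-corpus/fr/validated.tsv", newline="", encoding="utf-8") as f:
        clips_per_sentence = Counter(
            row["sentence"] for row in csv.DictReader(f, delimiter="\t"))

    buckets = Counter(clips_per_sentence.values())  # clip count -> sentence count
    for n_clips, n_sentences in sorted(buckets.items()):
        print(f"{n_clips:>4} recording(s): {n_sentences:>7} sentences")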

@ftyers
Collaborator

ftyers commented Sep 12, 2022

OK, so the majority can be removed. Those 135.537 recorded ones should be kept.

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

Updated the patch: rebased, various improvements, no longer removing sentences that have a recording, and no longer altering sentence-collector.txt (created issue3785_proper-nouns-deletion.txt instead).

Statistics summary:

  • 273.205 tokens identified for removal
  • ... matching 663.925 unique sentences (708.340 accounting for duplicates) ¹
  • But 139.534 of them intersected with the 491.052 (unique) sentences having a recording (see histogram ²)
  • That made up 142.383 recorded sentences to preserve and 524.391 unique, unrecorded sentences to remove (562.092 actual sentences removed, accounting for duplicates); a sketch of this split follows the histogram

¹ As an example: "Cette espèce est endémique du Queensland en Australie" ("This species is endemic to Queensland, Australia") has 629 occurrences, and there are many others.
² Sentences initially queued for removal but for which one or more recordings exist.

    1.0000 -     1.0000 [136794]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.0000 -     2.0000 [  2674]: ∎
    2.0000 -     3.0000 [    35]: 
    3.0000 -     4.0000 [    22]: 
    4.0000 -     6.0000 [     9]:
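
The split described above, match on a proper-noun token and then preserve anything recorded, could look roughly like this (all file names are hypothetical):

    # Hypothetical sketch of the preserve/remove split: a sentence matching a
    # proper-noun token is removed only if no recording exists for it.
    import re

    token_set = {line.strip() for line in
                 open("proper_noun_tokens.txt", encoding="utf-8") if line.strip()}
    word = re.compile(r"\w+", re.UNICODE)

    def has_banned_token(sentence):
        return any(w in token_set for w in word.findall(sentence))

    corpus = [l.rstrip("\n") for l in open("wiki-fr.txt", encoding="utf-8")]
    recorded = {line.strip() for line in
                open("recorded_sentences.txt", encoding="utf-8")}

    matched = {s for s in corpus if has_banned_token(s)}
    to_remove = matched - recorded   # unrecorded: safe to drop
    kept = [s for s in corpus if s not in to_remove]  # drops every duplicate too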

@CapitainFlam
Contributor

CapitainFlam commented Sep 13, 2022

> I will let @ftyers review this, but I have one comment: removing the sentences from the sentence-collector.txt file won't have any long-lasting effect; it will just get re-exported with the next automatic export. Therefore, could you revert the changes to that file?
>
> Once this PR is ready to go, I would love to have a txt file with one sentence per line to remove from Sentence Collector. Then I can remove those from the DB.

Hi. I'm nobody, just a random passer-by.

But I wanted to help, and to point out that this proposition from @drzraf (#3786 (comment)) seems to answer this change request, as written in "not altering sentence-collector.txt but created issue3785_proper-nouns-deletion.txt instead".

To be validated by the right people 😸

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

Average line-removal rate per text origin (a computation sketch follows):

  • vfvv: 0.65%
  • Gutenberg: 3.55%
  • libretheatre: 5.32%
  • assemblee-nationale: would have been around 5%, but mostly unchanged because most of those sentences had existing recordings
  • wiki: 26.40%
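
One way such per-origin rates could be computed, assuming one text file per origin before and after the cleanup (the directory layout here is a guess):

    # Guessed layout: one file per text origin, before and after the cleanup.
    def line_count(path):
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    for origin in ("vfvv", "gutenberg", "libretheatre",
                   "assemblee-nationale", "wiki"):
        before = line_count(f"before/{origin}.txt")
        after = line_count(f"after/{origin}.txt")
        print(f"{origin}: {100 * (before - after) / before:.2f}% removed")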

@MichaelKohler
Member

Specifically for server/data/fr/issue3785_proper-nouns-deletion.txt: if we keep this file, the service will try to import these sentences on every start. However, I'm attaching it here as an attachment for later, when we delete the sentences from Sentence Collector.

Therefore that file can now be removed from the PR.

to-be-deleted-sentence-collector.txt

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

> Therefore that file can now be removed from the PR.

Done.

But then the same should apply to issue2259_deleted_export_readd_fixed.txt, which essentially contains a lot of buggy sentences, shouldn't it?

@MichaelKohler
Member

> But then the same should apply to issue2259_deleted_export_readd_fixed.txt, which essentially contains a lot of buggy sentences, shouldn't it?

That file was meant to contain fixed sentences that had previously been removed from the Sentence Collector. However, that was probably only a fix for the encoding, not an actual review of the sentences themselves. So basically, to fix an encoding issue, we deleted sentences in Sentence Collector and then used a PR to re-add fixed versions of them, as they had already been approved before.

In your case, the file was a log of sentences to delete, therefore not "fixed sentences". Does that make sense?

In any case, if that file contains buggy sentences, then that needs to be dealt with separately.

@drzraf
Contributor Author

drzraf commented Sep 20, 2022

Could we please get this one handled?

@wasertech

> Regarding WER: Common Voice does not train models, it only releases data, so we have no way of knowing, aside from reports from the community, whether the WER would go up or not.
>
> You should also probably get in touch with other members of the French community in Common Voice.

He did, and we agreed a cleanup would really benefit our next model. I trust the CV team to choose which modifications are acceptable and which are not, but I'm positive the WER can only go down if we remove mistaken data.

I cannot spend the energy required to train a comparison model just to show the impact of the current modifications on the WER, so I would really appreciate it if some (or all) of those modifications made it into the next CV release, so that I can leverage them in my next training session. 🥰

@drzraf
Contributor Author

drzraf commented Oct 17, 2022

Ping? Could this be merged?

@drzraf
Contributor Author

drzraf commented Mar 5, 2023

Why hasn't this been merged, and why has it been seemingly stuck forever? This PR was discussed and explained at great length (on the forum and here), reviewed, then edited accordingly and made as good as possible, and it is likely to decrease the WER. Could the person responsible for merging/rejecting such PRs (or letting them lag) either merge or comment? Thank you.

@MichaelKohler / @ftyers

@drzraf
Contributor Author

drzraf commented Mar 24, 2023

@moz-rotimib? (Seemingly the last human to have committed to this repo in recent months.)

@ftyers
Collaborator

ftyers commented Mar 27, 2023

Hi @drzraf, can you confirm that this does not remove any sentences that have recordings in v13? Then we can go ahead and merge it, I think. @MichaelKohler, thoughts? Apologies for the long wait!

- The sentences to be removed from the sentence collector have been attached to PR common-voice#3786 instead of being actually removed from `sentence-collector.txt`
- Sentences having at least one existing recording (as of `cv-13.0-2023-03-09`) have been preserved.
@drzraf
Contributor Author

drzraf commented Mar 28, 2023

  • 273.205 tokens identified for removal
  • ... matching 677.861 unique sentences
  • ... 160.153 of these intersected with the 541.529 (unique) sentences having a recording (see histogram)
  • That made up 160.153 recorded sentences to preserve and 509.175 unique, unrecorded sentences to remove (545.767 actual sentences removed from the repository accounting for duplicates; 534.534, the diff size, omitting the change to sentence-collector.txt)

Number of recordings per sentence:

    1.0000 -     1.0000 [ 25436]: ∎∎∎∎
    1.0000 -     2.0000 [461089]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    2.0000 -     3.0000 [ 17497]: ∎∎
    3.0000 -     4.0000 [ 29132]: ∎∎∎∎
    4.0000 -     5.0000 [  5708]: 
    5.0000 -     6.0000 [   729]: 
    6.0000 -     7.0000 [   820]: 
    7.0000 -     8.0000 [   445]: 
    8.0000 -  1492.0000 [   673]: 

wiki-* files still account for 97.8% of the diff.

Average removal rate per text origin:

  • vfvv: basically no change
  • Gutenberg: 3.68%
  • libretheatre: 5.62%
  • assemblee-nationale: basically no change
  • wiki: 34.40%

Force-pushed.

Attached here is the new version of the list of sentences to be removed manually from the Sentence Collector:
to-be-deleted-sentence-collector.txt

@drzraf
Contributor Author

drzraf commented Mar 30, 2023

Dear all,
with all due respect, I won't rebase this refreshed PR again if it happens to languish for another x months (until after a new CV 14 dataset is released).

I strongly believe that giving due consideration to contributors' PRs is key to the success of an OSS project, and I worry about this project's future when I see that even a PR like this one requires so much effort to get in (6 months and 22 comments/pings/reminders!).

@jessicarose
Collaborator

Hello @drzraf! The length of time you've had to wait isn't acceptable. I'm Jess, the technology community manager who has just recently started on the Common Voice project, largely to help make sure that we're not leaving our technical contributors and dataset users to wait for ages like we've done with you.

You have my apologies, and my task for April is to get through the backlog of PRs, issues and support queries here on GitHub and across 3 other platforms. While things may still feel a little slow while I'm addressing the backlog (and learning how everything works!), handling PRs, issues and support queries should get much easier in the near future. If you do find that things are stuck, there's also a designated person 👋 around part time to help unstick things.

I do apologize again for the experience you've had. It's my goal to make sure that you (and everyone else!) don't have to deal with long waits like this in the future. If you have any additional feedback, you can chat with me here, at jessicar@mozillafoundation.org, or on Discourse or Matrix.

@drzraf
Contributor Author

drzraf commented Apr 3, 2023

The issue is about responsibility: is there someone who feels responsible for the quality of the FR sentence dataset and has merge permission?
If not, what commitment can be expected from merge-enabled users, and what level of responsiveness can be assumed?

I'm asking because there is a lot to do to continue improving this dataset, but that won't be possible without clear workflows and clearly identified reviewer(s)/merger(s).

@drzraf
Contributor Author

drzraf commented Apr 10, 2023

To keep this issue from growing even longer with lots of "meta-discussion" and "pings", could you please, @jessicarose, tackle it via direct contact with @MichaelKohler and/or whoever is responsible for reviewing and merging community PRs in this project, and then let us know here the final decision regarding this and future contributions?

Thank you

@jessicarose
Collaborator

Thanks so much for waiting. We wanted to get a few behind-the-scenes issues with the Sentence Collector (common-voice/sentence-collector#668) ironed out before accepting this PR, to make sure there wouldn't be any recurring challenges with target sentences being re-imported.

Who holds responsibility for different project areas is a very fair question to ask! At the moment, I'm going to be the primary point of contact with merge permissions for PRs across the different MCV repos, including the FR datasets (though Gina is involved in this domain area). This may evolve as more team members onboard, but right now I'll merge this with my thanks and start chipping away at the backlog to prevent similar blockages.
