
Remove sentences containing foreign-language proper nouns / last names #3786

Merged: 1 commit, Apr 12, 2023

Conversation

drzraf
Contributor

drzraf commented Sep 7, 2022

Type of Pull Request

@upstream: This PR involved a good share of CPU for text processing, and rebasing (forced by the weekly Automatic Sentence Collector Export) takes ages.

@drzraf requested a review from a team as a code owner September 7, 2022 14:11
@drzraf requested review from data-sync-user and removed request for a team September 7, 2022 14:11
@ftyers
Collaborator

ftyers commented Sep 7, 2022

Are there already recordings of these sentences? Have you cross-checked with the data in the latest release of Common Voice?

@drzraf
Contributor Author

drzraf commented Sep 7, 2022

  • Sure, there are already one or more recordings of many of these ~687k sentences.
  • What do you mean by "cross-checked with the data in the latest release of Common Voice"? Cross-checked what?

@ftyers
Collaborator

ftyers commented Sep 7, 2022

I may be wrong (we need to check with @mozgzh), but I think that if you remove sentences that already have recordings, we will lose access to those recordings. I doubt that all 687k are recorded, so it might be better to remove only the ones that don't yet have recordings; that's what I meant by cross-checking with the last release. By the way, I definitely agree that it's a good idea to clean this stuff up. The use case for a model that can do ASR on spoken Wikipedia is pretty limited.

@drzraf
Contributor Author

drzraf commented Sep 7, 2022

I see. I'll check for that (a cross-check sketch is below). Still, even if some of them already have recordings associated:

  • Offering them to new contributors is not good anyway.
  • If the WER goes up, that would confirm that removing these sentences is appropriate even if recordings already exist.
  • If it goes down (which would be really strange), then removing only the unrecorded ones could indeed be a good and safe first step.
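
A minimal sketch of what that cross-check could look like, assuming the standard Common Voice release layout where validated.tsv has one row per clip and a "sentence" column; all file paths here are placeholders:

    # Hedged sketch: split removal candidates by whether they already have
    # a recording in the latest Common Voice release. Paths are placeholders.
    import csv

    def recorded_sentences(validated_tsv):
        # Every sentence that appears in the release's validated.tsv.
        with open(validated_tsv, newline="", encoding="utf-8") as f:
            return {row["sentence"] for row in csv.DictReader(f, delimiter="\t")}

    recorded = recorded_sentences("cv-corpus/fr/validated.tsv")
    with open("candidates_for_removal.txt", encoding="utf-8") as f:
        candidates = {line.strip() for line in f if line.strip()}

    already_recorded = candidates & recorded   # must be preserved
    safe_to_remove = candidates - recorded     # no clip depends on these
    print(f"{len(already_recorded)} recorded, {len(safe_to_remove)} safe to remove")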

@ftyers
Collaborator

ftyers commented Sep 7, 2022

Regarding WER: Common Voice does not train models, it only releases data, so we have no way of knowing, aside from reports from the community, whether the WER would go up or not.

So an actionable step would be to remove the ones without recordings, and then look for a way to avoid presenting the already-recorded ones to new users. You should also probably get in touch with other members of the French community in Common Voice. There is a Matrix channel for them.

Member

@MichaelKohler left a comment


I will let @ftyers review this, but I have one comment: removing the sentences from the sentence-collector.txt file won't have any long-lasting effect; it will just get re-exported with the next automatic export. Therefore, could you revert the changes to that file?

Once this PR is ready to go, I would love to have a txt file with one sentence per line to remove from Sentence Collector. Then I can remove those from the DB.

@drzraf force-pushed the fix-3785 branch 2 times, most recently from 769bd77 to 84e9cdf on September 7, 2022 18:53
@drzraf
Contributor Author

drzraf commented Sep 7, 2022

Regarding usage of the 687k sentences: 144.899 recordings of 135.537 unique sentences (most of them having only one recording). Distribution of recordings per sentence (a derivation sketch follows the histogram):

    1.0000 -     1.0000 [129436]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.0000 -     2.0000 [  3327]: ∎
    2.0000 -     3.0000 [  2352]: ∎
    3.0000 -     4.0000 [   395]: 
    4.0000 -     5.0000 [     9]: 
    5.0000 -     6.0000 [     6]: 
    6.0000 -     7.0000 [     6]: 
    7.0000 -     8.0000 [     4]: 
    8.0000 -     9.0000 [     2]: 
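
A distribution like this can be derived from the release metadata. A rough sketch, assuming validated.tsv has one row per clip and a "sentence" column (the path is a placeholder):

    # Rough sketch: recordings-per-sentence distribution from a CV release.
    import csv
    from collections import Counter

    with open("cv-corpus/fr/validated.tsv", newline="", encoding="utf-8") as f:
        clips_per_sentence = Counter(
            row["sentence"] for row in csv.DictReader(f, delimiter="\t"))

    buckets = Counter(clips_per_sentence.values())  # clip count -> sentence count
    for n_clips, n_sentences in sorted(buckets.items()):
        print(f"{n_clips:>4} recording(s): {n_sentences:>7} sentences")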

@ftyers
Collaborator

ftyers commented Sep 12, 2022

OK, so the majority can be removed. Those 135.537 recorded ones should be kept.

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

Updated the patch: rebased, various improvements, no longer removing sentences that have a recording, and no longer altering sentence-collector.txt (created issue3785_proper-nouns-deletion.txt instead).

Statistics summary:

  • 273.205 tokens identified for removal
  • ... matching 663.925 unique sentences (708.340 accounting for duplicates) ¹
  • But 139.534 of them intersected with the 491.052 (unique) sentences having a recording (see histogram ²)
  • That made up 142.383 recorded sentences to preserve and 524.391 unique, unrecorded sentences to remove (562.092 actual sentences removed, accounting for duplicates); a sketch of this split follows the histogram

¹ As an example: "Cette espèce est endémique du Queensland en Australie" ("This species is endemic to Queensland, Australia") has 629 occurrences, and there are many others.
² Sentences initially queued for removal but for which one or more recordings exist.

    1.0000 -     1.0000 [136794]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    1.0000 -     2.0000 [  2674]: ∎
    2.0000 -     3.0000 [    35]: 
    3.0000 -     4.0000 [    22]: 
    4.0000 -     6.0000 [     9]:
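
The split described above, match on a proper-noun token and then preserve anything recorded, could look roughly like this (all file names are hypothetical):

    # Hypothetical sketch of the preserve/remove split: a sentence matching a
    # proper-noun token is removed only if no recording exists for it.
    import re

    token_set = {line.strip() for line in
                 open("proper_noun_tokens.txt", encoding="utf-8") if line.strip()}
    word = re.compile(r"\w+", re.UNICODE)

    def has_banned_token(sentence):
        return any(w in token_set for w in word.findall(sentence))

    corpus = [l.rstrip("\n") for l in open("wiki-fr.txt", encoding="utf-8")]
    recorded = {line.strip() for line in
                open("recorded_sentences.txt", encoding="utf-8")}

    matched = {s for s in corpus if has_banned_token(s)}
    to_remove = matched - recorded   # unrecorded: safe to drop
    kept = [s for s in corpus if s not in to_remove]  # drops every duplicate too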

@CapitainFlam
Contributor

CapitainFlam commented Sep 13, 2022

> I will let @ftyers review this, but I have one comment: removing the sentences from the sentence-collector.txt file won't have any long-lasting effect; it will just get re-exported with the next automatic export. Therefore, could you revert the changes to that file?
>
> Once this PR is ready to go, I would love to have a txt file with one sentence per line to remove from Sentence Collector. Then I can remove those from the DB.

Hi. I'm nobody, just a random passer-by.

But I wanted to help, and to point out that this proposition from @drzraf (#3786 (comment)) seems to answer this change request, as written in "not altering sentence-collector.txt but created issue3785_proper-nouns-deletion.txt instead".

To be validated by the right people 😸

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

Average line-removal rate per text origin (a computation sketch follows):

  • vfvv: 0.65%
  • Gutenberg: 3.55%
  • libretheatre: 5.32%
  • assemblee-nationale: would have been around 5%, but mostly unchanged because most of those sentences had existing recordings
  • wiki: 26.40%
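
One way such per-origin rates could be computed, assuming one text file per origin before and after the cleanup (the directory layout here is a guess):

    # Guessed layout: one file per text origin, before and after the cleanup.
    def line_count(path):
        with open(path, encoding="utf-8") as f:
            return sum(1 for _ in f)

    for origin in ("vfvv", "gutenberg", "libretheatre",
                   "assemblee-nationale", "wiki"):
        before = line_count(f"before/{origin}.txt")
        after = line_count(f"after/{origin}.txt")
        print(f"{origin}: {100 * (before - after) / before:.2f}% removed")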

@MichaelKohler
Member

Specifically for server/data/fr/issue3785_proper-nouns-deletion.txt: if we keep this file, the service will try to import these sentences on every start. However, I'm attaching it here as an attachment for later, when we delete the sentences from Sentence Collector.

Therefore that file can now be removed from the PR.

to-be-deleted-sentence-collector.txt

@drzraf
Contributor Author

drzraf commented Sep 13, 2022

> Therefore that file can now be removed from the PR.

Done.

But then the same should apply to issue2259_deleted_export_readd_fixed.txt, which essentially contains a lot of buggy sentences, shouldn't it?

@MichaelKohler
Member

> But then the same should apply to issue2259_deleted_export_readd_fixed.txt, which essentially contains a lot of buggy sentences, shouldn't it?

That file was meant to contain fixed sentences that had previously been removed from the Sentence Collector. However, that was probably only a fix for the encoding, not an actual review of the sentences themselves. So basically, to fix an encoding issue, we deleted sentences in Sentence Collector and then used a PR to re-add fixed versions of them, as they had already been approved before.

In your case, the file was a log of sentences to delete, therefore not "fixed sentences". Does that make sense?

In any case, if that file contains buggy sentences, then that needs to be dealt with separately.

@drzraf
Contributor Author

drzraf commented Sep 20, 2022

Could we please get this one handled?

@wasertech

> Regarding WER: Common Voice does not train models, it only releases data, so we have no way of knowing, aside from reports from the community, whether the WER would go up or not.
>
> You should also probably get in touch with other members of the French community in Common Voice.

He did, and we agreed a cleanup would really benefit our next model. I trust the CV team to choose which modifications are acceptable and which are not, but I'm positive the WER can only go down if we remove mistaken data.

I cannot spend the energy required to train a comparison model just to show the impact of the current modifications on the WER, so I would really appreciate it if some (or all) of those modifications made it into the next CV release, so that I can leverage them in my next training session. 🥰

@drzraf
Contributor Author

drzraf commented Oct 17, 2022

Ping? Could this be merged?

@drzraf
Contributor Author

drzraf commented Mar 5, 2023

Why hasn't this been merged, and why has it been seemingly stuck forever? This PR was discussed and explained at great length (on the forum and here), reviewed, then edited accordingly and made as good as possible, and it is likely to decrease the WER. Could the person responsible for merging/rejecting such PRs (or letting them lag) either merge or comment? Thank you.

@MichaelKohler / @ftyers

@drzraf
Contributor Author

drzraf commented Mar 24, 2023

@moz-rotimib? (Seemingly the last human to have committed to this repo in recent months.)

@ftyers
Collaborator

ftyers commented Mar 27, 2023

Hi @drzraf, can you confirm that this does not remove any sentences that have recordings in v13? Then we can go ahead and merge it, I think. @MichaelKohler, thoughts? Apologies for the long wait!

- The sentences to be removed from the sentence collector have been attached to PR common-voice#3786 instead of being actually removed from `sentence-collector.txt`
- Sentences having at least one existing recording (as of `cv-13.0-2023-03-09`) have been preserved.
@drzraf
Contributor Author

drzraf commented Mar 28, 2023

  • 273.205 tokens identified for removal
  • ... matching 677.861 unique sentences
  • ... 160.153 of these intersected with the 541.529 (unique) sentences having a recording (see histogram)
  • That made up 160.153 recorded sentences to preserve and 509.175 unique, unrecorded sentences to remove (545.767 actual sentences removed from the repository accounting for duplicates; 534.534, the diff size, omitting the change to sentence-collector.txt)

Number of recordings per sentence:

    1.0000 -     1.0000 [ 25436]: ∎∎∎∎
    1.0000 -     2.0000 [461089]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
    2.0000 -     3.0000 [ 17497]: ∎∎
    3.0000 -     4.0000 [ 29132]: ∎∎∎∎
    4.0000 -     5.0000 [  5708]: 
    5.0000 -     6.0000 [   729]: 
    6.0000 -     7.0000 [   820]: 
    7.0000 -     8.0000 [   445]: 
    8.0000 -  1492.0000 [   673]: 

wiki-* files still account for 97.8% of the diff.

Average removal rate per text origin:

  • vfvv: basically no change
  • Gutenberg: 3.68%
  • libretheatre: 5.62%
  • assemblee-nationale: basically no change
  • wiki: 34.40%

Force-pushed.

Attached here is the new version of the list of sentences to be removed manually from the Sentence Collector:
to-be-deleted-sentence-collector.txt

@drzraf
Contributor Author

drzraf commented Mar 30, 2023

Dear all,
with all due respect, I won't rebase this refreshed PR again if it happens to languish for another x months (until after a new CV 14 dataset is released).

I strongly believe that giving due consideration to contributors' PRs is key to the success of an OSS project, and I worry about this project's future when I see that even a PR like this one requires so much effort to get in (6 months and 22 comments/pings/reminders!).

@jessicarose
Collaborator

Hello @drzraf! The length of time you've had to wait isn't acceptable. I'm Jess, the technology community manager who has just recently started on the Common Voice project, largely to help make sure that we're not leaving our technical contributors and dataset users to wait for ages like we've done with you.

You have my apologies, and my task for April is to get through the backlog of PRs, issues and support queries here on GitHub and across 3 other platforms. While things may still feel a little slow while I'm addressing the backlog (and learning how everything works!), handling PRs, issues and support queries should get much easier in the near future. If you do find that things are stuck, there's also a designated person 👋 around part time to help unstick things.

I do apologize again for the experience you've had. It's my goal to make sure that you (and everyone else!) don't have to deal with long waits like this in the future. If you have any additional feedback, you can chat with me here, at jessicar@mozillafoundation.org, or on Discourse or Matrix.

@drzraf
Contributor Author

drzraf commented Apr 3, 2023

The issue is about responsibility: is there someone who feels responsible for the quality of the FR sentence dataset and has merge permission?
If not, what commitment can be expected from merge-enabled users, and what level of responsiveness can be assumed?

I'm asking because there is a lot to do to continue improving this dataset, but that won't be possible without clear workflows and clearly identified reviewer(s)/merger(s).

@drzraf
Contributor Author

drzraf commented Apr 10, 2023

To keep this issue from growing even longer with lots of "meta-discussion" and "pings", could you please, @jessicarose, tackle it via direct contact with @MichaelKohler and/or whoever is responsible for reviewing and merging community PRs in this project, and then let us know here the final decision regarding this and future contributions?

Thank you

@jessicarose
Collaborator

Thanks so much for waiting. We wanted to get a few behind-the-scenes issues with the Sentence Collector (common-voice/sentence-collector#668) ironed out before accepting this PR, to make sure there wouldn't be any recurring challenges with target sentences being re-imported.

Who holds responsibility for different project areas is a very fair question to ask! At the moment, I'm going to be the primary point of contact with merge permissions for PRs across the different MCV repos, including the FR datasets (though Gina is involved in this domain area). This may evolve as more team members onboard, but right now I'll merge this with my thanks and start chipping away at the backlog to prevent similar blockages.
