Question: Process for already exported sentences? #632

HarikalarKutusu · 2022-08-15T06:20:13Z

Despite all efforts, some sentences with typos have already been exported. In addition, there are some bad sentences from the unmoderated period (mostly abbreviations and many foreign words; most of these have been reported and are in reported.tsv).

These result in mixed-quality recordings. Unknown or foreign words get mispronounced, and a sentence with a typo may be read either as written or in corrected form...

Is there a process to correct these?

Correct/remove in exported sentence-collector.tsv?
Remove also bad recordings?
?

We are working on dataset health issues and are correcting them in a post-processing phase. But if these sentences are not corrected, they may cause additional bad recordings in the future.

MichaelKohler · 2022-08-15T17:11:02Z

@Heyhillary could you bring this up with the team? I think we need a general process or documentation for these cases and not just something from the sentence collector side. What do you think? Thanks!

drzraf · 2022-09-13T15:12:38Z

(Coming from common-voice/common-voice#3786)

I think that sentences creating problems:

Should be removed from the corpus
Should be removed from the list of recordings
An adequate filter should be put in place in the cleanup/validation step to catch this particular class of problems in any future sentence import

Rationale: It's of the utmost importance to keep the corpus clean in order to:

Maintain recordings quality
Ensure most speakers understand the sentences
Ease contribution of the general public
Removing a few hundred or even a few thousand possibly misspelled sentences doesn't sound serious in comparison: good fresh ones will come in soon enough.
Given the size of the corpus relative to the number of recordings, it's likely that problematic sentences have been spoken only once or twice, making possible errors more impacting.
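
To make the "adequate filter" idea concrete, here is a minimal sketch of an import-time check for the problem classes mentioned in this thread (abbreviations, foreign words). The rules, the function name `find_problems`, and the hard-coded Turkish alphabet are illustrative assumptions, not the sentence collector's actual validation logic.

```python
# Hypothetical import-time filter; the rules below are assumptions for illustration only.

# Letters of the Turkish alphabet, both cases listed explicitly to avoid
# locale-sensitive case conversion of dotted/dotless i.
TURKISH_LETTERS = set(
    "abcçdefgğhıijklmnoöprsştuüvyz"
    "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ"
)


def find_problems(sentence: str, allowed_letters: set) -> list:
    """Return a list of problem tags for a sentence; an empty list means it passed."""
    problems = []
    # Strip common punctuation so "NATO," is still seen as an all-caps word.
    words = [w.strip(".,;:!?\"'()") for w in sentence.split()]
    if any(len(w) >= 2 and w.isupper() for w in words):
        problems.append("abbreviation")
    if any(ch.isdigit() for ch in sentence):
        problems.append("digit")
    # A letter outside the language's alphabet hints at a foreign word or name.
    if any(ch.isalpha() and ch not in allowed_letters for ch in sentence):
        problems.append("foreign-letter")
    return problems
```

A check like this only catches foreign words that use letters outside the alphabet (for example "Washington"); proper names spelled with native letters would still need a word list or human review.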

HarikalarKutusu · 2022-09-13T17:48:52Z

it's likely that problematic sentences have been spoken only once or twice, making possible errors more impacting.

Unfortunately, it is more than that for Turkish... During the first 4 years there was no moderation, and the initial text corpus was SETimes, which consists of Balkan news and contains many proper names from the Balkans. People recorded these sentences 3-4 times. This was how I started this journey :/

I'm currently writing an open-source middleware to exclude those sentences, and some bad voices, before feeding the data to training. (Excluding bad voices needs lengthy moderation by multiple people, via another piece of software.) We will see in a month or so...
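
As a rough illustration of the exclusion step described above (not the actual middleware), the sketch below drops clips whose sentence appears among reported sentences before the data reaches training. It assumes that both TSV files have a `sentence` column; the helper names are hypothetical.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))


def load_flagged_sentences(reported_rows, column="sentence"):
    """Collect the text of every reported sentence (column name is an assumption)."""
    return {row[column].strip() for row in reported_rows if row.get(column)}


def drop_flagged(clip_rows, flagged, column="sentence"):
    """Yield only the clip rows whose sentence has not been flagged."""
    for row in clip_rows:
        if row[column].strip() not in flagged:
            yield row
```

Usage would look like `drop_flagged(read_tsv(clips_path), load_flagged_sentences(read_tsv("reported.tsv")))`, where `clips_path` is whichever clip list is fed to training. Matching on exact sentence text is the simplest choice here; matching on a sentence ID would be more robust if the files provide one.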

MichaelKohler · 2023-05-10T20:55:25Z

@HarikalarKutusu I'm archiving this project. Instead of just moving this issue over to the main CV repo, I would suggest creating a new, more generic issue if this is still relevant.

MichaelKohler added the "question" (Further information is requested) label Aug 15, 2022

MichaelKohler closed this as completed May 10, 2023
