This repository has been archived by the owner on May 10, 2023. It is now read-only.

Question: Process for already exported sentences? #632

Closed
HarikalarKutusu opened this issue Aug 15, 2022 · 4 comments
Labels
question Further information is requested

Comments

@HarikalarKutusu
Contributor

Despite all our efforts, some sentences with typos have already been exported. In addition, there are some bad sentences from the unmoderated period (mostly abbreviations and many foreign words; most of these have been reported and are in reported.tsv).

These result in inconsistent recordings. Unknown or foreign words get mispronounced, and typos are either read as written or read in corrected form...

Is there a process to correct these?

  • Correct/remove in exported sentence-collector.tsv?
  • Remove also bad recordings?
  • ?

We are working on dataset health issues and are correcting them in a post-processing phase. But if these sentences are not corrected, they might cause additional bad recordings in the future.

@MichaelKohler MichaelKohler added the question Further information is requested label Aug 15, 2022
@MichaelKohler
Member

@Heyhillary could you bring this up with the team? I think we need a general process or documentation for these cases and not just something from the sentence collector side. What do you think? Thanks!

@drzraf

drzraf commented Sep 13, 2022

(Coming from common-voice/common-voice#3786)

I think that sentences creating problems:

  1. Should be removed from the corpus
  2. Should be removed from the list of recordings
  3. Should lead to an adequate filter being put in place in the cleanup/validation step, in order to catch this particular class of problems in any future sentence import

Rationale: It's of the utmost importance to keep the corpus clean in order to:

  • Maintain recordings quality

  • Ensure most speakers understand the sentences

  • Ease contribution of the general public

  • Removing some hundreds or even thousands of possibly or probably misspelled sentences doesn't sound grave in comparison: good fresh ones will come in soon enough.

  • Given the size of the corpus relative to the number of recordings, it's likely that problematic sentences have been spoken only once or twice, making possible errors more impacting.
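
To illustrate the kind of filter meant in point 3, here is a minimal sketch in Python. It is not the sentence collector's actual validation code. The two rules (a run of capital letters is treated as an abbreviation, and any letter outside the Turkish alphabet is treated as a sign of a foreign word) and the names `find_problems` and `turkish_lower` are assumptions made up for this example. A real filter would need per-language rules.

```python
import re

# The 29 letters of the Turkish alphabet. q, w and x are not part of it,
# so (assumption) their presence hints at a foreign word.
TURKISH_LETTERS = set("abcçdefgğhıijklmnoöprsştuüvyz")

# Hypothetical rule: two or more consecutive capitals look like an abbreviation.
ABBREVIATION = re.compile(r"\b[A-ZÇĞİÖŞÜ]{2,}\b")


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing: İ -> i and I -> ı."""
    return text.replace("İ", "i").replace("I", "ı").lower()


def find_problems(sentence: str) -> list[str]:
    """Return the reasons to hold a sentence back; an empty list means it passes."""
    problems = []
    if ABBREVIATION.search(sentence):
        problems.append("abbreviation")
    letters = {ch for ch in turkish_lower(sentence) if ch.isalpha()}
    foreign = letters - TURKISH_LETTERS
    if foreign:
        problems.append("non-Turkish letters: " + "".join(sorted(foreign)))
    return problems
```

A sentence such as "Washington bugün sakin." would be flagged because of the letter "w", while "Bugün hava çok güzel." passes. Heuristics like these will produce false positives, so flagged sentences are better sent to human review than deleted outright.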

@HarikalarKutusu
Contributor Author

it's likely that problematic sentences have been spoken only once or twice, making possible errors more impacting.

Unfortunately, it is more than that for Turkish... In the first 4 years, without any moderation and with the initial text corpus, SETimes, which consists of Balkan news and has many proper names from the Balkans, people recorded these 3-4 times. This was how I started this journey :/

I'm currently writing an open-source middleware to exclude those sentences or some bad voices (identifying which needs lengthy moderation by multiple people, via other software) before feeding the data to training. We will see in a month or so...
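
The exclusion step can be sketched roughly as follows. This is not the middleware mentioned above, only a minimal illustration of the idea. The function name `drop_reported` is hypothetical, and the sketch assumes that both the clip list and the reported list are tab-separated with a header row and share a `sentence` column.

```python
import csv
import io


def drop_reported(clips_tsv: str, reported_tsv: str, column: str = "sentence") -> list[dict]:
    """Return the clip rows whose sentence is not in the reported list.

    Both arguments are the text contents of TSV files with a header row.
    The shared column name is an assumption; adjust it to the real files.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reported = {
        row[column].strip()
        for row in csv.DictReader(
            io.StringIO(reported_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
        )
    }
    return [
        row
        for row in csv.DictReader(
            io.StringIO(clips_tsv), delimiter="\t", quoting=csv.QUOTE_NONE
        )
        if row[column].strip() not in reported
    ]
```

Matching on the exact sentence text is the simplest choice. If both files carry a stable sentence identifier, matching on that would be more robust against whitespace or normalization differences.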

@MichaelKohler
Member

@HarikalarKutusu I'm archiving this project. Instead of just moving this issue over to the main CV repo, I would suggest creating a new, more generic issue if it is still relevant.
