dbeteta-w/linguistic_data_treatment

Linguistic Data Treatment

Description

The purpose of this repo is to bring together the main cleaning processes needed to get the most value out of a dirty linguistic dataset.

To that end, the concept of "cleaning process" is divided into two kinds:

  1. Normalizers: their goal is to make the entire dataset consistent => EXAMPLE
  2. Validators: their goal is to check whether a piece of the dataset is valuable or not => EXAMPLE
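
As a rough illustration of the two kinds, here is a minimal sketch of one normalizer and one validator. The function names, signatures, and the `max_ratio` threshold are assumptions made for this example; they are not taken from the repo's code.

```python
import re


def get_text_with_normalized_spaces(text: str) -> str:
    """Normalizer (hypothetical): collapse whitespace runs into one space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def has_too_many_numbers(text: str, max_ratio: float = 0.5) -> bool:
    """Validator (hypothetical): True when digits exceed max_ratio of the non-space characters.

    The 0.5 default is an arbitrary value chosen for illustration.
    """
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    digits = sum(c.isdigit() for c in chars)
    return digits / len(chars) > max_ratio
```

A normalizer takes text and returns text, so it can be applied to every line. A validator only answers yes or no, so the caller decides what to do with the lines it rejects.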

Along the same lines, there is another type of function called "Helpers", whose goal is to avoid rewriting code that is used mainly in the normalizers and the validators => EXAMPLE

Last but not least, these processes are designed to handle both monolingual and bilingual datasets => EXAMPLE
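
For the bilingual case, a validator typically compares a source segment with its translation. The sketch below shows one possible check, assuming that a valid pair contains the same digit groups on both sides. The name `has_parallel_numbers` and the matching rule are assumptions for illustration, not the repo's implementation.

```python
import re
from collections import Counter

# Match bare digit groups, so "1,000" and "1.000" both yield ["1", "000"]
# regardless of the thousands separator each language uses.
DIGIT_GROUP = re.compile(r"\d+")


def has_parallel_numbers(source: str, target: str) -> bool:
    """Validator (hypothetical): True when both segments contain the same digit groups.

    Counter compares the groups as multisets, so word order may differ
    between the two languages.
    """
    return Counter(DIGIT_GROUP.findall(source)) == Counter(DIGIT_GROUP.findall(target))
```

This sketch only sees digits, so a number spelled out on one side ("two" versus "2") would be flagged as a mismatch.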

Recommended Usage

You can find a clear usage example for monolingual datasets HERE and for bilingual datasets HERE.

These examples do two things. On the one hand, they bring all the desired normalizers together into a single function, which defines what "normalize" means in that particular case. On the other hand, they do the same for "validate", with the particularity that this function also keeps track of the invalid parts of the dataset.
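
The pattern described above could be sketched as follows for a monolingual dataset. Every function name here, along with the choice of normalizers and validators, is hypothetical and only illustrates the composition idea; the linked examples remain the reference.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical normalizers: each one maps text to text.
QUOTES = str.maketrans({
    "\u201c": '"', "\u201d": '"',  # curly double quotes
    "\u2018": "'", "\u2019": "'",  # curly single quotes
})


def normalize_quotes(text: str) -> str:
    return text.translate(QUOTES)


def normalize_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical validators: each one returns True when the text is acceptable.
def has_enough_words(text: str) -> bool:
    return len(text.split()) >= 2


def has_letters(text: str) -> bool:
    return any(c.isalpha() for c in text)


NORMALIZERS: List[Callable[[str], str]] = [normalize_quotes, normalize_spaces]
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "has_enough_words": has_enough_words,
    "has_letters": has_letters,
}


def normalize(text: str) -> str:
    """Apply every chosen normalizer in order."""
    for normalizer in NORMALIZERS:
        text = normalizer(text)
    return text


def validate(lines: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split lines into valid ones and (line, first failed validator) pairs."""
    valid: List[str] = []
    invalid: List[Tuple[str, str]] = []
    for line in lines:
        failed = next((name for name, check in VALIDATORS.items() if not check(line)), None)
        if failed is None:
            valid.append(line)
        else:
            invalid.append((line, failed))
    return valid, invalid
```

A typical call would normalize first and validate afterwards, for example `validate([normalize(line) for line in raw_lines])`. Keeping the name of the failed validator next to each rejected line makes it easier to review why a line was discarded.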

Note: Which "cleaning processes" you should use depends on the language(s) of the dataset being treated.

Future Perspective

Although this repo was originally created with the English-Spanish combination in mind, the number of "cleaning processes" is expected to grow in order to cover as many languages as possible.

Contribution Guidelines

The contribution guidelines are described in the guide HERE.

Instructions

  • Fork this repository
  • Clone your forked repository
  • Add your process
  • Commit and push
  • Create a pull request
  • Star this repository
  • Wait for the pull request to be merged
  • Celebrate your first step into the open source world, and contribute more

Note: When you add a process, add it to the README as well, so that any issues are easier to solve.

Current Cleaning Processes

#   Cleaning Process                              Author
1   Has A Properly Amount Of Words                Daniel Beteta
2   Has Parallel Number Val                       Daniel Beteta
3   Has Parallel Symbol Val                       Daniel Beteta
4   Has Properly Length Factor Val                Daniel Beteta
5   Has Too Many Numbers                          Daniel Beteta
6   Is In The Accurate Language                   Daniel Beteta
7   Is Repeated                                   Daniel Beteta
8   Get Text With Normalized Quotes               Daniel Beteta
9   Get Text With Normalized Spaces               Daniel Beteta
10  Get Text With Normalized Unicode Characters   Daniel Beteta
11  Get Text Without Initial Index                Daniel Beteta
12  Get Text Without Repeated Symbols             Daniel Beteta
13  Get Text Without Tags                         Daniel Beteta
