Linguistic Data Treatment

Description

This repo brings together the main cleaning processes needed to get the most value out of a dirty linguistic dataset.

To that end, the concept of a "cleaning process" is divided into two kinds:

Normalizers: their goal is to make the entire dataset consistent => EXAMPLE
Validators: their goal is to check whether a piece of the dataset is valuable or not => EXAMPLE
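
To make the distinction concrete, here is a minimal sketch of one normalizer and one validator. The function names are borrowed from the process list in this README, but the signatures, the `max_ratio` parameter, and the logic are assumptions made for illustration, not the code shipped in the `processes` folder.

```python
import re


def get_text_with_normalized_spaces(text: str) -> str:
    """Normalizer (illustrative): collapse whitespace runs and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


def has_too_many_numbers(text: str, max_ratio: float = 0.5) -> bool:
    """Validator (illustrative): True when digits exceed max_ratio of the
    non-space characters. max_ratio is a hypothetical threshold."""
    characters = [char for char in text if not char.isspace()]
    if not characters:
        return False
    digits = sum(char.isdigit() for char in characters)
    return digits / len(characters) > max_ratio
```

A normalizer returns a cleaned version of the text, while a validator only answers a yes/no question about it and leaves the text untouched.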

Along the same lines, there is another type of function called "Helpers", whose goal is to avoid rewriting code that is used mainly by the normalizers and the validators => EXAMPLE

Last but not least, these processes are designed to handle both monolingual and bilingual datasets => EXAMPLE
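
As a rough illustration of what a bilingual check might look like, the sketch below compares the numbers found in a source segment and in its translation. The name `has_parallel_numbers`, its signature, and the digit-only matching rule are assumptions chosen for this example; the repo's own "Has Parallel Number Val" process may work differently.

```python
import re
from collections import Counter

NUMBER_PATTERN = re.compile(r"\d+")


def has_parallel_numbers(source: str, target: str) -> bool:
    """Bilingual validator (illustrative): True when both segments contain
    the same numbers the same number of times, regardless of order."""
    source_numbers = Counter(NUMBER_PATTERN.findall(source))
    target_numbers = Counter(NUMBER_PATTERN.findall(target))
    return source_numbers == target_numbers
```

A monolingual validator would take a single text, whereas a bilingual one like this takes a source and target pair and judges them together.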

Recommended Usage

You can find a clear example of usage for monolingual datasets HERE and for bilingual datasets HERE.

On the one hand, all the desired normalizers are brought together into a single function, which defines what "normalize" means in this particular case. On the other hand, the same is done for "validate", with the difference that this function also keeps track of the invalid parts of the dataset.
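
The pattern described above could be sketched roughly as follows. Every name here (`normalize_quotes`, `normalize_spaces`, `is_not_empty`, `has_enough_words`, `NORMALIZERS`, `VALIDATORS`) is hypothetical and stands in for whichever processes you pick; the actual scripts in this repo may be organized differently.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for processes chosen for a given dataset.
QUOTE_TABLE = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})


def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_TABLE)


def normalize_spaces(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def is_not_empty(text: str) -> bool:
    return bool(text)


def has_enough_words(text: str, min_words: int = 2) -> bool:
    return len(text.split()) >= min_words


NORMALIZERS: List[Callable[[str], str]] = [normalize_quotes, normalize_spaces]
VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "is_not_empty": is_not_empty,
    "has_enough_words": has_enough_words,
}


def normalize(text: str) -> str:
    """Apply every selected normalizer in order."""
    for normalizer in NORMALIZERS:
        text = normalizer(text)
    return text


def validate(lines: List[str]) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split lines into valid ones and (line, failed_validator_name) pairs."""
    valid, invalid = [], []
    for line in lines:
        failed = next((name for name, check in VALIDATORS.items() if not check(line)), None)
        if failed is None:
            valid.append(line)
        else:
            invalid.append((line, failed))
    return valid, invalid
```

Recording the name of the first failing validator next to each rejected line makes it easy to review afterwards why a given part of the dataset was discarded.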

Note: Which "cleaning processes" you should use depends on the language(s) of the dataset being treated.

Future Outlook

Although this repo was originally built with the English-Spanish combination in mind, the number of "cleaning processes" is expected to grow in order to cover as many languages as possible.

Contribution Guidelines

Contributions should follow the guide HERE.

Instructions

Fork this repository
Clone your forked repository
Add your process
Commit and push
Create a pull request
Star this repository
Wait for the pull request to be merged
Celebrate your first step into the open source world, and contribute more

Note: When you add a process, also add it to the README to make any kind of issue easier to solve.

Current Cleaning Processes

Nº	Cleaning Process	Author
1	Has A Properly Amount Of Words	Daniel Beteta
2	Has Parallel Number Val	Daniel Beteta
3	Has Parallel Symbol Val	Daniel Beteta
4	Has Properly Length Factor Val	Daniel Beteta
5	Has Too Many Numbers	Daniel Beteta
6	Is In The Accurate Language	Daniel Beteta
7	Is Repeated	Daniel Beteta
8	Get Text With Normalized Quotes	Daniel Beteta
9	Get Text With Normalized Spaces	Daniel Beteta
10	Get Text With Normalized Unicode Characters	Daniel Beteta
11	Get Text Without Initial Index	Daniel Beteta
12	Get Text Without Repeated Symbols	Daniel Beteta
13	Get Text Without Tags	Daniel Beteta

