Preprocessing tool for Korean NLP tasks
Functions
- Removes all characters except essential punctuation and Korean/English characters
- Removes lines that contain only English
- Removes lines that are part of a numbered list (lines beginning with a numeral followed by a period)
- Cleans lines by removing parentheses and the content inside them
- Removes lines with links/html content
- Removes unmatched parentheses and brackets
- Cleans repeated items (아하하하하하하 -> 아하하)
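Two of the steps above can be sketched with Python's standard re module. This is a minimal illustration, not the code in main.py: the names remove_parenthesized and collapse_repeats are hypothetical, and the rule "shorten any run of three or more identical characters to two" is inferred from the 아하하하하하하 -> 아하하 example.

```python
import re

# Assumed patterns; the tool's own regexes may differ.
_PAREN = re.compile(r"\([^()]*\)")   # an innermost "( ... )" group
_REPEAT = re.compile(r"(\S)\1{2,}")  # one character repeated 3+ times

def remove_parenthesized(text: str) -> str:
    """Drop parentheses together with their contents, innermost first."""
    previous = None
    while previous != text:          # repeat so nested groups are removed too
        previous = text
        text = _PAREN.sub("", text)
    return text

def collapse_repeats(text: str) -> str:
    """Shorten any run of three or more identical characters to two."""
    return _REPEAT.sub(r"\1\1", text)

print(collapse_repeats("아하하하하하하"))        # 아하하
print(remove_parenthesized("서울(Seoul)은 수도"))  # 서울은 수도
```

The repeat pattern compares code points, so it assumes the Hangul syllables are precomposed (NFC), which is how they are normally typed and stored.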
Options
Pass these optional flags to argparse as booleans (for example, --news True)
--news
: Cleans news items in Korean web crawl text
--datetime
: Cleans date-time text (ex. 2020-12-12 13:44)
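To make the --datetime option concrete, here is a minimal sketch of stripping timestamps shaped like the example above. The pattern and the name strip_datetime are assumptions made for illustration; the expressions actually used by main.py may cover other formats.

```python
import re

# Assumed format: YYYY-MM-DD, optionally followed by HH:MM or HH:MM:SS.
_DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}(?:\s+\d{1,2}:\d{2}(?::\d{2})?)?")

def strip_datetime(text: str) -> str:
    """Remove date-time stamps and tidy the whitespace left behind."""
    without_stamp = _DATETIME.sub(" ", text)
    # split() with no arguments also collapses the doubled spaces
    return " ".join(without_stamp.split())

print(strip_datetime("기사 입력 2020-12-12 13:44 서울"))  # 기사 입력 서울
```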
Install the dependencies listed in requirements.txt
pip install -r 'requirements.txt'
# Get help with options
python3 main.py -h
# Example usage
python3 main.py -i /data/dataset.txt -o /data_cleaned/dataset.txt -m 10 --news True -dt True
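Because the example passes the literal strings True and False on the command line, the parser needs an explicit string-to-boolean converter: with argparse, type=bool treats any non-empty string (including "False") as True. The sketch below shows one way to do this. It is hypothetical, not the parser in main.py: the helper name str2bool is made up, only the two documented flags are modelled, and pairing -dt with --datetime is an assumption based on the usage line.

```python
import argparse

def str2bool(value: str) -> bool:
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser covering only --news and --datetime.
    parser = argparse.ArgumentParser(description="Illustrative option parsing")
    parser.add_argument("--news", type=str2bool, default=False)
    # Assumption: -dt is the short form of --datetime.
    parser.add_argument("-dt", "--datetime", type=str2bool, default=False)
    return parser

args = build_parser().parse_args(["--news", "True", "-dt", "False"])
print(args.news, args.datetime)  # True False
```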
You can also add your own regex functions to main.py to handle types of text data that need extra cleaning.
Example (in main.py):
args = parser.parse_args()
txt = PreProcessing(args)
# Extra cleaning step: strip the leftover HTML text '/br' from every line in a single statement
txt.lines = [line.replace('/br', '') for line in txt.lines]
txt.apply()