Add a more advanced normalize function #374

bact · 2020-03-31T13:03:53Z

A common normalize function that does more than reorder Thai characters. Something that can be used quickly for matching, searching, sorting, and preparing data for classification tasks.

Some ideas:

- Remove non-visible characters, like zero-width chars (see "Add a function to remove zero-width characters", #373)
- Remove unnecessary spaces
- Normalize repetitions
- Normalize "obvious" mistakes, like
  - consonant + tonemark A + tonemark B <--- maybe we can remove tonemark A

Note that for Unicode normalization, Python already has unicodedata.normalize().
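
To make the ideas above concrete, here is a rough sketch of what such a function might look like. It is not PyThaiNLP's implementation: the name `normalize_sketch`, the set of zero-width characters, and the rule "collapse three or more repeats down to two" are all assumptions chosen for illustration, and "keep only the last tone mark" is just one reading of the tonemark idea.

```python
import re

# Zero-width characters often seen in scraped text (illustrative, not exhaustive).
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"
# Thai tone marks: Mai Ek, Mai Tho, Mai Tri, Mai Chattawa.
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

_RE_ZERO_WIDTH = re.compile(f"[{ZERO_WIDTH}]")
# One or more tone marks that are immediately followed by another tone mark.
_RE_EXTRA_TONES = re.compile(f"[{TONE_MARKS}]+(?=[{TONE_MARKS}])")
# Any character repeated three or more times in a row.
_RE_REPEATS = re.compile(r"(.)\1{2,}")
_RE_SPACES = re.compile(r"[ \t]{2,}")


def normalize_sketch(text: str) -> str:
    """Hypothetical normalizer covering the ideas listed in this issue."""
    text = _RE_ZERO_WIDTH.sub("", text)    # remove non-visible characters
    text = _RE_EXTRA_TONES.sub("", text)   # keep only the last tone mark in a run
    text = _RE_REPEATS.sub(r"\1\1", text)  # assumption: cap repetitions at two
    text = _RE_SPACES.sub(" ", text)       # collapse runs of spaces and tabs
    return text.strip()
```

Whether repetitions should be capped at two, at one, or left alone probably depends on the task (for example, sentiment classification may want to keep some signal of elongation), so that threshold would likely need to be a parameter.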


p16i · 2020-04-01T18:18:48Z

How about the case of ำ? I sometimes see PDF parsers return it as ◌̊ and า.

bact · 2020-04-02T16:22:45Z

The current normalize() will group Nikhahit and Sara Aa together and emit Sara Am.
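
As an illustration of that grouping (a minimal sketch, not the actual code behind normalize()), the decomposed pair Nikhahit (U+0E4D) + Sara Aa (U+0E32) can be rewritten as Sara Am (U+0E33) with a single regular expression. The function name `compose_sara_am` is hypothetical, and allowing an optional tone mark between the two characters is an assumption about what the input may look like.

```python
import re

NIKHAHIT = "\u0e4d"  # THAI CHARACTER NIKHAHIT
SARA_AA = "\u0e32"   # THAI CHARACTER SARA AA
SARA_AM = "\u0e33"   # THAI CHARACTER SARA AM
TONE_MARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# Nikhahit, an optional tone mark, then Sara Aa.
_RE_DECOMPOSED_AM = re.compile(f"{NIKHAHIT}([{TONE_MARKS}]?){SARA_AA}")


def compose_sara_am(text: str) -> str:
    """Hypothetical helper: merge Nikhahit + Sara Aa into Sara Am."""
    # Any tone mark found between the two is moved in front of Sara Am.
    return _RE_DECOMPOSED_AM.sub(lambda m: m.group(1) + SARA_AM, text)
```

If a PDF parser emits a look-alike mark from another Unicode block instead of Nikhahit, that character would need to be mapped to U+0E4D first; this sketch does not attempt that.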

bact added the "enhancement" label (enhance functionalities) Mar 31, 2020

bact mentioned this issue May 7, 2020

Fixing and enhancing text normalization funcitons #389

Merged

bact closed this as completed in #389 May 8, 2020

bact mentioned this issue May 8, 2020

PyThaiNLP 2.2 change log #330

Closed
