It would be more efficient if we had a way to automatically generate this changelog.
@heytitle Agreed. Do you have any suggestions?
Publicly announced 2.1 and closed the issue: https://www.blognone.com/node/113587
Corpus

- `PYTHAINLP_DATA_DIR` environment variable to set the location of downloaded data (default is `~/pythainlp-data`) (add option of setting data dir with an environmental variable #238, Added docs on PYTHAINLP_DATA_DIR environ variable #294) - thanks @dhpollack @abhabongse
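As an illustration, the override can be set from Python before the library is imported (the path below is only an example, not a PyThaiNLP default):

```python
import os
from pathlib import Path

# Point PyThaiNLP at a custom data directory (the path here is just an
# example). Set the variable before importing pythainlp so it takes
# effect the first time corpora are downloaded.
os.environ["PYTHAINLP_DATA_DIR"] = str(Path.home() / ".cache" / "pythainlp")

# import pythainlp  # corpora would now be cached under the directory above
```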
Localization

- `pythainlp.util.thai_time` - spell out time to Thai words (Add pythainlp.util.thai_time #303) - thanks @wannaphong @abhabongse @bact
- `bahttext` - fix a bug for a value of one million (bahttext not working for 1,000,000 #350) - thanks @wannaphong
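To see why one million is a tricky value, here is a toy reimplementation of baht-text spelling. The names and structure below are made up for illustration and this is not PyThaiNLP's actual code; it only sketches the place-value logic, where the millions place needs its own recursion step:

```python
# Toy baht-text speller: illustrates the place-value logic only; the real
# pythainlp implementation differs. Whole-baht integer amounts only.
_DIGITS = ["", "หนึ่ง", "สอง", "สาม", "สี่", "ห้า", "หก", "เจ็ด", "แปด", "เก้า"]
_PLACES = ["", "สิบ", "ร้อย", "พัน", "หมื่น", "แสน"]

def _read_number(n: int) -> str:
    # Values of a million and up recurse on the millions part:
    # 1_000_000 -> "หนึ่ง" + "ล้าน" (the value reported in #350).
    if n >= 1_000_000:
        return _read_number(n // 1_000_000) + "ล้าน" + _read_number(n % 1_000_000)
    text = ""
    for pos, digit in enumerate(reversed(str(n))):
        d = int(digit)
        if d == 0:
            continue
        if pos == 0 and d == 1 and n > 9:
            text = "เอ็ด" + text    # trailing one reads "et"
        elif pos == 1 and d == 2:
            text = "ยี่สิบ" + text   # two in the tens place reads "yi sip"
        elif pos == 1 and d == 1:
            text = "สิบ" + text     # ten is just "sip", not "one sip"
        else:
            text = _DIGITS[d] + _PLACES[pos] + text
    return text

def baht_text(amount: int) -> str:
    return _read_number(amount) + "บาทถ้วน"  # "... baht exactly"
```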
Tokenizer

- `pythainlp.tokenize.Tokenizer` is now immediately available on `import pythainlp` (79432c2) - thanks @korakot
- `ssg`, a CRF syllable segmenter (Questions on the implementation of syllable_tokenize #229, Alternative syllable tokenizer #237, Add ssg #242) - thanks @wannaphong @ponrawee @heytitle
- AttaCut, a fast and accurate tokenizer, is now available through `engine="attacut"` in `pythainlp.tokenize.word_tokenize()` (Integrate AttaCut to PyThaiNLP #258, add attacut to pythainlp/tokenize #261) - thanks @heytitle @bkktimber
- `newmm-safe` for `pythainlp.tokenize.word_tokenize()` - a `newmm` engine with an additional mechanism to avoid a possibly exponentially long wait on long text with many ambiguous break points ("newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points #302) - thanks @bact
- `newmm` engine: graph size limit added in `_onecut()` to help avoid a possible long wait (Add graph size limit in _onecut() to avoid long wait for ambiguous text #333) (available in 2.1.1, backported from 2.2) - thanks @bact
- `longest` engine: the last character is now consumed (Longest Match segment fails when the entire input text is a full word. #357) (available in 2.1.4) - thanks @bact
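A minimal sketch of the longest-matching idea behind the `longest` engine (toy code with a made-up dictionary, not PyThaiNLP's implementation), including the edge case where the whole input is a single dictionary word:

```python
def longest_match(text: str, words: set) -> list:
    """Toy longest-matching segmenter (not PyThaiNLP's implementation)."""
    tokens, i = [], 0
    while i < len(text):
        # Scan candidates from longest to shortest. Because j runs all the
        # way up to len(text), an input that is itself one dictionary word
        # is matched whole and its final character is consumed (the edge
        # case reported in #357 was of this flavor).
        for j in range(len(text), i, -1):
            if text[i:j] in words:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # out-of-vocabulary character: emit as-is
            i += 1
    return tokens
```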
Spellchecker

Named-Entity Tagger
Dependency cleaning
Removing and updating many dependencies - thanks @c4n @artificiala @cstorm125 @korakot @bact @wannaphong
Remove:

- `keras`, `tensorflow` (Port Thai2Rom from Keras to PyTorch #202, Thai2Rom on PyTorch (seq2seq no attention mechanism) #235, pytorch seq2seq implementation for Thai romanization #246) - Thai romanization is now implemented in PyTorch
- `fastai` (Remove fastai from the dependencies #252) - removed, replacing the `pythainlp.ulmfit` preprocessing-related code with a self-implemented one
- `marisa-trie` (Change from marisa-trie to a Trie implementation written in python #277) - removed, replaced with a native Trie implementation
- `deepcut` (Remove deepcut, keras, tensorflow from dependencies #283) - removed; the word tokenizer still supports `engine="deepcut"`, but users need to install its dependencies (`deepcut`, `keras`, `tensorflow`) themselves
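A native Trie of the kind that replaced `marisa-trie` can be pictured as a dict-of-dicts; the following minimal class is a hypothetical sketch, not PyThaiNLP's actual implementation:

```python
class Trie:
    """Minimal prefix tree, sketched to illustrate replacing marisa-trie
    with a pure-Python structure (not PyThaiNLP's actual class)."""

    def __init__(self, words=()):
        self.root = {}
        for w in words:
            self.add(w)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["_end"] = True  # marks a complete word

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return "_end" in node

    def prefixes(self, text: str):
        """All stored words that are prefixes of text (useful for
        maximal-matching tokenizers)."""
        node, out = self.root, []
        for i, ch in enumerate(text):
            if ch not in node:
                break
            node = node[ch]
            if "_end" in node:
                out.append(text[: i + 1])
        return out
```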
Update:

- `artagger` (Use artagger from main repo, use tensorflow < 2 #281) - updated to use the one from the main repo (it previously depended on a fork)
- `setup.py` (Include only direct dependency in setup.py #275)

Documentation
Others