
Added a vocabulary_size argument to UnicodeCharacterTokenizer #163

Merged: 8 commits, May 5, 2022

Conversation

aflah02 (Collaborator) commented May 3, 2022

Fixes #155
I've also added new tests and updated the existing ones by adding the new parameter to the config files.
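To illustrate the feature under discussion, here is a minimal pure-Python sketch of what a `vocabulary_size` cap on a Unicode character tokenizer can do. This is not the KerasNLP implementation; it assumes, for illustration, that codepoints at or above `vocabulary_size` are clamped to `vocabulary_size - 1` so that every token id fits inside the vocabulary:

```python
def tokenize(text, vocabulary_size=None):
    """Map a string to its Unicode codepoints, optionally capping them.

    Illustrative sketch only (not the KerasNLP implementation): we
    assume codepoints >= vocabulary_size are clamped to
    vocabulary_size - 1 so every token id stays in range.
    """
    codepoints = [ord(ch) for ch in text]
    if vocabulary_size is not None:
        codepoints = [min(cp, vocabulary_size - 1) for cp in codepoints]
    return codepoints

print(tokenize("abc"))                         # [97, 98, 99]
print(tokenize("abcé", vocabulary_size=128))   # é (233) clamped to 127
```

With no `vocabulary_size`, the tokenizer passes raw codepoints through; with one set, downstream embedding layers can safely be sized to `vocabulary_size`.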

aflah02 (Collaborator, Author) commented May 3, 2022

@mattdangerw I think the PR is ready for review now. Here's a demo

mattdangerw (Member) left a comment

Thanks! This looks great. Just a few comments then this is ready to go.

keras_nlp/tokenizers/unicode_character_tokenizer.py (comments resolved)
keras_nlp/tokenizers/unicode_character_tokenizer_test.py (comments resolved)
aflah02 requested a review from mattdangerw, May 4, 2022
mattdangerw (Member) left a comment

Looks good! Thanks. Added a little more to the example to show how it could be used.

mattdangerw merged commit a2c6067 into keras-team:master, May 5, 2022
aflah02 added a commit to aflah02/keras-nlp that referenced this pull request, May 14, 2022