Releases: huggingface/tokenizers
v0.15.1.rc0
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
- update to version = "0.15.1-dev0" by @ArthurZucker in #1390
- Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method by @epwalsh in #1381
- Stale bot. by @Narsil in #1404
- Fix doc links in readme by @Pierrci in #1367
- Faster HF dataset iteration in docs by @mariosasko in #1414
- Add quick doc to byte_level.rs by @steventrouble in #1420
- Fix make bench. by @Narsil in #1428
- Bump follow-redirects from 1.15.1 to 1.15.4 in /tokenizers/examples/unstable_wasm/www by @dependabot in #1430
- pyo3: update to 0.20 by @mikelui in #1386
New Contributors
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
- @steventrouble made their first contribution in #1420
Full Changelog: v0.13.4.rc2...v0.15.1.rc0
v0.15.0
What's Changed
- fix a clerical error in the comment by @tiandiweizun in #1356
- fix: remove useless token by @rtrompier in #1371
- Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node by @dependabot in #1370
- Allow hf_hub 0.18 by @mariosasko in #1383
- Allow `huggingface_hub<1.0` by @Wauplin in #1385
- [`pre_tokenizers`] Fix sentencepiece based Metaspace by @ArthurZucker in #1357
New Contributors
- @tiandiweizun made their first contribution in #1356
- @rtrompier made their first contribution in #1371
- @mariosasko made their first contribution in #1383
- @Wauplin made their first contribution in #1385
Full Changelog: v0.14.1...v0.15.0
v0.14.1
What's Changed
- Fix conda release by @ArthurZucker in #1211
- Fix node release by @ArthurZucker in #1212
- Printing warning to stderr. by @Narsil in #1222
- Fixing padding_left sequence_ids. by @Narsil in #1233
- Use LTO for release and benchmark builds by @csko in #1157
- fix unigram.rs test_sample() by @chris-ha458 in #1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in #1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in #1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
- Release all at once for simplicity. by @Narsil in #1320
- Fix stride condition. by @Narsil in #1321
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
- Preparing release. by @Narsil in #1355
New Contributors
- @csko made their first contribution in #1157
- @chris-ha458 made their first contribution in #1244
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
Full Changelog: v0.13.3...v0.14.1
v0.14.1rc1
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
- update package version for dev by @ArthurZucker in #1339
- Added ability to inspect a 'Sequence' pre-tokenizer. by @eaplatanios in #1341
- Let's allow hf_hub < 1.0 by @ArthurZucker in #1344
- Fixing the progressbar. by @Narsil in #1353
New Contributors
- @mikelui made their first contribution in #1322
- @eaplatanios made their first contribution in #1341
Full Changelog: v0.13.4.rc2...v0.14.1rc1
v0.14.0
- #1335, AddedToken is reworked, `is_special_token` renamed to `special` for consistency
- feature http is now `OFF` by default, and depends on hf-hub instead of cached_path (updated cache directory, better sync implementation)
- Removed SSL link on the python package, calling huggingface_hub directly instead.
- New dependency: huggingface_hub (while we deprecate `Tokenizer.from_pretrained(...)` in favor of `Tokenizer.from_file(huggingface_hub.hf_hub_download(MODEL_ID, "tokenizer.json"))`)
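The migration described in the last note can be sketched as follows. This is a minimal sketch, not a helper shipped by the library: `load_tokenizer` and its `download` / `load` parameters are hypothetical names introduced here so that the download step can be swapped out. With the defaults it assumes `huggingface_hub` and `tokenizers` are both installed and that the Hub repository contains a `tokenizer.json`.

```python
from typing import Any, Callable, Optional


def load_tokenizer(
    model_id: str,
    filename: str = "tokenizer.json",
    download: Optional[Callable[..., str]] = None,
    load: Optional[Callable[[str], Any]] = None,
) -> Any:
    """Fetch `filename` from the Hub repo `model_id` and load it as a tokenizer.

    `download` and `load` are hypothetical injection points (not part of
    either library); when omitted, the real library calls are used.
    """
    if download is None:
        # Imported lazily so the helper can be exercised without the package.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    if load is None:
        from tokenizers import Tokenizer

        load = Tokenizer.from_file

    # hf_hub_download returns the local path of the cached file.
    local_path = download(repo_id=model_id, filename=filename)
    return load(local_path)
```

Calling `load_tokenizer(MODEL_ID)` is then roughly what the note describes as the replacement for `Tokenizer.from_pretrained(MODEL_ID)`; arguments such as a revision or token are left out of this sketch.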
What's Changed
- Fix conda release by @ArthurZucker in #1211
- Fix node release by @ArthurZucker in #1212
- Printing warning to stderr. by @Narsil in #1222
- Fixing padding_left sequence_ids. by @Narsil in #1233
- Use LTO for release and benchmark builds by @csko in #1157
- fix unigram.rs test_sample() by @chris-ha458 in #1244
- implement a simple max_sentencepiece_length into BPE by @chris-ha458 in #1228
- Makes `decode` and `decode_batch` work on borrowed content. by @mfuntowicz in #1251
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
- Release all at once for simplicity. by @Narsil in #1320
- Fix stride condition. by @Narsil in #1321
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
New Contributors
- @csko made their first contribution in #1157
- @chris-ha458 made their first contribution in #1244
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
- @mikelui made their first contribution in #1322
Full Changelog: v0.13.3...v0.14.0
v0.14.0.rc1
Reworks the release pipeline. Other breaking changes are mostly related to #1335, where AddedToken is reworked.
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
- Reduce number of different revisions by 1 by @Narsil in #1329
- Python 38 arm by @Narsil in #1330
- Move to maturin mimicking move for `safetensors`. + Rewritten node bindings. by @Narsil in #1331
- Updating the docs with the new command. by @Narsil in #1333
- Update added tokens by @ArthurZucker in #1335
Full Changelog: v0.13.4.rc2...v0.14.0.rc1
v0.13.4.rc3
Mostly checking the new release scripts actually work.
What's Changed
- pyo3: update to 0.19 by @mikelui in #1322
- Add `expect()` for disabling truncation by @boyleconnor in #1316
- Re-using scripts from safetensors. by @Narsil in #1328
Full Changelog: v0.13.4.rc2...v0.13.4.rc3
v0.13.4.rc2
Python v0.13.4.rc1
What's Changed
- Update all GH Actions with dependency on actions/checkout by @mfuntowicz in #1256
- Parallelize unigram trainer by @mishig25 in #976
- Update unigram/trainer.rs by @chris-ha458 in #1257
- Fixing broken link. by @Narsil in #1268
- fix documentation regarding regex by @chris-ha458 in #1264
- Update Cargo.toml by @chris-ha458 in #1266
- Update README.md - Broken link by @sbhavani in #1272
- [doc build] Use secrets by @mishig25 in #1273
- Improve error for truncation with too high stride by @boyleconnor in #1275
- Add unigram bytefallback by @ArthurZucker in #1217
- revise type specification by @hiroshi-matsuda-rit in #1289
- Bump tough-cookie from 4.0.0 to 4.1.3 in /bindings/node by @dependabot in #1291
- Update path name: master -> main by @bact in #1292
- import Tuple from typing by @kellymarchisio in #1295
- Fixing clippy warnings on 1.71. by @Narsil in #1296
- Bump word-wrap from 1.2.3 to 1.2.4 in /bindings/node by @dependabot in #1299
- feat: Added CITATION.cff. by @SamuelLarkin in #1302
- Single warning for holes. by @Narsil in #1303
- Give error when initializing tokenizer with too high stride by @boyleconnor in #1306
- Handle when precompiled charsmap is empty by @kellymarchisio in #1308
- Derive clone for TrainerWrapper by @jonatanklosko in #1317
- CD backports by @chris-ha458 in #1318
- 0.13.4.rc1 by @Narsil in #1319
New Contributors
- @sbhavani made their first contribution in #1272
- @boyleconnor made their first contribution in #1275
- @hiroshi-matsuda-rit made their first contribution in #1289
- @bact made their first contribution in #1292
- @kellymarchisio made their first contribution in #1295
- @SamuelLarkin made their first contribution in #1302
- @jonatanklosko made their first contribution in #1317
Full Changelog: v0.13.4-rc2...v0.13.4.rc1
v0.13.4-rc2: Makes `decode` and `decode_batch` work on borrowed content. (#1251)
Pre-release
* Makes `decode` and `decode_batch` work on borrowed content.
* Make `decode_batch` work with borrowed content.
* Fix lint.
* Attempt to map it into Node.
* Second attempt.
* Step by step.
* One more step.
* Fix lint.
* Please ...
* Removing collect.
* Revert "Removing collect." This reverts commit 2f7ec04dc84df3cc5488625a4fcb492fdc3545e2.
---------
Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>