LM tokenizer can't handle whitespace token #7406

howl-anderson · 2020-11-30T07:12:12Z

Rasa version: 2.1.1

Python version: 3.7.6

Operating system (windows, osx, ...): Ubuntu 18.04.5 LTS

Issue:
When a training sample contains a " " (whitespace character), e.g. '你好，我是 Silly，一个专注天气预报的对话机器人。' ("Hello, I'm Silly, a chatbot focused on weather forecasts."),
rasa train raises ValueError: not enough values to unpack (expected 2, got 0).

Error (including full traceback):

  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/train.py", line 114, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/model.py", line 204, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 808, in train
    batch_docs = self._get_docs_for_batch(batch_messages, attribute)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 754, in _get_docs_for_batch
    batch_examples, attribute
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 382, in _get_token_ids_for_batch
    example, attribute
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 351, in _tokenize_example
    split_token_ids, split_token_strings
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 288, in _lm_specific_token_cleanup
    return model_tokens_cleaners[self.model_name](split_token_ids, token_strings)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/utils/hugging_face/transformers_pre_post_processors.py", line 234, in bert_tokens_cleaner
    return cleanup_tokens(list(zip(token_ids, token_strings)), "##")
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/utils/hugging_face/transformers_pre_post_processors.py", line 26, in cleanup_tokens
    token_ids, token_strings = zip(*token_ids_string)
ValueError: not enough values to unpack (expected 2, got 0)
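
The last frame hints at the mechanism. If the LM sub-tokenizer returns nothing for a whitespace-only token, cleanup_tokens would receive an empty list, and zip(*[]) yields no values to unpack. Below is a minimal sketch in plain Python under that assumption. unpack_pairs is a hypothetical stand-in built around the failing statement, not Rasa code, and the token ids are made up.

```python
def unpack_pairs(token_ids_string):
    """Hypothetical stand-in for the failing line in cleanup_tokens."""
    # Same statement as in the traceback: unzip (id, string) pairs.
    token_ids, token_strings = zip(*token_ids_string)
    return token_ids, token_strings


# A normal token yields at least one (id, string) pair, so unpacking works.
# The ids here are invented for illustration only.
ok = unpack_pairs([(101, "sil"), (102, "##ly")])

# An empty list (the assumed result for a whitespace-only token) fails,
# because zip() with no arguments produces nothing to unpack.
try:
    unpack_pairs([])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

If the assumption holds, any token whose text is only whitespace would trigger the error, regardless of the rest of the sentence.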

Command or request that led to error:

rasa train

config.yml:

version: "2.0"
language: zh
pipeline:
    - name: JiebaTokenizer
    - name: LanguageModelFeaturizer
      model_name: bert
      model_weights: bert-base-chinese
...

I already know how to fix it and will submit a PR for this later.
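
One possible guard, sketched under the assumption that an empty sub-token list is the trigger (this is not necessarily what the eventual PR does): skip any token for which the LM tokenizer returns nothing, so an empty list never reaches the cleanup step. fake_lm_tokenize and tokenize_tokens are hypothetical names, and the ids are fake.

```python
from typing import List, Tuple


def fake_lm_tokenize(text: str) -> Tuple[List[int], List[str]]:
    """Hypothetical stand-in for an LM sub-tokenizer.

    Splits on whitespace, so a whitespace-only input yields empty lists.
    """
    pieces = text.split()
    return [len(p) for p in pieces], pieces  # fake ids: piece length


def tokenize_tokens(tokens: List[str]) -> List[Tuple[str, List[str]]]:
    """Sub-tokenize each token, skipping those that produce nothing."""
    result = []
    for token in tokens:
        ids, strings = fake_lm_tokenize(token)
        if not ids:
            # Whitespace-only token: nothing to clean up, so skip it
            # instead of passing an empty list downstream.
            continue
        result.append((token, strings))
    return result


kept = tokenize_tokens(["我", "是", " ", "Silly"])
print(kept)
```

The whitespace token is dropped here, while the remaining tokens keep their sub-token strings.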


sara-tagger · 2020-12-01T08:51:20Z

Thanks for the issue, @tmbo will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

wochinge · 2021-01-29T09:46:58Z

Closed by #7407.

howl-anderson added the area:rasa-oss 🎡 and type:bug 🐛 labels Nov 30, 2020

howl-anderson mentioned this issue Nov 30, 2020

fix whitespace LM tokenize issue #7407

Merged

4 tasks

wochinge closed this as completed Jan 29, 2021
