LM tokenizer can't handle whitespace token #7406

Closed
howl-anderson opened this issue Nov 30, 2020 · 2 comments
Labels
area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors.

Comments

@howl-anderson
Contributor

Rasa version: 2.1.1

Python version: 3.7.6

Operating system (windows, osx, ...): Ubuntu 18.04.5 LTS

Issue:
When a training example contains a whitespace character (" "), e.g. '你好,我是 Silly,一个专注天气预报的对话机器人。' ("Hello, I am Silly, a conversational bot focused on weather forecasts."),
rasa train raises ValueError: not enough values to unpack (expected 2, got 0).

Error (including full traceback):

  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/train.py", line 114, in train
    interpreter = trainer.train(training_data, **kwargs)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/model.py", line 204, in train
    updates = component.train(working_data, self.config, **context)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 808, in train
    batch_docs = self._get_docs_for_batch(batch_messages, attribute)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 754, in _get_docs_for_batch
    batch_examples, attribute
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 382, in _get_token_ids_for_batch
    example, attribute
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 351, in _tokenize_example
    split_token_ids, split_token_strings
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/featurizers/dense_featurizer/lm_featurizer.py", line 288, in _lm_specific_token_cleanup
    return model_tokens_cleaners[self.model_name](split_token_ids, token_strings)
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/utils/hugging_face/transformers_pre_post_processors.py", line 234, in bert_tokens_cleaner
    return cleanup_tokens(list(zip(token_ids, token_strings)), "##")
  File "/home/howl/data/Envs/rasa_weather_bot-rasa-v2/lib/python3.7/site-packages/rasa/nlu/utils/hugging_face/transformers_pre_post_processors.py", line 26, in cleanup_tokens
    token_ids, token_strings = zip(*token_ids_string)
ValueError: not enough values to unpack (expected 2, got 0)
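
The last frame of the traceback suggests the mechanism: if the list of (id, string) pairs reaching cleanup_tokens is empty, zip(*[]) yields nothing and the two-name unpacking fails. The sketch below reproduces that in isolation. Note that cleanup_tokens_sketch is a simplified, hypothetical stand-in written from the two lines visible in the traceback, not Rasa's actual implementation. The assumption that a whitespace-only token produces no sub-tokens is inferred from the error and not confirmed in this issue. The token ids are arbitrary.

```python
def cleanup_tokens_sketch(token_ids_string, delimiter):
    """Hypothetical stand-in for the cleanup step named in the traceback."""
    # Strip the sub-word delimiter (e.g. "##" for BERT) from each string.
    cleaned = [(tid, s.replace(delimiter, "")) for tid, s in token_ids_string]
    # Drop entries whose string became empty.
    cleaned = [(tid, s) for tid, s in cleaned if s]
    # With an empty list, zip(*cleaned) yields nothing, so this unpacking fails.
    token_ids, token_strings = zip(*cleaned)
    return token_ids, token_strings


# A token split into sub-words is cleaned as expected (ids are arbitrary).
ok = cleanup_tokens_sketch([(101, "play"), (102, "##ing")], "##")

# Assumption: a whitespace-only token yields no sub-tokens, i.e. an empty list.
try:
    cleanup_tokens_sketch([], "##")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Here ok holds the cleaned ids and strings, while error_message captures the same kind of ValueError reported above.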

Command or request that led to error:

rasa train

config.yml:

version: "2.0"
language: zh
pipeline:
    - name: JiebaTokenizer
    - name: LanguageModelFeaturizer
      model_name: bert
      model_weights: bert-base-chinese
...

I already know how to fix it and will submit a PR for this later.
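
For readers hitting the same error, one possible guard is sketched here. It is not necessarily what the PR does: cleanup_tokens_guarded is a hypothetical function modeled on the cleanup call visible in the traceback, and returning empty tuples for a token with no surviving sub-tokens is an assumption about what callers can tolerate.

```python
from typing import List, Tuple


def cleanup_tokens_guarded(
    token_ids_string: List[Tuple[int, str]], delimiter: str
) -> Tuple[Tuple[int, ...], Tuple[str, ...]]:
    """Hypothetical variant of the cleanup step that tolerates empty input."""
    # Strip the sub-word delimiter and drop entries that become empty.
    cleaned = [(tid, s.replace(delimiter, "")) for tid, s in token_ids_string]
    cleaned = [(tid, s) for tid, s in cleaned if s]
    # Guard: nothing left to unpack, so return empty tuples instead of raising.
    if not cleaned:
        return (), ()
    token_ids, token_strings = zip(*cleaned)
    return token_ids, token_strings


# Empty input (e.g. from a whitespace-only token) no longer raises.
empty_result = cleanup_tokens_guarded([], "##")

# Regular input is unchanged (ids are arbitrary).
normal_result = cleanup_tokens_guarded([(1, "a"), (2, "##b")], "##")
```

A caller would still have to decide how to treat a token that maps to no sub-tokens, for example by skipping whitespace-only tokens before they reach the language model tokenizer.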

@howl-anderson howl-anderson added area:rasa-oss 🎡 Anything related to the open source Rasa framework type:bug 🐛 Inconsistencies or issues which will cause an issue or problem for users or implementors. labels Nov 30, 2020
@sara-tagger
Collaborator

Thanks for the issue, @tmbo will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

@wochinge
Contributor

Closed by #7407.
