
run_squad_trainer doesn't actually use a Rust tokenizer + errors in squad_convert_example_to_features when using a Rust tokenizer #7492

Closed
k8si opened this issue Sep 30, 2020 · 4 comments

k8si commented Sep 30, 2020

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.14.35-1902.303.4.1.el7uek.x86_64-x86_64-with-oracle-7.8
  • Python version: 3.6.8
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

and

  • transformers version: 3.3.1
  • Platform: macOS-10.15.6-x86_64-i386-64bit
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

@mfuntowicz
@LysandreJik
@patil-suraj

Information

Model I am using (Bert, XLNet ...): bert-base-uncased

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

Firstly, in run_squad_trainer.py, I noticed that the "use_fast" arg doesn't get propagated into the tokenizer instantiation:

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
    )

It should probably be:

    tokenizer = AutoTokenizer.from_pretrained(
        model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
        cache_dir=model_args.cache_dir,
        use_fast=model_args.use_fast,
    )
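
A quick way to confirm the flag actually takes effect (a minimal sketch, assuming transformers 3.3.x; not part of the original script) is to check which class AutoTokenizer resolves to:

    from transformers import AutoTokenizer, PreTrainedTokenizerFast

    # With use_fast=True, AutoTokenizer should hand back the Rust-backed class,
    # e.g. BertTokenizerFast for bert-base-uncased.
    tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=True)
    print(type(tok).__name__)                        # expected: BertTokenizerFast
    print(isinstance(tok, PreTrainedTokenizerFast))  # expected: True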

However, when I make that change, the script hangs at the call to squad_convert_examples_to_features in SquadProcessor.

So, I did a little digging. The error is in squad_convert_example_to_features and seems to be due to inconsistencies in the behavior of tokenizer.encode_plus between the Python and Rust tokenizers, detailed below (see also the short sketch after the list). I've also provided a gist that hopefully illustrates each of these points and will help reproduce them. I tested both BertTokenizer/BertTokenizerFast and GPT2Tokenizer/GPT2TokenizerFast.

  1. The Python tokenizers handle negative values for stride, while the Rust tokenizers throw an exception (OverflowError: can't convert negative int to unsigned).

  2. For sequence pairs, the Python tokenizers are fine if the first arg (text) is a list of ints and the second arg (text_pair) is a list of strings. The Rust tokenizers throw an exception: ValueError: PreTokenizedInputSequence must be Union[List[str], Tuple[str]]. (Furthermore, the type hints for these arguments indicate that a string, a list of strings, or a list of ints are all fine.)

  3. With the is_split_into_words kwarg left at its default value (False), tokenizer.encode_plus(list of ints) works fine for the Python tokenizers. The Rust tokenizers raise an exception: ValueError: TextInputSequence must be str.

  4. When running on a pair of sequences and setting return_tensors=None, the Python tokenizers return an output dict with input_ids (and other elements) as a list of ints, i.e. input_ids = [id1, id2, ...], whereas the Rust tokenizers return a dict with input_ids as a list of lists of ints, i.e. input_ids = [[id1, id2, ...]]. I also noticed that if you set return_tensors="pt", both the Python and Rust tokenizers return input_ids = tensor([[id1, id2, ...]]).

  5. When return_overflowing_tokens=True, the Python tokenizers return a list of the overflowing tokens at key overflowing_tokens as expected. The Rust tokenizers return them at key overflow_to_sample_mapping, which is not documented anywhere, as far as I can tell. The values also seem to differ between the Python and Rust outputs.

  6. Running the same procedure on the same input twice produces the same result each time for the Python tokenizer. For the Rust tokenizer, the result of the second run is different. I am not familiar enough with the Rust tokenizer internals at this point to have a theory as to why this is the case. Anyway, this is the point at which I stopped debugging and decided to file an issue.
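
For reference, here is a minimal sketch (separate from the linked gist, assuming transformers 3.3.1 and bert-base-uncased) that reproduces points 1 and 3 outside of the SQuAD pipeline:

    from transformers import BertTokenizer, BertTokenizerFast

    slow = BertTokenizer.from_pretrained("bert-base-uncased")
    fast = BertTokenizerFast.from_pretrained("bert-base-uncased")

    # Point 3: the Python tokenizer accepts a list of ints as `text` ...
    ids = slow.encode("what is a tokenizer?", add_special_tokens=False)
    print(slow.encode_plus(ids)["input_ids"])
    # ... but the Rust tokenizer rejects it.
    try:
        fast.encode_plus(ids)
    except ValueError as e:
        print("fast:", e)  # TextInputSequence must be str

    # Point 1: the Python tokenizer tolerates a negative stride, while the Rust
    # backend expects an unsigned int and overflows.
    kwargs = dict(max_length=32, truncation="only_second", stride=-1,
                  return_overflowing_tokens=True)
    question, context = "What is a tokenizer?", "A tokenizer splits text. " * 50
    slow.encode_plus(question, context, **kwargs)   # no exception
    try:
        fast.encode_plus(question, context, **kwargs)
    except OverflowError as e:
        print("fast:", e)  # can't convert negative int to unsigned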

To reproduce

Steps to reproduce the behavior:

  1. Download the SQuAD 2.0 dataset from the official SQuAD website
  2. Make the fix in run_squad_trainer.py described above so that a Rust tokenizer is correctly instantiated
  3. Run script: python examples/question-answering/run_squad_trainer.py --model_name_or_path bert-base-uncased --use_fast --output_dir "./outputs-squad" --do_train --data_dir "./squad-data" --version_2_with_negative

Also see gist detailing issues described above: https://gist.github.com/k8si/a143346dfa875c28d98e95cba1f82f1b

Expected behavior

  1. I expected run_squad_trainer.py to use a Rust tokenizer when the use_fast arg was set to True
  2. I expected SquadProcessor.squad_convert_example_to_features to not raise exceptions when processing SQuAD data with a Rust tokenizer
  3. I expected tokenizer.encode_plus to return the same outputs given the same inputs, regardless of whether the tokenizer is a Rust tokenizer or a Python tokenizer
LysandreJik (Member) commented:
Hello! Indeed, the Rust tokenizers are not handled by the SQuAD data processing. This is one item we would like to resolve when refactoring the data processing methods, which will soon be implemented directly in datasets rather than in transformers.

Thank you for your detailed issue!

k8si (Author) commented Oct 15, 2020

For what it's worth, my main concern is the behavioral differences between the Python and Rust tokenizers, not the SQuAD data processing itself (I can easily write my own SQuAD processor, but writing my own tokenizers is more work; the tokenizers are part of why I've been using the library in the first place).

It places a sizeable burden on people coding against the library that different kwargs result in completely different behaviors (and different errors) between the Rust and Python implementations, and these differences often aren't documented. And item 6 seems like a fundamental error somewhere in the Rust codebase.

Are there any plans to address these issues in the near future? For what it's worth, I've never had an issue with the Python tokenizers. I'd like to use the Rust ones because they're Fast, plus I can train them myself easily, but navigating the API weirdness has been a slog.

LysandreJik (Member) commented Oct 16, 2020

You're correct that there is currently a mismatch between the Python and Rust tokenizers, and thank you for writing such a detailed issue explaining all of your pain points. We'll keep a close eye on the issues mentioned here as we continue working on and improving the compatibility between the two APIs, which is something we will be focusing on in the near future.

stale bot commented Dec 15, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Dec 15, 2020
stale bot closed this as completed Dec 24, 2020