Dimension mismatch after setting max sequence length
Summary:
TokenTensorizer and ByteTokenTensorizer handle max sequence length differently. Usually this causes no problem, as long as a model does not use both tensorizers to process its inputs and targets.
The smart keyboard model, however, uses TokenTensorizer to process labels and ByteTokenTensorizer to process text inputs, which leads to a dimension mismatch whenever a sentence is longer than the max sequence length.

```
TokenTensorizer:     len(<BOS> + tokens + <EOS>) <= max sequence length
ByteTokenTensorizer: len(tokens)                 <= max sequence length (BOS/EOS added afterwards)
```
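
For concreteness, a minimal sketch of the mismatch (the token values and the max sequence length of 5 are hypothetical, not from this diff):

```python
# Hypothetical illustration of the behavior before this diff.
max_seq_len = 5
tokens = ["the", "quick", "brown", "fox", "jumps", "over"]

# TokenTensorizer: BOS/EOS count toward the limit, so only 3 real tokens survive.
labels = ["<BOS>"] + tokens[: max_seq_len - 2] + ["<EOS>"]
print(len(labels))  # 5

# ByteTokenTensorizer (pre-fix): truncates the text to max_seq_len first,
# then adds BOS/EOS on top, overshooting the limit.
inputs = ["<BOS>"] + tokens[:max_seq_len] + ["<EOS>"]
print(len(inputs))  # 7 -> dimension mismatch against the 5-long label sequence
```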

This diff changes the way ByteTokenTensorizer truncates text to the max sequence length so that it matches TokenTensorizer.

Reviewed By: psuzhanhy

Differential Revision: D18566684

fbshipit-source-id: 114af0e23b3bc66796371fabf8baee841dddd51b
Fan Wang authored and facebook-github-bot committed Nov 20, 2019
1 parent c35d513 commit 69bfb1e
Showing 1 changed file with 2 additions and 1 deletion.
pytext/data/tensorizers.py

```diff
@@ -456,7 +456,8 @@ def column_schema(self):
 
     def numberize(self, row):
         """Convert text to bytes, pad batch."""
-        tokens = self.tokenizer.tokenize(row[self.text_column])[: self.max_seq_len]
+        tokens = self.tokenizer.tokenize(row[self.text_column])[
+            : (self.max_seq_len - self.add_bos_token - self.add_eos_token)]
         if self.add_bos_token:
             bos = EOS if self.use_eos_token_for_bos else BOS
             tokens = [Token(bos, -1, -1)] + tokens
```
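
As a sanity check, a standalone sketch of the new slicing arithmetic (values are hypothetical; Python booleans such as `add_bos_token` act as 0/1 in arithmetic, which is what makes the subtraction in the diff work):

```python
# Hypothetical standalone version of the new truncation logic.
max_seq_len = 5
add_bos_token = True   # bool, counts as 1 when subtracted
add_eos_token = True

tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
tokens = tokens[: max_seq_len - add_bos_token - add_eos_token]
if add_bos_token:
    tokens = ["<BOS>"] + tokens
if add_eos_token:
    tokens = tokens + ["<EOS>"]

assert len(tokens) <= max_seq_len  # invariant now matches TokenTensorizer
```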
