Dimension mismatch after setting max sequence length
Summary:
TokenTensorizer and ByteTokenTensorizer handle max sequence length differently. Usually this causes no problem, as long as a model does not use both tensorizers to process its inputs and targets.
The smart keyboard model, however, uses TokenTensorizer to process labels and ByteTokenTensorizer to process text inputs, which leads to a dimension mismatch whenever a sentence is longer than the max sequence length.

```
TokenTensorizer:     len(<BOS> + tokens + <EOS>) <= max sequence length
ByteTokenTensorizer: len(tokens)                 <= max sequence length (BOS/EOS added afterwards)
```
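
For concreteness, a minimal sketch of the mismatch (the token values and the max sequence length of 5 are hypothetical, not from this diff):

```python
# Hypothetical illustration of the behavior before this diff.
max_seq_len = 5
tokens = ["the", "quick", "brown", "fox", "jumps", "over"]

# TokenTensorizer: BOS/EOS count toward the limit, so only 3 real tokens survive.
labels = ["<BOS>"] + tokens[: max_seq_len - 2] + ["<EOS>"]
print(len(labels))  # 5

# ByteTokenTensorizer (pre-fix): truncates the text to max_seq_len first,
# then adds BOS/EOS on top, overshooting the limit.
inputs = ["<BOS>"] + tokens[:max_seq_len] + ["<EOS>"]
print(len(inputs))  # 7 -> dimension mismatch against the 5-long label sequence
```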

This diff changes the way ByteTokenTensorizer truncates text to the max sequence length so that it matches TokenTensorizer.

Reviewed By: psuzhanhy

Differential Revision: D18566684

fbshipit-source-id: 114af0e23b3bc66796371fabf8baee841dddd51b
Fan Wang authored and facebook-github-bot committed Nov 20, 2019
1 parent c35d513 commit 69bfb1e
Showing 1 changed file with 2 additions and 1 deletion.
pytext/data/tensorizers.py

```diff
@@ -456,7 +456,8 @@ def column_schema(self):
 
     def numberize(self, row):
         """Convert text to bytes, pad batch."""
-        tokens = self.tokenizer.tokenize(row[self.text_column])[: self.max_seq_len]
+        tokens = self.tokenizer.tokenize(row[self.text_column])[
+            : (self.max_seq_len - self.add_bos_token - self.add_eos_token)]
         if self.add_bos_token:
             bos = EOS if self.use_eos_token_for_bos else BOS
             tokens = [Token(bos, -1, -1)] + tokens
```
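
As a sanity check, a standalone sketch of the new slicing arithmetic (values are hypothetical; Python booleans such as `add_bos_token` act as 0/1 in arithmetic, which is what makes the subtraction in the diff work):

```python
# Hypothetical standalone version of the new truncation logic.
max_seq_len = 5
add_bos_token = True   # bool, counts as 1 when subtracted
add_eos_token = True

tokens = ["the", "quick", "brown", "fox", "jumps", "over"]
tokens = tokens[: max_seq_len - add_bos_token - add_eos_token]
if add_bos_token:
    tokens = ["<BOS>"] + tokens
if add_eos_token:
    tokens = tokens + ["<EOS>"]

assert len(tokens) <= max_seq_len  # invariant now matches TokenTensorizer
```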
