Dimension mismatch after setting max sequence length
Summary: TokenTensorizer and ByteTokenTensorizer handle the max sequence length differently. Normally this causes no problems, as long as a model does not use both tensorizers to process its inputs and targets. The smart keyboard model, however, uses TokenTensorizer for the labels and ByteTokenTensorizer for the text inputs, which leads to a dimension mismatch whenever a sentence is longer than the max sequence length:

```
TokenTensorizer:     len(<BOS> + tokens + <EOS>) <= max sequence length
ByteTokenTensorizer: len(tokens)                 <= max sequence length
```

This diff changes ByteTokenTensorizer to truncate text to the max sequence length the same way TokenTensorizer does.

Reviewed By: psuzhanhy

Differential Revision: D18566684

fbshipit-source-id: 114af0e23b3bc66796371fabf8baee841dddd51b
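To illustrate the mismatch the summary describes, here is a minimal sketch, not the actual PyText implementation. The helper names (`truncate_labels`, `truncate_bytes_old`, `truncate_bytes_new`) are hypothetical, and it assumes TokenTensorizer reserves two positions for `<BOS>`/`<EOS>` while the old ByteTokenTensorizer kept up to max_seq_len raw tokens.

```python
from typing import List

BOS, EOS = "<BOS>", "<EOS>"

def truncate_labels(tokens: List[str], max_seq_len: int) -> List[str]:
    # TokenTensorizer-style: BOS/EOS count toward the limit, so only
    # max_seq_len - 2 content tokens survive truncation.
    return [BOS] + tokens[: max_seq_len - 2] + [EOS]

def truncate_bytes_old(tokens: List[str], max_seq_len: int) -> List[str]:
    # Old ByteTokenTensorizer-style: keep up to max_seq_len content tokens,
    # ignoring BOS/EOS, so the input side ends up longer than the label side.
    return tokens[:max_seq_len]

def truncate_bytes_new(tokens: List[str], max_seq_len: int) -> List[str]:
    # After the fix: mirror TokenTensorizer so both sides keep the same
    # number of content tokens per sentence.
    return tokens[: max_seq_len - 2]

if __name__ == "__main__":
    sentence = ["tok%d" % i for i in range(10)]
    max_seq_len = 6
    labels = truncate_labels(sentence, max_seq_len)          # 4 content tokens
    inputs_old = truncate_bytes_old(sentence, max_seq_len)   # 6 content tokens -> mismatch
    inputs_new = truncate_bytes_new(sentence, max_seq_len)   # 4 content tokens -> aligned
    print(len(labels) - 2, len(inputs_old), len(inputs_new))  # 4 6 4
```

With the old behavior, a 10-token sentence and max_seq_len of 6 yields 4 label tokens but 6 input tokens; after the fix both sides keep 4, so the tensor dimensions line up.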