
TransformerEncoder is not causal #2478

Closed
bigheary opened this issue Jun 18, 2023 · 10 comments · Fixed by #2662

Comments


bigheary commented Jun 18, 2023

Add Link

https://pytorch.org/tutorials/beginner/transformer_tutorial.html

[screenshot from the tutorial]

For language modeling, src_mask should mask future words.

Describe the bug

Is there anything wrong here? The tutorial's src_mask does not appear to mask future positions, so the encoder is not causal.

Describe your environment

Colab

cc @pytorch/team-text-core @Nayef211 @sekyondaMeta @svekars @carljparker @NicolasHug @kit1980 @subramen

@bigheary bigheary added the bug label Jun 18, 2023

jorendorff commented Jun 25, 2023

Yes, this has to be a bug. I tried using the resulting model to do some basic predictions and it was never able to give a prediction for the next word that made any sense, despite reaching a perplexity of 2.9 on the validation and test data. Without a mask, every word except the last (the one that isn't present in the input) should be easy for the model to predict flawlessly.

@jorendorff

You can play with it here (a copy of the notebook with a cell added at the end for making predictions).

jorendorff added a commit to jorendorff/machine-learning that referenced this issue Jun 27, 2023
At this point I noticed it is totally broken and found pytorch/tutorials#2478.
@xanderex-sid

/assigntome

@nvs-abhilash

/assigntome

@nvs-abhilash nvs-abhilash removed their assignment Nov 3, 2023
@xanderex-sid

/assigntome

@xanderex-sid

@bigheary Can you tell me more specifically why you think src_mask should mask future words (i.e., act as a tgt_mask)?

@jorendorff

The text of the tutorial says:

> Along with the input sequence, a square attention mask is required because the self-attention layers in nn.TransformerDecoder are only allowed to attend the earlier positions in the sequence. For the language modeling task, any tokens on the future positions should be masked.
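
For reference, the kind of square mask the quoted text describes can be generated with PyTorch's built-in helper. This is a minimal sketch, not the tutorial's exact code:

```python
import torch.nn as nn

# Square causal ("subsequent") mask: position i may attend only to positions <= i.
# Entries above the diagonal are -inf, so softmax assigns them zero attention weight.
seq_len = 5
src_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
print(src_mask)
# tensor([[0., -inf, -inf, -inf, -inf],
#         [0., 0., -inf, -inf, -inf],
#         [0., 0., 0., -inf, -inf],
#         [0., 0., 0., 0., -inf],
#         [0., 0., 0., 0., 0.]])
```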


jorendorff commented Nov 6, 2023

The purpose of the mask is to prevent the model from seeing the token it's supposed to be predicting. Without a mask, the task is trivial and the model doesn't learn.

Other tutorials also include this mask. Andrej Karpathy's video, for example, lingers for 15 minutes on the subject, starting here, and then later comes back and discusses it again, saying "you have to mask ... so that nodes from the future never talk with the past, because they would give away the answer". (He points out that non-predictive applications don't need the mask, but the tutorial at issue here begins "This is a tutorial on training a model to predict the next word[...].")
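
To illustrate the point, here is a minimal sketch (with assumed layer sizes, not the tutorial's code) of passing a causal mask to nn.TransformerEncoder so that no position can attend to the token it is being trained to predict:

```python
import torch
import torch.nn as nn

seq_len, batch_size, d_model = 10, 2, 64

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

src = torch.randn(seq_len, batch_size, d_model)  # (seq, batch, d_model) layout

# Without a mask, every position attends to the whole sequence, including
# future tokens, which makes next-word prediction trivial during training.
out_no_mask = encoder(src)

# With a causal mask, position i can only attend to positions <= i.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
out_causal = encoder(src, mask=causal_mask)
```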

@xanderex-sid xanderex-sid removed their assignment Nov 7, 2023

ahoblitz commented Nov 8, 2023

/assigntome

@ahoblitz ahoblitz mentioned this issue Nov 8, 2023

ahoblitz commented Nov 8, 2023

Added a generate_square_subsequent_mask and trained for 30 epochs to get the following, based on some of the discussion here:
[screenshot of training results, 2023-11-07]

Let me know if this follows the spirit of the conversation or not.
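
For readers following along, the change presumably boils down to building the causal mask and passing it through the model's forward call during training. A rough sketch with hypothetical names, not the exact code from the fix:

```python
import torch
import torch.nn as nn

def train_step(model, data, targets, criterion, optimizer, device):
    """One training step with a causal mask (illustrative names, not the PR's code)."""
    seq_len = data.size(0)
    # Causal mask so the model cannot see the token it is asked to predict.
    src_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(device)

    optimizer.zero_grad()
    output = model(data, src_mask)  # forward pass with the mask applied
    loss = criterion(output.view(-1, output.size(-1)), targets.view(-1))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    optimizer.step()
    return loss.item()
```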
