Model hangs on eval #15
Comments
I'm wondering if this issue may be related: pytorch/pytorch#20275
hey Garrett at Madison! beautiful city, still have fond memories of it (worked at Epic Systems for a year right out of college). yup, i think i may have an idea of what's wrong, will throw in a fix in an hour
@GarrettMerz basically i'm incorrectly caching by the sequence length, but it should cache the longest sequence length and slice that cache for any subsequent calls with shorter sequences
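For anyone following along, here is a minimal sketch of the caching strategy described above. It is illustrative only, not the library's actual code; the class and attribute names (`RotaryFreqCache`, `max_cached_len`) are made up. The idea is to compute the frequencies once for the longest sequence seen so far and slice the cached tensor for shorter calls.

```python
import torch
from torch import nn


class RotaryFreqCache(nn.Module):
    """Illustrative sketch: cache freqs for the longest length seen so far,
    then slice the cache for shorter subsequent calls."""

    def __init__(self, dim, theta=10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        self.cached_freqs = None
        self.max_cached_len = 0

    def forward(self, seq_len):
        # Recompute only when a longer sequence arrives; otherwise reuse the
        # cached tensor and slice it down to the requested length.
        if seq_len > self.max_cached_len:
            t = torch.arange(seq_len, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum('i,j->ij', t, self.inv_freq)
            self.cached_freqs = torch.cat((freqs, freqs), dim=-1)
            self.max_cached_len = seq_len
        return self.cached_freqs[:seq_len]
```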
@GarrettMerz want to give 0.5.0 a try and see if it still hangs?
Updating this with results: this largely seems to fix things. I still see hanging behavior in cases where the max output length is large and the model does not produce an EOS token before hitting the max length, i.e. if it gets stuck outputting something nonsensical like "++++++++++...", which may happen in early epochs. The length of this bad output is then cached, which causes RoPE to slow down a lot. Capping the max output length at a reasonable size generally seems to mitigate this, which is a good enough fix for now.
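To make that mitigation concrete, the sketch below shows a hypothetical greedy-decoding loop where `max_output_len` hard-caps generation; `model`, `bos_id`, and `eos_id` are assumed names, and the `model(src, ys)` call signature is an assumption rather than this project's API. The cap bounds the longest sequence that can ever reach the rotary cache, even when the model never emits EOS.

```python
import torch


@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_output_len=256):
    """Hypothetical greedy decoding loop; model(src, ys) is assumed to return
    next-token logits of shape (batch, tgt_len, vocab) for an enc-dec model."""
    ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_output_len):
        logits = model(src, ys)                       # assumed forward signature
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_tok], dim=1)
        # Stop early on EOS; the hard cap above bounds the worst case, so a
        # degenerate "++++..." output cannot grow the rotary cache unboundedly.
        if (next_tok == eos_id).all():
            break
    return ys
```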
@GarrettMerz sounds good, as long as it does not hang anymore. best with your research and life out in the midwest
May need to reopen this, it seems that things are still hanging! I'm going to investigate more to figure out when specifically it happens. I'm going to use the relative position encoding in the encoder only (not the decoder) and see if that helps at all.
hmm, yea, i'll wait for more info from your end. you are the only one reporting this
@GarrettMerz could you try turning off the cache altogether? https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py#L82 just to confirm that it is indeed caused by the freqs caching and not something on your end?
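For reference, "turning off the cache" amounts to recomputing the frequencies on every forward call. The snippet below is a hedged sketch of that experiment in isolation, not the library's API; the function name is made up.

```python
import torch


def rotary_freqs_no_cache(inv_freq, seq_len):
    # Recompute the rotary frequencies on every call instead of reusing a cached
    # tensor. If the eval hang disappears on this path, the freqs caching is the
    # likely culprit; if it persists, the cause is somewhere else.
    t = torch.arange(seq_len, device=inv_freq.device, dtype=inv_freq.dtype)
    freqs = torch.einsum('i,j->ij', t, inv_freq)
    return torch.cat((freqs, freqs), dim=-1)
```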
Hi, I was also encountering this issue with the latest commit (v0.5.3). On my end, I can confirm that this is caused by caching. Disabling the freqs cache, as suggested above, resolves the hang for me.
I still seem to be encountering this issue in some cases; I'm investigating more to confirm it's not an implementation problem on my end.
I'm seeing consistent hangs with 8-GPU training.
@lunixbochs I see! thank you for this info. I'll try standardizing the cache to the same tensor shape across devices and ping you to give it a try when it's done
the rope cache in fairseq seems to work fine: https://github.com/facebookresearch/fairseq/blob/920a548ca770fb1a951f7f4289b4d3a0c1bc226f/fairseq/modules/rotary_positional_embedding.py#L28
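One way to standardize the cache shape across devices, in the spirit of the cache linked above, is to precompute cos/sin up to a fixed maximum length at construction time, so every rank carries identically shaped buffers and takes the same code path during eval. This is a minimal sketch with assumed names (`FixedLengthRotary`, `max_seq_len`), not the fairseq or rotary-embedding-torch implementation.

```python
import torch
from torch import nn


class FixedLengthRotary(nn.Module):
    """Sketch: precompute cos/sin up to a fixed max length so all ranks share
    identically shaped buffers and identical control flow (no data-dependent
    cache rebuilds mid-eval that could desynchronize distributed workers)."""

    def __init__(self, dim, max_seq_len=4096, theta=10000.0):
        super().__init__()
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        t = torch.arange(max_seq_len).float()
        freqs = torch.einsum('i,j->ij', t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer('cos_cached', emb.cos(), persistent=False)
        self.register_buffer('sin_cached', emb.sin(), persistent=False)

    def forward(self, seq_len):
        # Buffers have the same shape everywhere; only the slice length varies.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```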
@lunixbochs sounds good, I'll take a look only if I can't figure it out
@lunixbochs want to try 0.8.2 and see if it resolves your issue?
just updated, not hanging for me anymore. thanks!
if you see anything wrong with the loss curve, let me know; made some risky changes
Hi! I'm running an enc-dec transformer with RoPE in the first self-attention layer of the encoder and decoder. I'm noticing that in the eval stage, my model hangs until my job times out after about 7 epochs; when running without this package, i.e. with standard learnable positional encoding or Shaw-style relative position encoding, I do not see this behavior. Are there any obvious places in this package that might lead to a memory leak or something similar (i.e. does register_buffer play nice with torch.no_grad())?