Model hangs on eval #15

Open
GarrettMerz opened this issue Dec 13, 2023 · 18 comments

Comments
@GarrettMerz

GarrettMerz commented Dec 13, 2023

Hi! I'm running an enc-dec transformer with RoPE in the first self-attention layer of the encoder and decoder. I'm noticing that in the eval stage of my model, it hangs until my job times out, after about 7 epochs; when running without this package, i.e. with standard learnable positional encoding or Shaw-style relative position encoding, I do not see this behavior. Are there any obvious places in this package that might lead to a memory leak or similar (i.e. does the register_buffer play nicely with torch.no_grad())?
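For reference, a minimal usage sketch along the lines of the package README (the rotary dim and tensor shapes below are placeholders, not the actual config):

```python
import torch
from rotary_embedding_torch import RotaryEmbedding

# rotary embedding applied to the first self-attention layer's queries / keys
rotary_emb = RotaryEmbedding(dim = 32)

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq len, head dim)
k = torch.randn(1, 8, 1024, 64)

q = rotary_emb.rotate_queries_or_keys(q)
k = rotary_emb.rotate_queries_or_keys(k)
```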

@GarrettMerz
Author

I'm wondering if this issue may be related: pytorch/pytorch#20275

@lucidrains
Owner

hey Garrett at Madison! beautiful city, still have fond memories of it (worked at Epic Systems for a year right out of college)

yup, i think i may have an idea of what's wrong, will throw in a fix in an hour

@lucidrains
Owner

@GarrettMerz basically i'm incorrectly caching per sequence length, when it should cache the frequencies for the longest sequence length seen and slice them out for any subsequent calls with shorter sequences
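An illustrative sketch of that caching pattern (not the actual rotary-embedding-torch code; the module and attribute names here are made up):

```python
import torch
from torch import nn

class RotaryFreqs(nn.Module):
    # sketch of "cache the longest seen, slice for shorter" frequency caching
    def __init__(self, dim, theta = 10000):
        super().__init__()
        inv_freq = 1. / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq, persistent = False)
        self.cached_freqs = None  # hypothetical cache holding freqs for the longest seq len seen so far

    def forward(self, seq_len, device):
        if self.cached_freqs is None or self.cached_freqs.shape[0] < seq_len:
            t = torch.arange(seq_len, device = device).type_as(self.inv_freq)
            self.cached_freqs = torch.outer(t, self.inv_freq)
        # slice the cache for any call with an equal or shorter sequence length
        return self.cached_freqs[:seq_len]
```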

lucidrains added a commit that referenced this issue Dec 14, 2023
@lucidrains
Owner

@GarrettMerz want to give 0.5.0 a try and see if it still hangs?

@GarrettMerz
Author

GarrettMerz commented Jan 3, 2024

Updating this with results: this largely seems to fix things. I still see hanging behavior in cases where the max output length is large and the model does not produce an EOS token before hitting that max length, i.e. if it gets stuck outputting something nonsensical like "++++++++++...", which can happen in early epochs. The length of that bad output is then cached, which causes RoPE to slow down a lot. Capping the max output length at a reasonable size generally seems to mitigate this, which is a good enough fix for now.
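For context, a generic sketch of that mitigation (a plain greedy decode loop with a capped max length; the model interface, token ids, and max_len here are assumptions for illustration, not from this repo):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len = 256):
    # cap max_len so a degenerate output ("++++...") cannot grow the cached
    # rotary frequencies to an unreasonably large sequence length
    out = torch.full((src.shape[0], 1), bos_id, dtype = torch.long, device = src.device)
    for _ in range(max_len - 1):
        logits = model(src, out)                       # assumed (src, tgt) -> logits interface
        next_tok = logits[:, -1].argmax(dim = -1, keepdim = True)
        out = torch.cat((out, next_tok), dim = -1)
        if (next_tok == eos_id).all():                 # stop early once every sequence emits EOS
            break
    return out
```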

@lucidrains
Owner

@GarrettMerz sounds good, as long as it does not hang anymore

best with your research and life out in the midwest

@GarrettMerz
Author

May need to reopen this, it seems that things are still hanging! I'm going to investigate more to figure out when specifically it happens; I'll try using the relative position encoding in the encoder only (not the decoder) and see if that helps at all.

@lucidrains
Owner

hmm, yea, i'll wait for more info from your end

you are the only one reporting this

@lucidrains
Owner

@GarrettMerz could you try turning off the cache altogether? https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py#L82 just to confirm that it is indeed caused by the freqs caching and not something on your end?

@jozhang97

Hi, I was also encountering this issue with the latest commit (v0.5.3). On my end, I can confirm that this is caused by caching. Setting cache_if_possible=False worked for me.
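For anyone else hitting this, disabling the cache looks roughly like the following (the dim value is just an example):

```python
from rotary_embedding_torch import RotaryEmbedding

# turn off freqs caching entirely, to rule it out or work around the hang
rotary_emb = RotaryEmbedding(dim = 32, cache_if_possible = False)
```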

@GarrettMerz
Author

I still seem to be encountering this issue in some cases; I'm investigating more to confirm it's not an implementation problem on my end.

@lunixbochs

I'm seeing consistent hangs with 8-GPU accelerate-based DDP.
It hangs on NCCL comms after the first step if I'm using 3+ GPUs.
My sequence lengths differ both per step and per accelerator.
There's no issue if the cache is disabled, with 2 GPUs, or if I use a fixed sequence length.
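For what it's worth, a generic sketch of the fixed-sequence-length workaround (a plain PyTorch collate function; MAX_LEN and PAD_ID are placeholders, nothing from this repo):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 512   # assumed fixed cap, so every rank sees identical shapes
PAD_ID = 0      # assumed padding token id

def collate_fixed_len(batch):
    # pad (and truncate) every sample to the same fixed length so that any
    # cached rotary buffers end up with identical shapes on all DDP ranks
    seqs = [torch.as_tensor(x[:MAX_LEN]) for x in batch]
    padded = pad_sequence(seqs, batch_first = True, padding_value = PAD_ID)
    if padded.shape[1] < MAX_LEN:
        fill = torch.full((padded.shape[0], MAX_LEN - padded.shape[1]), PAD_ID, dtype = padded.dtype)
        padded = torch.cat((padded, fill), dim = 1)
    return padded
```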

@lucidrains
Owner

@lunixbochs I see! thank you for this info

I'll try standardizing the cache to the same tensor shape across devices and ping you to give it a try when it's done

lucidrains reopened this Sep 6, 2024
@lunixbochs

the rope cache in fairseq seems to work fine: https://github.com/facebookresearch/fairseq/blob/920a548ca770fb1a951f7f4289b4d3a0c1bc226f/fairseq/modules/rotary_positional_embedding.py#L28

@lucidrains
Owner

@lunixbochs sounds good, I'll take a look only if I can't figure it out

lucidrains added a commit that referenced this issue Sep 6, 2024
@lucidrains
Owner

lucidrains commented Sep 6, 2024

@lunixbochs want to try 0.8.2 and see if it resolves your issue?

lucidrains added a commit that referenced this issue Sep 7, 2024
@lunixbochs

just updated, not hanging for me anymore. thanks!

@lucidrains
Owner

if you see anything wrong with the loss curve, let me know; i made some risky changes
