Model hangs on eval #15

Open
GarrettMerz opened this issue Dec 13, 2023 · 18 comments

Comments
@GarrettMerz

GarrettMerz commented Dec 13, 2023

Hi! I'm running an enc-dec transformer with RoPE in the first self-attention layer of the encoder and decoder. I'm noticing that in the eval stage of my model, it hangs until my job times out, after about 7 epochs; when running without this package, i.e. with standard learnable positional encoding or Shaw-style relative position encoding, I do not see this behavior. Are there any obvious places in this package that might lead to a memory leak or similar (i.e. does the register_buffer play nicely with torch.no_grad())?
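For reference, a minimal usage sketch along the lines of the package README (the rotary dim and tensor shapes below are placeholders, not the actual config):

```python
import torch
from rotary_embedding_torch import RotaryEmbedding

# rotary embedding applied to the first self-attention layer's queries / keys
rotary_emb = RotaryEmbedding(dim = 32)

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq len, head dim)
k = torch.randn(1, 8, 1024, 64)

q = rotary_emb.rotate_queries_or_keys(q)
k = rotary_emb.rotate_queries_or_keys(k)
```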

@GarrettMerz
Author

I'm wondering if this issue may be related: pytorch/pytorch#20275

@lucidrains
Owner

hey Garrett at Madison! beautiful city, still have fond memories of it (worked at Epic Systems for a year right out of college)

yup, i think i may have an idea of what's wrong, will throw in a fix in an hour

@lucidrains
Owner

@GarrettMerz basically i'm incorrectly caching per sequence length, when it should cache the frequencies for the longest sequence length seen and slice them out for any subsequent calls with shorter sequences
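An illustrative sketch of that caching pattern (not the actual rotary-embedding-torch code; the module and attribute names here are made up):

```python
import torch
from torch import nn

class RotaryFreqs(nn.Module):
    # sketch of "cache the longest seen, slice for shorter" frequency caching
    def __init__(self, dim, theta = 10000):
        super().__init__()
        inv_freq = 1. / (theta ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq, persistent = False)
        self.cached_freqs = None  # hypothetical cache holding freqs for the longest seq len seen so far

    def forward(self, seq_len, device):
        if self.cached_freqs is None or self.cached_freqs.shape[0] < seq_len:
            t = torch.arange(seq_len, device = device).type_as(self.inv_freq)
            self.cached_freqs = torch.outer(t, self.inv_freq)
        # slice the cache for any call with an equal or shorter sequence length
        return self.cached_freqs[:seq_len]
```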

lucidrains added a commit that referenced this issue Dec 14, 2023
@lucidrains
Owner

@GarrettMerz want to give 0.5.0 a try and see if it still hangs?

@GarrettMerz
Author

GarrettMerz commented Jan 3, 2024

Updating this with results: this largely seems to fix things. I still see hanging behavior in cases where the max output length is large and the model does not produce an EOS token before hitting that max length, i.e. if it gets stuck outputting something nonsensical like "++++++++++...", which can happen in early epochs. The length of that bad output is then cached, which causes RoPE to slow down a lot. Capping the max output length at a reasonable size generally seems to mitigate this, which is a good enough fix for now.
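For context, a generic sketch of that mitigation (a plain greedy decode loop with a capped max length; the model interface, token ids, and max_len here are assumptions for illustration, not from this repo):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id, eos_id, max_len = 256):
    # cap max_len so a degenerate output ("++++...") cannot grow the cached
    # rotary frequencies to an unreasonably large sequence length
    out = torch.full((src.shape[0], 1), bos_id, dtype = torch.long, device = src.device)
    for _ in range(max_len - 1):
        logits = model(src, out)                       # assumed (src, tgt) -> logits interface
        next_tok = logits[:, -1].argmax(dim = -1, keepdim = True)
        out = torch.cat((out, next_tok), dim = -1)
        if (next_tok == eos_id).all():                 # stop early once every sequence emits EOS
            break
    return out
```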

@lucidrains
Owner

@GarrettMerz sounds good, as long as it does not hang anymore

best with your research and life out in the midwest

@GarrettMerz
Author

May need to reopen this, it seems that things are still hanging! I'm going to investigate more to figure out when specifically it happens; I'll try using the relative position encoding in the encoder only (not the decoder) and see if that helps at all.

@lucidrains
Owner

hmm, yea, i'll wait for more info from your end

you are the only one reporting this

@lucidrains
Owner

@GarrettMerz could you try turning off the cache altogether? https://github.com/lucidrains/rotary-embedding-torch/blob/main/rotary_embedding_torch/rotary_embedding_torch.py#L82 just to confirm that it is indeed caused by the freqs caching and not something on your end?

@jozhang97

Hi, I was also encountering this issue with the latest commit (v0.5.3). On my end, I can confirm that this is caused by caching. Setting cache_if_possible=False worked for me.
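For anyone else hitting this, disabling the cache looks roughly like the following (the dim value is just an example):

```python
from rotary_embedding_torch import RotaryEmbedding

# turn off freqs caching entirely, to rule it out or work around the hang
rotary_emb = RotaryEmbedding(dim = 32, cache_if_possible = False)
```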

@GarrettMerz
Author

I still seem to be encountering this issue in some cases; I'm investigating more to confirm it's not an implementation problem on my end.

@lunixbochs

I'm seeing consistent hangs with 8-GPU accelerate-based DDP.
It hangs on NCCL comms after the first step if I'm using 3+ GPUs.
My sequence lengths differ both per step and per accelerator.
There's no issue if the cache is disabled, with 2 GPUs, or if I use a fixed sequence length.
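For what it's worth, a generic sketch of the fixed-sequence-length workaround (a plain PyTorch collate function; MAX_LEN and PAD_ID are placeholders, nothing from this repo):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 512   # assumed fixed cap, so every rank sees identical shapes
PAD_ID = 0      # assumed padding token id

def collate_fixed_len(batch):
    # pad (and truncate) every sample to the same fixed length so that any
    # cached rotary buffers end up with identical shapes on all DDP ranks
    seqs = [torch.as_tensor(x[:MAX_LEN]) for x in batch]
    padded = pad_sequence(seqs, batch_first = True, padding_value = PAD_ID)
    if padded.shape[1] < MAX_LEN:
        fill = torch.full((padded.shape[0], MAX_LEN - padded.shape[1]), PAD_ID, dtype = padded.dtype)
        padded = torch.cat((padded, fill), dim = 1)
    return padded
```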

@lucidrains
Owner

@lunixbochs I see! thank you for this info

I'll try standardizing the cache to the same tensor shape across devices and ping you to give it a try when it's done

lucidrains reopened this Sep 6, 2024
@lunixbochs

the rope cache in fairseq seems to work fine: https://github.com/facebookresearch/fairseq/blob/920a548ca770fb1a951f7f4289b4d3a0c1bc226f/fairseq/modules/rotary_positional_embedding.py#L28

@lucidrains
Owner

@lunixbochs sounds good, I'll take a look only if I can't figure it out

lucidrains added a commit that referenced this issue Sep 6, 2024
@lucidrains
Owner

lucidrains commented Sep 6, 2024

@lunixbochs want to try 0.8.2 and see if it resolves your issue?

lucidrains added a commit that referenced this issue Sep 7, 2024
@lunixbochs

just updated, not hanging for me anymore. thanks!

@lucidrains
Owner

if you see anything wrong with the loss curve, let me know; i made some risky changes
