discuss whether it worked or didn't work #1
- https://crumbly.medium.com/small-scale-home-evaluation-of-googles-new-optimizer-lion-77115ba8a1a4
- This doesn't look great: https://github.com/sinpcw/showcase-optimizer
- Negative result from RL: https://twitter.com/kaixhin/status/1626772629796564992
- Slightly worse than Adam: https://twitter.com/kyo_takano/status/1627147339143200768
- Better than Adam at extreme batch sizes (16k) with OpenCLIP: mlfoundations/open_clip#432
- I'm currently investigating Lion for an object detection problem I'm working on. Unfortunately, I'm running into a lot of problems finding a good learning rate that works throughout the whole run. For whatever reason, it seems prone to NaNing with our model/dataset when scaled up to what we used for training production models. I'm currently in the process of confirming whether some other changes are the cause, and I'll post updates here. For reference, here's the model and a brief description of the dataset (which is proprietary):
- Positive result from someone really prominent in the text-to-image field: large batch size (1024), learning rate divided by 10, weight decay kept the same.
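  A minimal sketch of that "learning rate / 10, same weight decay" swap, assuming the Lion implementation from this repo (`lion_pytorch`); the model and the baseline AdamW values below are placeholders, not the commenter's actual setup:

  ```python
  import torch
  from lion_pytorch import Lion  # assuming this repo's package

  model = torch.nn.Linear(512, 512)  # stand-in for the real model

  # AdamW baseline with illustrative hyperparameters
  adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.999), weight_decay=0.1)

  # Lion: divide the AdamW learning rate by 10, keep the weight decay unchanged
  lion = Lion(model.parameters(), lr=3e-5, betas=(0.9, 0.99), weight_decay=0.1)
  ```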
- Tried Lion with a proprietary OCR model, roughly similar to PP-OCRv3 (https://arxiv.org/abs/2206.03001). Compared to AdamW, I got much slower convergence to a lower accuracy with learning rate / 3 and weight decay * 3. With the same learning rate and weight decay I get NaNs. Batch size 2048.
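  For context on why the reports above scale the learning rate down (and, in the OCR case, the weight decay up), here is a minimal sketch of a single Lion update step in plain PyTorch, written from the paper's description rather than this repo's code: the sign() gives every coordinate a step of magnitude lr, so the effective step is larger than AdamW's and lr needs to shrink, while the decoupled weight decay is applied as lr * wd, so wd is typically raised to compensate.

  ```python
  import torch

  @torch.no_grad()
  def lion_step(p, grad, exp_avg, lr=1e-4, betas=(0.9, 0.99), weight_decay=0.0):
      """One Lion update for a single parameter tensor (illustrative sketch)."""
      beta1, beta2 = betas
      # decoupled weight decay, applied with strength lr * weight_decay (as in AdamW)
      p.mul_(1 - lr * weight_decay)
      # interpolate momentum and current gradient, then keep only the sign
      update = exp_avg.mul(beta1).add(grad, alpha=1 - beta1).sign_()
      p.add_(update, alpha=-lr)
      # momentum update uses a second, larger beta
      exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)
  ```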
- Negative result for LM: https://twitter.com/VHellendoorn/status/1630737104975085568. Edit: now a positive result! https://twitter.com/VHellendoorn/status/1631349009473478656?s=20
- Positive result, 3x faster vision transformer fine-tuning: https://twitter.com/Haoxiang__Wang/status/1631355469439590412?s=20
- Yet another positive result for training an LLM from a really good researcher; however, he also told me fine-tuning was not as good (albeit the base model was trained with Adam, not sure if that makes any difference).
- Saw this, could be relevant to Lion: https://openreview.net/pdf?id=a65YK0cqH8g
- Hello, I used Megatron to train GPT-2 with 16B parameters, but when the learning rate increased to 3e-6, gradient overflow led to a failure to converge. May I ask why? My hyperparameters:
- Seeing much better results in my GPT-2 distillation benchmark. Results: https://huggingface.co/lapp0/distily_bench_gpt2_optim_extended2/tensorboard
- Share positive or negative results here. Please leave the batch size and other hyperparameters, and also note the learning rate scheduler, if one was used.
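  For anyone posting results, here is a hypothetical example of the kind of setup worth reporting (all values are placeholders, and the Lion import assumes this repo's package):

  ```python
  import torch
  from lion_pytorch import Lion  # assuming this repo's package

  model = torch.nn.Linear(512, 512)   # stand-in model
  batch_size = 1024                   # report the batch size
  optimizer = Lion(model.parameters(), lr=1e-4, betas=(0.9, 0.99), weight_decay=0.1)
  # report the scheduler as well, e.g. cosine decay over the planned number of steps
  scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000)
  ```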