Enable distributed LoRA training #821

angeloskath · 2024-06-06T15:04:25Z

The updates to LORA.md are still missing, but TL;DR: we can now run

$ echo "m2-ultra-0 slots=1" >>hostfile
$ echo "m2-ultra-1 slots=1" >>hostfile
$ mpirun --hostfile hostfile -- python -m mlx_lm.lora --train --model mlx-community/Mistral-7B-v0.2-4bit --data /path/to/data --batch-size 16

to train across two nodes (or more; nothing else needs to change).
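
As a rough illustration of what data-parallel training across those processes involves, here is a toy sketch in plain Python. It does not use MLX, and `shard_batch`, `all_sum`, and `local_gradient` are hypothetical stand-ins rather than functions from this PR. It assumes the global batch is split evenly among the workers and that each worker's gradients are combined with an all-reduce sum and divided by the worker count; the actual trainer may differ in the details.

```python
def shard_batch(batch, rank, world_size):
    """Return the part of a global batch that one worker would process."""
    if len(batch) % world_size != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    # Strided split: worker r takes items r, r + world_size, ...
    return batch[rank::world_size]


def all_sum(values):
    """Stand-in for an all-reduce sum: every worker receives the total."""
    total = sum(values)
    return [total for _ in values]


def local_gradient(shard):
    """Toy gradient: the mean of the shard (placeholder for a real backward pass)."""
    return sum(shard) / len(shard)


world_size = 2                  # two hosts with slots=1 each, as in the hostfile
global_batch = list(range(16))  # stands in for --batch-size 16

shards = [shard_batch(global_batch, r, world_size) for r in range(world_size)]
local_grads = [local_gradient(s) for s in shards]

# Every worker ends up with the same averaged gradient.
averaged = [g / world_size for g in all_sum(local_grads)]
print(averaged)
```

With equal-sized shards, the averaged value matches what a single worker would compute over the whole batch, which is why adding nodes needs no other changes in this toy setup.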

mzbac · 2024-07-07T06:44:49Z

Is it possible to do distributed inference as well?

awni · 2024-07-07T13:26:02Z

> Is it possible to do distributed inference as well?

Possible yes, but getting a nice speedup is more challenging. That's something we're looking at, but don't have an ETA on right now.

angeloskath · 2024-10-31T22:53:45Z

@awni feel free to review and then we can merge. I split the launcher to a different branch.

awni · 2024-11-01T17:33:08Z

llms/tests/test_finetune.py

-            loss=mock_default_loss,
-            iterate_batches=mock_iterate_batches,
-        )
+        with swapped_with_identity(mx.distributed, "all_sum"):


Just curious, why do we need this for the test to work?

Nvm I see the magicmock thing is messing up the all_sum.
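
For readers who have not seen the test helper, here is a minimal sketch of how a context manager like `swapped_with_identity` could be written. This is an assumption about its behavior based on the name and the call in the diff, not the implementation from this PR. `fake_distributed` is a hypothetical stand-in for `mx.distributed` so the example runs without MLX.

```python
from contextlib import contextmanager
from types import SimpleNamespace
from unittest.mock import MagicMock


@contextmanager
def swapped_with_identity(obj, func_name):
    """Temporarily replace obj.<func_name> with a function returning its first argument."""
    original = getattr(obj, func_name)
    setattr(obj, func_name, lambda x, *args, **kwargs: x)
    try:
        yield
    finally:
        # Restore the original attribute even if the body raises.
        setattr(obj, func_name, original)


# Hypothetical stand-in for mx.distributed; the lambda is a dummy reduction.
fake_distributed = SimpleNamespace(all_sum=lambda x: x * 2)
mock_loss = MagicMock()

with swapped_with_identity(fake_distributed, "all_sum"):
    # Inside the block, the mocked value passes through untouched.
    inside = fake_distributed.all_sum(mock_loss)

# Outside the block, the original function is back in place.
after = fake_distributed.all_sum(21)
```

The point of the swap is that mocked losses are not real arrays, so a test that mocks the loss can bypass the reduction and still exercise the rest of the training loop.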

awni · 2024-11-01T17:34:31Z

llms/mlx_lm/tuner/trainer.py

+                    f"Val loss {val_loss:.3f}, "
+                    f"Val took {val_time:.3f}s",
+                    flush=True,
+                )

            if training_callback is not None:


Probably that should go under the rank==0 condition as well

Yeah ok, it makes sense. I was thinking in general callbacks should always run and if the callback is about reporting then it can choose to only run on rank=0. But our callbacks here are only about reporting so it makes sense to just run them only on node 0.

It's a good point actually. It's more flexible that way. I'm on board leaving it to the user to specify the rank.
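
To make the "leave it to the user" option concrete, here is a small sketch of a reporting callback that decides for itself which rank prints. `RankZeroReporter` is a hypothetical class, and the method names and the `train_loss` / `val_loss` keys are assumptions chosen to resemble a reporting callback; they are not taken from this diff. The rank is passed in explicitly so the example runs without MLX; in a real run it would presumably come from the distributed group.

```python
class RankZeroReporter:
    """Reporting callback that stays silent on every rank except one."""

    def __init__(self, rank, report_rank=0):
        self.rank = rank                # rank of the current process (assumed given)
        self.report_rank = report_rank  # the only rank allowed to report
        self.messages = []

    def _report(self, message):
        # The trainer calls the callback on every rank; the callback filters.
        if self.rank == self.report_rank:
            self.messages.append(message)
            print(message, flush=True)

    def on_train_loss_report(self, train_info):
        self._report(f"Train loss {train_info['train_loss']:.3f}")

    def on_val_loss_report(self, val_info):
        self._report(f"Val loss {val_info['val_loss']:.3f}")


# Simulate two workers receiving the same report.
reporters = [RankZeroReporter(rank=r) for r in range(2)]
for reporter in reporters:
    reporter.on_val_loss_report({"val_loss": 1.23456})
```

Running callbacks on all ranks keeps non-reporting uses possible, for example per-node checkpointing, at the cost of each reporting callback needing a rank check like the one above.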

awni · 2024-11-01T17:35:05Z

llms/mlx_lm/tuner/trainer.py

+                    f"Trained Tokens {trained_tokens}, "
+                    f"Peak mem {peak_mem:.3f} GB",
+                    flush=True,
+                )

            if training_callback is not None:


Same there.

awni

Looks great!! Let's 🚢

ivanfioravanti · 2024-11-17T12:46:29Z

This works perfectly! Great job 👏

angeloskath force-pushed the distributed-lora branch from 7685977 to 0466b05 Compare July 9, 2024 00:55

yeahdongcn added a commit to yeahdongcn/mlx-examples that referenced this pull request Aug 6, 2024

https://github.com/ml-explore/mlx-examples/pull/821

e3de823

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

angeloskath force-pushed the distributed-lora branch from 0466b05 to bbdf210 Compare September 12, 2024 06:42

angeloskath mentioned this pull request Sep 12, 2024

Data parallel helper ml-explore/mlx#1407

Merged

angeloskath requested a review from awni September 12, 2024 23:06

angeloskath force-pushed the distributed-lora branch from 144e8a0 to 5050765 Compare October 24, 2024 08:35

angeloskath added 4 commits October 31, 2024 15:41

Add distributed option for lora training

4786b4e

Use concatenated all reduce and gather stats

e0f18d1

Flush the messages

b0a42d0

Remove tree_map import

ece20f1

angeloskath force-pushed the distributed-lora branch from 5050765 to ece20f1 Compare October 31, 2024 22:52

Fix the test

c5e09a1

awni reviewed Nov 1, 2024

awni approved these changes Nov 1, 2024

angeloskath force-pushed the distributed-lora branch from 3cd80c4 to c5e09a1 Compare November 3, 2024 01:01

angeloskath merged commit 331148d into main Nov 3, 2024
2 checks passed

angeloskath deleted the distributed-lora branch November 3, 2024 01:02
