
Refactor func load_model to class ModelLoader #1909

Merged: 11 commits merged into axolotl-ai-cloud:main on Oct 25, 2024

Conversation

@MengqingCao (Contributor) commented Sep 12, 2024

Description

part of #1758

This PR refactors the func load_model in src/axolotl/utils/models.py into a class ModelLoader. The member functions of ModelLoader are separated according to their features, and all member vars of ModelLoader are shared across these funcs. Moreover, this refactoring makes the model-loading pipeline clearer.

TODO:

  • add UT for ModelLoader

The main changes are listed here:

  • organize common vars into member vars of class ModelLoader
  • split operations in load_model into separate member funcs
  • refactor cfg.load_in_Xbit to kwarg

The UML of ModelLoader:

[UML diagram of ModelLoader]
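As a rough sketch of the structure described above (the method and attribute names here are illustrative assumptions, not the exact code in this PR):

# Illustrative sketch only: names are assumptions based on the PR description,
# not the actual implementation in src/axolotl/utils/models.py.
class ModelLoader:
    def __init__(self, cfg, tokenizer, *, inference=False):
        # common vars that load_model used to pass around become member vars
        self.cfg = cfg
        self.tokenizer = tokenizer
        self.inference = inference
        self.model = None
        self.model_kwargs = {}  # quantization flags etc. are collected here

    def set_quantization_config(self):
        # cfg.load_in_Xbit is refactored into kwargs for from_pretrained
        if self.cfg.load_in_8bit:
            self.model_kwargs["load_in_8bit"] = True
        elif self.cfg.load_in_4bit:
            self.model_kwargs["load_in_4bit"] = True

    def build_model(self):
        # the actual from_pretrained call, split out of the old monolithic func
        ...

    def load(self):
        # the old load_model body becomes a short, readable pipeline
        self.set_quantization_config()
        self.build_model()
        return self.model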

Motivation and Context

Why is this change required?

As the models loaded in Axolotl support more and more features, the func load_model has become huge. This results in confusion about variable changes when abstracting parts of load_model (#1758 (review)). Refactoring load_model will improve the code structure and facilitate stable evolution as more features are introduced in the future.

How has this been tested?

  1. part of the UTs for ModelLoader has been added and the tests pass
  2. I tested fine-tuning and inference (both on the terminal and the gradio webui) with the open_llama_3b_v2 model; here is a screenshot of inference:

[screenshot: GPU inference output]

However, I don't have access to an Ampere or newer GPU, so I cannot run the full UT suite on my local machine. It would be nice if all UTs could be run on CI.

@winglian (Collaborator) commented

@MengqingCao this is on our list to tackle this week to get merged in. We'll need to get this rebased.

@MengqingCao (Contributor Author) commented Oct 14, 2024 via email

@MengqingCao (Contributor Author) commented

@winglian sorry for the slight delay. The rebase is done now, please review it.

BTW, I basically copied the original code to make the review a little easier.
In the future, more if-else branches could be refactored step by step, which will require more models and tests under different configuration conditions.

@MengqingCao (Contributor Author) commented Oct 17, 2024 via email

@MengqingCao (Contributor Author) commented

Hi @winglian, the code has been updated, please retrigger the CI, thanks!

@NanoCode012 self-assigned this Oct 18, 2024
@MengqingCao (Contributor Author) commented Oct 18, 2024

I'm confused why the tests fail on tests/test_prompt_tokenizers.py and tests/test_validation.py, because everything runs fine on my machine. Could you give me some advice? @NanoCode012


@NanoCode012 (Collaborator) commented

@MengqingCao, from the tests, it may be erroring due to the following:

ModuleNotFoundError: No module named 'flash_attn'

Let me see what should be done in a bit.

Comment on lines 116 to 140
@pytest.mark.parametrize("embedding_modules", ["embed_tokens", "lm_head"])
@pytest.mark.parametrize(
    "dist_dtype", [torch.bfloat16, torch.float16, torch.float32]
)
@pytest.mark.parametrize("before_kbit_train_or_finetune", [True, False])
def test_convert_embedding_modules_dtype(
    self, embedding_modules, dist_dtype, before_kbit_train_or_finetune
):
    tokenizer = load_tokenizer(self.cfg)
    self.model_loader.model, _ = load_model(self.cfg, tokenizer, inference=False)

    self.model_loader.convert_embedding_modules_dtype(
        embedding_modules, dist_dtype, before_kbit_train_or_finetune
    )
    for name, module in self.model_loader.model.named_modules():
        if (
            "norm" in name
            or (before_kbit_train_or_finetune and name.endswith(".gate"))
            or (
                any(m in name for m in embedding_modules)
                and hasattr(module, "weight")
            )
        ):
            for _, param in module.named_parameters(recurse=False):
                assert param.dtype == dist_dtype
Collaborator

let's move this one to its own e2e/ test that runs on a GPU instance. I believe it's OOMing

Collaborator

or let's use a config fixture that uses a much smaller model like a 68M parameter model
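A possible shape for such a fixture (the model name and config keys below are illustrative assumptions; any tiny causal LM would do):

import pytest

from axolotl.utils.dict import DictDefault


@pytest.fixture
def small_model_cfg():
    # Hypothetical fixture: a ~68M parameter model such as "JackFram/llama-68m"
    # keeps the test small enough to run without an Ampere-class GPU.
    return DictDefault(
        {
            "base_model": "JackFram/llama-68m",
            "tokenizer_type": "LlamaTokenizer",
            "sequence_len": 256,
            "load_in_8bit": False,
        }
    )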

Contributor Author

@winglian @NanoCode012 Thanks for your help. I have moved it to e2e now. Please retrigger the CI to check if this fix works.

@MengqingCao (Contributor Author) commented

@winglian I spent some time fixing the failed UTs and found that load_cfg breaks the caplog, which causes these UTs to fail. The latest UTs just use DictDefault to create cfg to fix it. Could you please retrigger the CI again to verify the current code?
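Roughly, the fix amounts to building the test config directly instead of going through load_cfg (the keys and values below are illustrative):

from axolotl.utils.dict import DictDefault

# Instead of cfg = load_cfg("path/to/config.yml"), which the author found
# breaks pytest's caplog (possibly via check_remote_config reaching the Hub),
# the UTs now build the config dict directly:
cfg = DictDefault(
    {
        "base_model": "openlm-research/open_llama_3b_v2",  # example value
        "sequence_len": 1024,
        "load_in_8bit": False,
    }
)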

@NanoCode012 (Collaborator) commented

> @winglian I spent some time fixing the failed UTs and found that load_cfg breaks the caplog, which causes these UTs to fail. The latest UTs just use DictDefault to create cfg to fix it. Could you please retrigger the CI again to verify the current code?

That's a nice catch. I was debugging it yesterday and couldn't figure out exactly why it failed when all the tests are run together. I suspected caplog, but when I tried using capsys, it failed too.

I re-triggered the CI, and they are passing so far.

@MengqingCao (Contributor Author) commented

> That's a nice catch. I was debugging it yesterday and couldn't figure out exactly why it failed when all the tests are run together. I suspected caplog, but when I tried using capsys, it failed too.

The cause is well hidden, and it fails from the moment load_cfg is imported. I guess the failure may be caused by accessing resources on the Hub when calling check_remote_config, but unfortunately I can't be sure.

@MengqingCao (Contributor Author) commented Oct 22, 2024 via email

@MengqingCao (Contributor Author) commented

@winglian @NanoCode012 Thanks a lot for your work! All UTs in test_load_model.py pass now.

Since quantized parameters cannot be converted to another dtype simply via .to(dist_dtype), and this has nothing to do with the correctness of convert_embedding_modules_dtype, I turned off load_in_Xbit to make the UT pass.

I should have used a small model from the beginning so that I could test it locally instead of hitting OOM...
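For reference, a test config along these lines would run the dtype-conversion test without quantized loading (the model name and keys are illustrative assumptions):

from axolotl.utils.dict import DictDefault

# Quantized (e.g. bitsandbytes) parameters cannot simply be cast with .to(dtype),
# so the embedding-dtype test loads the model unquantized.
cfg = DictDefault(
    {
        "base_model": "JackFram/llama-68m",  # small model, per review suggestion
        "load_in_8bit": False,
        "load_in_4bit": False,
    }
)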

@winglian merged commit 1d6a5e2 into axolotl-ai-cloud:main on Oct 25, 2024
12 checks passed