Move FP8 TE export logic to mcore.export #11409
base: main
Conversation
Signed-off-by: Piotr Kaminski <pikaminski@nvidia.com>
Signed-off-by: Piotr Kaminski <piotrus.kaminski@gmail.com>
Signed-off-by: Laplasjan107 <Laplasjan107@users.noreply.github.com>
Signed-off-by: Piotr Kamiński <67481570+Laplasjan107@users.noreply.github.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

ce002a8 to 71292d4 (Compare)
nemo/export/tensorrt_llm.py (outdated)

@@ -83,6 +83,17 @@ def wrapper(*args, **kwargs):
    use_pytriton = False


def determine_quantization_settings(
Could you add docstrings for all new / edited functions?
(same remark for typing)
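For illustration only, a possible docstring-and-typing sketch for the new helper. The signature is inferred from the diff above; the actual parameters, defaults, and inference logic in the PR may differ:

```python
from typing import Any, Dict, Optional, Tuple


def determine_quantization_settings(
    nemo_model_config: Dict[str, Any],
    fp8_quantized: Optional[bool] = None,
    fp8_kvcache: Optional[bool] = None,
) -> Tuple[bool, bool]:
    """Resolve FP8 weight and KV-cache quantization settings for TRT-LLM export.

    Explicit arguments take precedence; when an argument is None, the setting
    is inferred from the NeMo model configuration stored in the checkpoint.

    Args:
        nemo_model_config: model configuration dictionary from the checkpoint.
        fp8_quantized: user override for FP8 weight quantization, or None to infer.
        fp8_kvcache: user override for FP8 KV-cache quantization, or None to infer.

    Returns:
        A (fp8_quantized, fp8_kvcache) tuple of resolved boolean flags.
    """
    is_fp8_model = bool(nemo_model_config.get("fp8", False))
    fp8_quantized = is_fp8_model if fp8_quantized is None else fp8_quantized
    fp8_kvcache = is_fp8_model if fp8_kvcache is None else fp8_kvcache
    return fp8_quantized, fp8_kvcache
```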
nemo/export/tensorrt_llm.py (outdated)

@@ -353,17 +366,38 @@ def export(
    from megatron.core.export.trtllm.trtllm_helper import TRTLLMHelper
    from tensorrt_llm.layers import MoeConfig

    use_embedding_sharing = model_configs.get("share_embeddings_and_output_weights", False)
    fp8_quantized, fp8_kvcache = determine_quantization_settings(
        model_configs, fp8_quantized, fp8_kvcache
It's `model_configs` (plural) that is passed here, but only one `nemo_model_config` is consumed in `determine_quantization_settings` - could we unify the naming?
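A minimal way to unify the naming at the call site (a sketch only; the variable names are assumptions, not the PR's final choice):

```python
# Use the same name as the helper's parameter so the data flow is obvious.
nemo_model_config = model_configs  # or rename `model_configs` where it is first built
fp8_quantized, fp8_kvcache = determine_quantization_settings(
    nemo_model_config, fp8_quantized, fp8_kvcache
)
```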
@@ -524,7 +559,7 @@ def get_transformer_config(self, nemo_model_config):
     ffn_hidden_size=nemo_model_config.get('ffn_hidden_size'),
     layernorm_epsilon=nemo_model_config.get('layernorm_epsilon'),
     add_bias_linear=nemo_model_config.get('bias'),
-    num_moe_experts=nemo_model_config.get('num_moe_experts', None),
+    num_moe_experts=num_moe_experts if num_moe_experts > 0 else None,
     normalization=transformer_config_normalization,
     layernorm_zero_centered_gamma=layernorm_zero_centered_gamma,
In #11614 there is also `gated_linear_unit=nemo_model_config.get('gated_linear_unit'),` added.

- Can we add it here? So that we merge only this MR and offload CI slightly.
- Can we write `nemo_model_config.get('gated_linear_unit', False)` to conform with the typing? There could be surprises otherwise, see `None` vs `0` for `num_moe_experts`. A sketch follows below.
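For concreteness, a small self-contained sketch of the two config fields discussed here; `resolve_moe_and_glu_settings` is a hypothetical helper used only for illustration, not code from this PR or #11614:

```python
from typing import Any, Dict


def resolve_moe_and_glu_settings(nemo_model_config: Dict[str, Any]) -> Dict[str, Any]:
    """Resolve the two TransformerConfig fields discussed above.

    num_moe_experts: TransformerConfig expects None (not 0) when MoE is disabled.
    gated_linear_unit: default to False so the field is always a bool.
    """
    num_moe_experts: int = nemo_model_config.get('num_moe_experts') or 0
    return {
        'num_moe_experts': num_moe_experts if num_moe_experts > 0 else None,
        'gated_linear_unit': nemo_model_config.get('gated_linear_unit', False),
    }


if __name__ == '__main__':
    # 0 experts and a missing gated_linear_unit key collapse to the typed defaults.
    print(resolve_moe_and_glu_settings({'num_moe_experts': 0}))
    # -> {'num_moe_experts': None, 'gated_linear_unit': False}
```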
Please fill in the MR Usage section with details.
nemo/export/tensorrt_llm.py (outdated)

@@ -353,17 +366,38 @@ def export(
    from megatron.core.export.trtllm.trtllm_helper import TRTLLMHelper
    from tensorrt_llm.layers import MoeConfig

    use_embedding_sharing = model_configs.get("share_embeddings_and_output_weights", False)
This overrides the parameter `use_embedding_sharing` passed to `TensorRTLLM.export(...)`. Should it be removed from the signature?
Also: how about keeping the original name `share_embeddings_and_output_weights`?
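One possible way to keep the parameter without silently overriding it, shown as a hypothetical helper (the Optional default and the helper name are assumptions for illustration, not the PR's resolution):

```python
from typing import Any, Dict, Optional


def resolve_embedding_sharing(
    model_configs: Dict[str, Any],
    use_embedding_sharing: Optional[bool] = None,
) -> bool:
    """Prefer an explicit export(...) argument; otherwise read the checkpoint config."""
    if use_embedding_sharing is not None:
        return use_embedding_sharing
    return bool(model_configs.get("share_embeddings_and_output_weights", False))
```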
bytes_list = [state_dict[keyname][0] for keyname in keynames]
return load_scales_from_bytes(bytes_list)

decomposed_sharded_key = key.split('/')
if not len(decomposed_sharded_key):
Always `len(decomposed_sharded_key) >= 1`, so this `if` is inactive?
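A quick plain-Python check of this observation (not code from the PR): `str.split` always returns at least one element, so the guard can never fire:

```python
# Even the empty string splits into a one-element list.
assert ''.split('/') == ['']
assert 'decoder/layers/0/weight'.split('/') == ['decoder', 'layers', '0', 'weight']

decomposed_sharded_key = ''.split('/')
# len(...) is always >= 1, so `if not len(decomposed_sharded_key):` never triggers.
assert len(decomposed_sharded_key) >= 1
```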
Signed-off-by: Piotr Kamiński <67481570+Laplasjan107@users.noreply.github.com>
Signed-off-by: Piotr Kaminski <piotrus.kaminski@gmail.com>
Signed-off-by: Laplasjan107 <Laplasjan107@users.noreply.github.com>
Signed-off-by: Piotr Kamiński <67481570+Laplasjan107@users.noreply.github.com>
beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base. Your code was analyzed with PyLint. The following annotations have been identified:

Mitigation guide:

By applying these rules, we reduce the occurrence of this message in the future. Thank you for improving NeMo's documentation!
What does this PR do?
Support FP8 TE TRT-LLM export via the mcore.export path.
Collection: llm/nlp
Changelog
Usage
# Add a code snippet demonstrating how to use this
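A hypothetical usage sketch for this section, based on the `fp8_quantized` / `fp8_kvcache` parameters discussed in the review; the paths, model type, and remaining argument names are assumptions rather than the final API:

```python
from nemo.export.tensorrt_llm import TensorRTLLM

# Export a NeMo checkpoint trained with FP8 (Transformer Engine) to a TRT-LLM engine.
exporter = TensorRTLLM(model_dir="/tmp/trtllm_engine")
exporter.export(
    nemo_checkpoint_path="/checkpoints/llama_fp8.nemo",
    model_type="llama",
    tensor_parallelism_size=1,
    fp8_quantized=True,  # export FP8-quantized weights
    fp8_kvcache=True,    # keep the KV cache in FP8 as well
)

# Run a quick sanity-check generation with the freshly built engine.
output = exporter.forward(["Hello, my name is"], max_output_len=32)
print(output)
```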
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.
Additional Information