
Split token_embs and lm_head weights #2252

Merged: 20 commits merged into InternLM:main on Sep 5, 2024

Conversation

irexyc (Collaborator) commented Aug 7, 2024

Thanks for your contribution, and we appreciate it a lot. The following instructions will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry; just open the pull request and seek help from the maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

        0,  // only used for position encoding
        token_num,
        token_num,
        1,
        hidden_units_,
        stream_);
    if (tensor_para_.world_size_ > 1) {
        NcclGuard nccl_guard(tensor_para_, stream_);
        ftNcclAllReduceSum(decoder_input, decoder_input, token_num * hidden_units_, tensor_para_, stream_);

Collaborator:

Why use allreduce? Shouldn't we use allgather?

irexyc (Collaborator, Author):

If we used allgather, the weight and buffer layouts would have to change, and the allgather would need to be followed by an extra permute op.

Another thing that confuses me: in my tests, the reduce-sum (allreduce) is faster than allgather.
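
To make the layout point concrete, here is a tiny CPU-only sketch (not turbomind code; the ranks are simulated and all names and sizes are made up) of an embedding table split along hidden_units: each rank fills only its slice of a zero-initialized full-size [token, hidden] buffer, so an element-wise sum across ranks (what allreduce does) already produces the final layout, while an allgather would return rank-contiguous chunks that still need a permute.

```cpp
// CPU-only sketch of assembling hidden-dim-split embedding outputs.
// Assumption: the table is split along hidden_units across tp ranks.
#include <cstdio>
#include <vector>

int main() {
    const int tp = 2, token_num = 3, hidden = 8, slice = hidden / tp;

    // Each simulated rank writes its slice into the right strided position
    // of a zero-initialized full-size [token, hidden] buffer.
    std::vector<std::vector<float>> rank_out(tp, std::vector<float>(token_num * hidden, 0.f));
    for (int r = 0; r < tp; ++r)
        for (int t = 0; t < token_num; ++t)
            for (int h = 0; h < slice; ++h)
                rank_out[r][t * hidden + r * slice + h] = float(r * slice + h);  // fake embedding value

    // Element-wise sum across ranks (the allreduce case): the result is already
    // in [token, hidden] layout, no extra permute needed.
    std::vector<float> reduced(token_num * hidden, 0.f);
    for (int r = 0; r < tp; ++r)
        for (size_t i = 0; i < reduced.size(); ++i)
            reduced[i] += rank_out[r][i];

    // An allgather over compact [token, slice] buffers would instead produce
    // [rank0 chunk][rank1 chunk]... and require a permute back to [token, hidden].
    printf("token 0 embedding:");
    for (int h = 0; h < hidden; ++h) printf(" %.0f", reduced[h]);
    printf("\n");
    return 0;
}
```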

Collaborator:

@lzhangzz what's the permute op in turbomind?
In my opinion, it is more reasonable to use allgather than allreduce here.
@lzhangzz do you have any idea why "the reduce sum is faster than allgather"?

lvhan028 requested a review from lzhangzz, August 9, 2024 06:49

lzhangzz commented Aug 13, 2024

We need to benchmark the ar/ag case on different systems (NVLink/PCIe) first.

https://github.com/NVIDIA/nccl-tests


irexyc commented Aug 14, 2024

./build/all_reduce_perf -g 4 -b 64M -e 64M -n 100 -N 10
# Avg bus bandwidth    : 191.6
./build/all_gather_perf -g 4 -b 64M -e 64M -n 100 -N 10
# Avg bus bandwidth    : 163.346

lzhangzz (Collaborator):

@irexyc bus bandwidth of all-reduce and all-gather is computed differently.
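
For context, nccl-tests reports "bus bandwidth" by scaling algorithmic bandwidth with a collective-specific factor (per its performance notes, roughly 2(n-1)/n for all-reduce and (n-1)/n for all-gather), so the two busBw figures above are weighted differently. Below is a small sketch of the back-conversion, assuming those factors and the -g 4 run above; the two tests also define the reported message size differently, so even this is only a rough comparison.

```cpp
// Sketch: converting nccl-tests "bus bandwidth" back to algorithmic bandwidth.
// The scaling factors are taken from nccl-tests' performance notes and are an
// assumption here, not something measured in this PR.
#include <cstdio>

int main() {
    const double n = 4.0;                    // GPUs (-g 4)
    const double busbw_allreduce = 191.6;    // GB/s, all_reduce_perf result above
    const double busbw_allgather = 163.346;  // GB/s, all_gather_perf result above

    // busBw = algBw * factor  =>  algBw = busBw / factor
    const double algbw_allreduce = busbw_allreduce / (2.0 * (n - 1) / n);  // factor 1.5 for n=4
    const double algbw_allgather = busbw_allgather / ((n - 1) / n);        // factor 0.75 for n=4

    printf("all-reduce algBw: %.1f GB/s\n", algbw_allreduce);
    printf("all-gather algBw: %.1f GB/s\n", algbw_allgather);
    return 0;
}
```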


irexyc commented Aug 26, 2024

[image attachment]

lvhan028 (Collaborator):

pr_ete_test shows the following case failed.

CUDA_VISIBLE_DEVICES=5,6 lmdeploy serve api_server /nvme/qa_test_models/autotest_model/workspace_internlm/internlm2-chat-20b --session-len 8096 --server-port 23333 --tp 2

// check converted file with tp
auto should_exist = prefix + "." + std::to_string(tensor_para_size - 1) + ".weight";
auto should_not_exist = prefix + "." + std::to_string(tensor_para_size) + ".weight";
if (!std::filesystem::exists(should_exist) || std::filesystem::exists(should_not_exist)) {
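
The excerpt above ends at the condition. For readability, here is a self-contained sketch of what the check amounts to: converted shard files are named <prefix>.<rank>.weight, so the shard for rank tensor_para_size - 1 must exist and the shard for rank tensor_para_size must not. The helper name and the error handling are hypothetical, not the PR's actual code.

```cpp
// Hypothetical helper illustrating the TP-consistency check on converted weights.
#include <filesystem>
#include <stdexcept>
#include <string>

void check_converted_tp(const std::string& prefix, int tensor_para_size)
{
    auto shard = [&](int rank) { return prefix + "." + std::to_string(rank) + ".weight"; };
    const bool last_shard_exists  = std::filesystem::exists(shard(tensor_para_size - 1));
    const bool extra_shard_exists = std::filesystem::exists(shard(tensor_para_size));
    if (!last_shard_exists || extra_shard_exists) {
        throw std::runtime_error("weights at '" + prefix + "' were not converted with tp="
                                 + std::to_string(tensor_para_size));
    }
}
```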

Collaborator:

We can put the TP check in turbomind.py.

src/turbomind/models/llama/LlamaV2.cc (review thread resolved)
embedding_table_size_ = vocab_size_padded_ * hidden_units_ / tensor_para_size_;
}
else {
embedding_table_size_ = vocab_size_padded_ * hidden_units_;

Collaborator:

Is this case expected to work correctly?

irexyc (Collaborator, Author):

I can't find any models whose hidden size is not divisible by 8.

Collaborator:

I mean we can just FT_CHECK the condition here and eliminate the else branch, since it's not going to work when hidden_units_ % tensor_para_size_ != 0.
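
A minimal sketch of that suggestion follows; FT_CHECK is stubbed with assert here, the member names simply mirror the excerpt above, and the exact call site is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in for the project's FT_CHECK macro.
#define FT_CHECK(cond) assert(cond)

struct EmbeddingSizeSketch {
    size_t vocab_size_padded_, hidden_units_, tensor_para_size_, embedding_table_size_;

    void compute()
    {
        // Replace the else branch with a hard check: the non-split fallback cannot
        // work when hidden_units_ is not divisible by tensor_para_size_.
        FT_CHECK(hidden_units_ % tensor_para_size_ == 0);
        embedding_table_size_ = vocab_size_padded_ * hidden_units_ / tensor_para_size_;
    }
};
```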

lzhangzz (Collaborator) left a comment:

LGTM

lvhan028 merged commit eab8536 into InternLM:main on Sep 5, 2024
9 checks passed
lvhan028 mentioned this pull request on Sep 6, 2024