Add GLM-4 and Later GLM Model (Draft) #31977
Conversation
Support Cache class
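For context, "Support Cache class" refers to the transformers `Cache`/`DynamicCache` abstraction that replaces raw past-key-value tuples. A minimal sketch of that API in isolation (shapes are arbitrary; how GLM wires it in is up to the PR):

import torch
from transformers import DynamicCache

# Past key/value states live in a Cache object and are updated per layer
# instead of being re-concatenated as plain tuples.
past_key_values = DynamicCache()

layer_idx = 0
key_states = torch.randn(1, 2, 5, 64)    # (batch, num_kv_heads, seq_len, head_dim)
value_states = torch.randn(1, 2, 5, 64)

# update() appends the new states and returns the full cached tensors.
key_states, value_states = past_key_values.update(key_states, value_states, layer_idx)
print(past_key_values.get_seq_length())  # 5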
Hi @zRzRzRzRzRzRzR ! Thanks for drafting the PR. The workflow has been failing due to the usage of TikToken. Once the converter script converts the tiktoken configuration to an HF tokenizer configuration, you won't need to import tiktoken during inference in tokenization_glm.py.
YEP! Will review today 🤗
Fixed this issue now~ Thanks!
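For context on the tiktoken point above: once the conversion script has written a standard HF tokenizer configuration (tokenizer.json / tokenizer_config.json), inference only needs the regular loading path. A minimal sketch, assuming a hypothetical converted checkpoint directory:

from transformers import AutoTokenizer

# The checkpoint path is hypothetical; after conversion the directory holds
# tokenizer files generated from the original tiktoken vocabulary, so no
# tiktoken import is needed at inference time.
tokenizer = AutoTokenizer.from_pretrained("path/to/converted-glm-4-checkpoint")
print(tokenizer("Hello GLM")["input_ids"])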
Fix attention mask for right padding
I am stopping the review as a LOT of the comments are still not addressed.
this file should be removed as we can map the GPT2Tokenizer directly and use it
same comment here, we can use GPT2TokenizerFast!
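A minimal sketch of what mapping to the existing fast tokenizer could look like, assuming the converted BPE files are compatible with it (the checkpoint path is hypothetical):

from transformers import GPT2TokenizerFast

# If the converted vocab/merges follow the GPT-2 BPE format, the existing
# fast tokenizer class can serve them directly, so no custom
# tokenization_glm.py is needed in the model folder.
tokenizer = GPT2TokenizerFast.from_pretrained("path/to/converted-glm-4-checkpoint")
print(tokenizer.tokenize("Hello GLM"))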
logger = logging.get_logger(__name__)


class GLMConfig(PretrainedConfig):
There is still the issue with the camel casing!
GLM is the name of our model, not Glm. Do we need to stick to camel case in this context as well?
Yes, it's the same as for LLaMA, which we set to Llama!
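A minimal sketch of the naming convention being asked for, following the LLaMA -> Llama precedent (the rename is illustrative, not the final code):

from transformers import PretrainedConfig

# Public class names use the capitalized form of the lowercased model type,
# so GLM maps to Glm* (GlmConfig, GlmModel, ...) rather than GLM*.
class GlmConfig(PretrainedConfig):
    model_type = "glm"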
This is the configuration class to store the configuration of a [`GLMModel`]. It is used to instantiate a Phi-3
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
Suggested change:
- This is the configuration class to store the configuration of a [`GLMModel`]. It is used to instantiate a Phi-3
- model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+ This is the configuration class to store the configuration of a [`GLMModel`]. It is used to instantiate a GLM
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
Fixed.
if is_flash_attn_2_available():
    from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
    from flash_attn import flash_attn_func, flash_attn_varlen_func

    _flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
again this was refactored
Fixed now.
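For reference, the refactor being pointed to centralizes the flash-attention plumbing in the library, so model files no longer import flash_attn at module level. A hedged sketch of relying on the shared helper instead (the exact import path should be verified against the transformers version being targeted):

# Sketch only: private helpers can move between releases.
from transformers.utils import is_flash_attn_2_available

if is_flash_attn_2_available():
    from transformers.modeling_flash_attention_utils import _flash_attention_forward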
def _get_unpad_data(attention_mask):
    seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    max_seqlen_in_batch = seqlens_in_batch.max().item()
    cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
    return (
        indices,
        cu_seqlens,
        max_seqlen_in_batch,
    )
same comment here
Fixed.
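"Same comment" presumably means this unpadding helper was centralized as well, so the model file does not need its own copy. A hedged sketch (again, verify the private import against the targeted release):

import torch

# Equivalent helper shipped next to the shared flash-attention forward.
from transformers.modeling_flash_attention_utils import _get_unpad_data

attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
indices, cu_seqlens, max_seqlen_in_batch = _get_unpad_data(attention_mask)
print(cu_seqlens)  # tensor([0, 3, 5], dtype=torch.int32)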
class GLMRotaryEmbedding(nn.Module):
    def __init__(self, dim, rope_theta=1, original_impl=False, device=None, dtype=None):
        super().__init__()
        inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, device=device).to(dtype=dtype) / dim))
        self.register_buffer("inv_freq", inv_freq)
        self.dim = dim
        self.original_impl = original_impl
        self.rope_theta = rope_theta

    def forward_impl(
        self,
        seq_len: int,
        n_elem: int,
        dtype: torch.dtype,
        device: torch.device,
        base: int = 10000,
    ):
        """Enhanced Transformer with Rotary Position Embedding.
        Derived from: https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/labml_nn/
        transformers/rope/__init__.py. MIT License:
        https://github.com/labmlai/annotated_deep_learning_paper_implementations/blob/master/license.
        """
        # $\Theta = {\theta_i = 10000^{\frac{2(i-1)}{d}}, i \in [1, 2, ..., \frac{d}{2}]}$
        base = base * self.rope_theta
        theta = 1.0 / (base ** (torch.arange(0, n_elem, 2, dtype=torch.float, device=device) / n_elem))

        # Create position indexes `[0, 1, ..., seq_len - 1]`
        seq_idx = torch.arange(seq_len, dtype=torch.float, device=device)

        # Calculate the product of position index and $\theta_i$
        idx_theta = torch.outer(seq_idx, theta).float()

        cache = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1).to(dtype=dtype)
        return cache

    def forward(self, max_seq_len, offset=0):
        return self.forward_impl(
            max_seq_len,
            self.dim,
            dtype=self.inv_freq.dtype,
            device=self.inv_freq.device,
        )
Again, same comment here: this is equivalent to LlamaRotaryEmbedding.
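For comparison, a standalone sketch of the Llama-style rotary embedding the comment refers to: inverse frequencies are registered once and forward() returns cos/sin tables for the requested positions, rather than a stacked cos/sin cache. This illustrates the pattern only and is not the exact transformers class:

import torch
from torch import nn


class RotaryEmbeddingSketch(nn.Module):
    def __init__(self, dim, base=10000.0, device=None):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32, device=device) / dim))
        self.register_buffer("inv_freq", inv_freq, persistent=False)

    @torch.no_grad()
    def forward(self, x, position_ids):
        # (batch, seq_len, dim/2): rotation angle for every position/frequency pair
        freqs = position_ids[:, :, None].float() * self.inv_freq[None, None, :]
        emb = torch.cat((freqs, freqs), dim=-1)
        return emb.cos().to(x.dtype), emb.sin().to(x.dtype)


# Usage: position_ids has shape (batch, seq_len); x only supplies the output dtype.
rope = RotaryEmbeddingSketch(dim=64)
cos, sin = rope(torch.empty(1, dtype=torch.float16), torch.arange(8)[None, :])
print(cos.shape)  # torch.Size([1, 8, 64])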
    return tensor_list


class SelfAttention(torch.nn.Module):
This comment is still waiting!
if self.multi_query_attention:
    self.num_multi_query_groups_per_partition = self.multi_query_group_num
    self.qkv_hidden_size = (
        self.projection_size + 2 * self.hidden_size_per_attention_head * self.multi_query_group_num
    )
again same comment about GQA and MQA
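The GQA/MQA comment refers to the naming and shape convention used across transformers models: a single `num_key_value_heads` (1 for MQA, between 1 and `num_attention_heads` for GQA) combined with a `repeat_kv` expansion, rather than `multi_query_attention` / `multi_query_group_num`. A sketch of that pattern, independent of this PR's code:

import torch


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_kv_heads, seq, head_dim) -> (batch, num_kv_heads * n_rep, seq, head_dim),
    # so grouped key/value heads can be consumed by the standard attention math.
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)


# With 32 query heads and 2 key/value heads, each key/value head is shared by 16 query heads.
num_attention_heads, num_key_value_heads, head_dim = 32, 2, 64
key_states = torch.randn(1, num_key_value_heads, 10, head_dim)
key_states = repeat_kv(key_states, num_attention_heads // num_key_value_heads)
print(key_states.shape)  # torch.Size([1, 32, 10, 64])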
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
🙏🏻
Feel free to ping me again for a review!
BTW @zRzRzRzRzRzRzR, I took over and am currently adding the model. You can find the new PR here: #33823; it should be ready pretty soon.
This is a draft and we will continue working on it.
Who can review?
@ArthurZucker