No Query or Key Found when using nn.TransformerEncoderLayer #30

Open
USBskycrafts opened this issue Sep 25, 2024 · 24 comments

@USBskycrafts

There is a warning that says "=====>>> Warning by Adam-mini: No Query or Key found. If you are training Transformers......".

The existence of Key and Query is judged by self.wqk_names = {"k_proj.weight", "q_proj.weight", "wq.weight", "wk.weight"}, but nn.TransformerEncoderLayer only exposes a single self_attn.in_proj_weight. So I think more work is needed to handle this case.
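
A quick way to see this (the d_model/nhead values below are only for illustration):

import torch.nn as nn

# List the parameter names of a stock nn.TransformerEncoderLayer.
layer = nn.TransformerEncoderLayer(d_model=128, nhead=4)
for name, p in layer.named_parameters():
    print(name, tuple(p.shape))
# Prints e.g. "self_attn.in_proj_weight (384, 128)"; there is no q_proj.weight
# or k_proj.weight, so Adam-mini's keyword match finds nothing.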

@zyushun
Owner

zyushun commented Sep 25, 2024

Hi @USBskycrafts, it seems that you are using a "fused attention" implementation, which merges q, k, v into one big matrix.
We do not support "fused attention" for now, but here is a simple twist:

  1. use the latest version of Adam-mini (version 1.0.4) from PyPI. Try "pip install adam-mini" again if you are using an old version.
  2. In the optimizer, change: self.mlp_names = {} to self.mlp_names = {"self_attn.in_proj_weight"}.

This will compute vmean along the output dimension (equivalently, by row) of the merged QKV matrix. This slightly differs from our original design (compute vmean per head for Q and K, and compute a single vmean for V as a whole). I have not tried this before, but I would guess the performance is similar.
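
For intuition, a rough sketch of the two groupings described above (not Adam-mini's actual code; the shapes are made up):

import torch

d_model, n_heads = 128, 4
in_proj_grad = torch.randn(3 * d_model, d_model)  # gradient of the merged [Q; K; V] matrix

# (a) The workaround above: one vmean value per row (output dimension) of the merged matrix.
vmean_rows = in_proj_grad.pow(2).mean(dim=1)              # shape (3 * d_model,)

# (b) The original design on an unfused layer: one vmean per head for Q and K,
#     and a single vmean for V as a whole.
q_grad, k_grad, v_grad = in_proj_grad.split(d_model, dim=0)
vmean_q = q_grad.view(n_heads, -1).pow(2).mean(dim=1)     # one value per head
vmean_k = k_grad.view(n_heads, -1).pow(2).mean(dim=1)
vmean_v = v_grad.pow(2).mean()                            # single value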

Please have a try and see if it helps.

@buttercutter

buttercutter commented Oct 10, 2024

@zyushun : I tried the following, but it is still not helping. Could you advise?

optimizer_denoiser.mlp_names = {"self_attn.in_proj_weight"}
optimizer_denoiser.output_names.add('embedding.weight')  # actual output layer name, projection layer is using weight-tying with embedding layer

for i in range(num_layers):
    optimizer_denoiser.wqk_names.add(f'transformer_encoder.layers.{i}.self_attn.in_proj_weight')  # For query, key, and value combined
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.self_attn.in_proj_weight')  # Another example for decoder
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.multihead_attn.in_proj_weight')  # Another example for decoder
Adam-mini found the param block with name: embedding.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_weight
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_bias
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.bias
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.bias
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.bias
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.bias
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.weight
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.bias
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.in_proj_weight
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.in_proj_bias
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.out_proj.weight
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.out_proj.bias
Adam-mini found the param block with name: transformer_encoder.layers.1.linear1.weight
Adam-mini found the param block with name: transformer_encoder.layers.1.linear1.bias
Adam-mini found the param block with name: transformer_encoder.layers.1.linear2.weight
Adam-mini found the param block with name: transformer_encoder.layers.1.linear2.bias
Adam-mini found the param block with name: transformer_encoder.layers.1.norm1.weight
Adam-mini found the param block with name: transformer_encoder.layers.1.norm1.bias
Adam-mini found the param block with name: transformer_encoder.layers.1.norm2.weight
Adam-mini found the param block with name: transformer_encoder.layers.1.norm2.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_weight
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_bias
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_weight
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_bias
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.bias
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.weight
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.in_proj_weight
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.in_proj_bias
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.out_proj.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.out_proj.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.in_proj_weight
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.in_proj_bias
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.out_proj.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.out_proj.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.linear1.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.linear1.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.linear2.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.linear2.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.norm1.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.norm1.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.norm2.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.norm2.bias
Adam-mini found the param block with name: transformer_decoder.layers.1.norm3.weight
Adam-mini found the param block with name: transformer_decoder.layers.1.norm3.bias
Adam-mini found the param block with name: norm.weight
Adam-mini found the param block with name: norm.bias
Adam-mini found the param block with name: projection.bias
Adam-mini found the param block with name: denoise_head.0.bias
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original0
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original1
Adam-mini found 1 embedding layers, 0 output layers, 0 Querys, Keys, and Values.

@zyushun
Owner

zyushun commented Oct 10, 2024

Hi @buttercutter, thanks for the update. Please try the following.

  1. run "pip install adam-mini" again. This will install the latest version of Adam-mini (version 1.0.5) in PyPI. Note that it will be version 1.0.5, not version 1.0.4 as you previously used.

  2. run the following

        optimizer = Adam_mini(
            named_parameters = model.named_parameters(),
            lr = lr,
            betas = (beta1, beta2),
            eps = eps,
            weight_decay = weight_decay,
            dim = model_config.dim,
            n_heads = model_config.n_heads,
            n_kv_heads = model_config.n_kv_heads,
        )
        optimizer.mlp_names.add("attn")
        optimizer.mlp_names.add("linear")

You will still receive the log "Adam-mini found ...., 0 Querys, Keys, and Values.", but that is okay; you can ignore it.

Please try :)

@zyushun
Owner

zyushun commented Oct 10, 2024

@buttercutter I just realized that you are already using v1.0.5, so you don't need to reinstall. Just add the following two lines after creating the optimizer.

optimizer.mlp_names.add("attn") 
optimizer.mlp_names.add("linear") 

Please try :)

@buttercutter

@zyushun : I added those two lines, and then I got this strange error.

   File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/adam_mini/adam_mini.py", line 257, in step
     state["vmean"] = torch.zeros_like(state["m"][0:state["neuron_per_gpu"], 0:1], memory_format=torch.preserve_format)
                                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 IndexError: too many indices for tensor of dimension 1

@zyushun
Owner

zyushun commented Oct 10, 2024

Hi @buttercutter, good to know this...
This error occurs because the weight matrices (and the associated gradient matrices) in your codebase are stretched into long vectors. This is a corner case we are aware of, but it is relatively rare in practice. Our current implementation assumes all weight matrices are 2-D matrices rather than 1-D vectors, so we do not support this case yet. Sorry for the inconvenience.
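
A minimal, standalone reproduction of this corner case (illustrative only; the variable names mirror the traceback above, but this is not Adam-mini's code):

import torch

neuron_per_gpu = 96

m_matrix = torch.zeros(96, 32)                 # 2-D weight: the vmean slicing works
vmean_ok = torch.zeros_like(m_matrix[0:neuron_per_gpu, 0:1])

m_vector = torch.zeros(96 * 32)                # the same weight flattened to 1-D
vmean_bad = torch.zeros_like(m_vector[0:neuron_per_gpu, 0:1])
# IndexError: too many indices for tensor of dimension 1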

A simple twist is to remove these lines.

optimizer.mlp_names.add("attn")  # remove this line 
optimizer.mlp_names.add("linear")  # remove this line

By removing these two lines, Adam-mini will use a single learning rate for each block under the PyTorch default partition (except for the embedding and output layers, where it will use Adam). There is no guarantee that this will perform well, but you can give it a try. Usually it does not work well for pre-training, but it can work for fine-tuning.
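
For reference, "a single learning rate for each block" means roughly the following (an illustrative sketch of the idea, not the library's implementation; bias correction is omitted):

import torch

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
param = torch.randn(32, 32)
grad = torch.randn_like(param)

m = torch.zeros_like(param)    # first moment: still element-wise
vmean = torch.zeros(())        # second moment: one scalar for the whole block

m = beta1 * m + (1 - beta1) * grad
vmean = beta2 * vmean + (1 - beta2) * grad.pow(2).mean()
param = param - lr * m / (vmean.sqrt() + eps)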

@buttercutter

@zyushun : Thanks and I really appreciate your prompt reply.

So far, the memory consumption of my training run has not decreased.

You will still receive the log "Adam-mini found ...., 0 Querys, Keys, and Values.", but that is okay; you can ignore it.

optimizer_denoiser.mlp_names = {"self_attn.in_proj_weight"}
optimizer_denoiser.output_names.add('embedding.weight')  # actual output layer name, projection layer is using weight-tying with embedding layer

for i in range(num_layers):
    optimizer_denoiser.wqk_names.add(f'transformer_encoder.layers.{i}.self_attn.in_proj_weight')  # For query, key, and value combined
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.self_attn.in_proj_weight')  # Another example for decoder
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.multihead_attn.in_proj_weight')  # Another example for decoder

@zyushun
Owner

zyushun commented Oct 10, 2024

Hi @buttercutter, you can remove all of these lines and try again.

# remove the following
optimizer_denoiser.mlp_names = {"self_attn.in_proj_weight"}
optimizer_denoiser.output_names.add('embedding.weight')  # actual output layer name, projection layer is using weight-tying with embedding layer

for i in range(num_layers):
    optimizer_denoiser.wqk_names.add(f'transformer_encoder.layers.{i}.self_attn.in_proj_weight')  # For query, key, and value combined
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.self_attn.in_proj_weight')  # Another example for decoder
    optimizer_denoiser.wqk_names.add(f'transformer_decoder.layers.{i}.multihead_attn.in_proj_weight')  # Another example for decoder

"memory consumption does not decrease" this seems weird. Did you use any other orthogonalization tricks to AdamW, like quantization or cpu-offload? Or are you using PagedAdam (which will use quantization)?

@buttercutter

@zyushun : Thanks again for your prompt response.

By removing these two lines, Adam-mini will use a single learning rate for each block under the PyTorch default partition (except for the embedding and output layers, where it will use Adam).

Noted. I increased the model's internal dimension for both the encoder and decoder, yet I still see no downward trend in memory consumption.

Did you apply any other tricks on top of AdamW, like quantization or CPU offload? Or are you using PagedAdam (which uses quantization)?

No. By the way, I am using the MPS backend.

@zyushun
Owner

zyushun commented Oct 10, 2024

Hi @buttercutter, how large is your model? One possible reason: your model is too small, so the embedding + output layers take the major share of the memory. In this case, Adam-mini (at least as of v1.0.5) uses a similar amount of memory to Adam.

The memory reduction usually becomes significant when the model size reaches 1B, where the embedding & output layers take <10% of the total params and Adam-mini saves 45% memory over AdamW.
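
If it helps, here is a quick way to check this proportion for your model (a rough sketch; adjust the keywords to your own layer names):

def embedding_output_fraction(model, keywords=("embedding", "projection", "denoise_head")):
    # Fraction of parameters that sit in the embedding / output layers.
    total = sum(p.numel() for _, p in model.named_parameters())
    embd_out = sum(p.numel() for name, p in model.named_parameters()
                   if any(k in name for k in keywords))
    return embd_out / total

# e.g. print(embedding_output_fraction(model))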

@buttercutter

@zyushun

the embedding & output layers take <10% of the total params and Adam-mini saves 45% memory over AdamW

You are right; the embedding + output layers take the most memory compared to the encoders and decoders in my code.

Adam-mini will use a single learning rate for each block under the PyTorch default partition (except for the embedding and output layers, where it will use Adam)

Is there any plan to enable Adam-mini on the embedding and output layers?

@zyushun
Owner

zyushun commented Oct 10, 2024

@buttercutter Yes! We have developed a new version of Adam-mini (it will be v1.0.6), which will also cut the memory for the embedding & output layers down to 50%. We will update the paper accordingly soon.

I will let you know once we have released v1.0.6. 😄

@buttercutter

@zyushun : I noticed that Adam-mini is currently at v1.1.0.

For the new version, can I ignore the following warnings before I check the traceback error?

 Adam-mini found 0 embedding layers, 0 output layers; 0 Querys and Keys;  0 Values;  0 attn_proj;  0 MLPs;
 =====>>> Warning by Adam-mini: No embedding layer found. If you are training Transformers, please check the name of your embedding layer and manually add them to 'self.embd_names' of Adam-mini. You can do this by  adding an additional line of code: optimizer.embd_names.add('the keywords in the name of your embedding layer').
 =====>>> Warning by Adam-mini: No output layer found. If you are training Transformers (without weight-tying), please check the name of your output layer and manually add them to 'self.output_names' of Adam-mini.  You can do this by adding an additional line of code: optimizer.output_names.add('the keywords in the  name of your output layer').  Please ignore this warning if you are using weight-tying.
 =====>>>  Warning by Adam-mini: No Query or Key found. If you are training Transformers, please check the name of your Query and Key in attention blocks and manually add them to 'self.wqk_names' of Adam-mini. You  can do this by adding two additional lines of code: optimizer.wqk_names.add('the keywords in the  name of your Query' ); optimizer.wqk_names.add('the keywords in the  name of your Key').
 =====>>>  Warning by Adam-mini: No Value found. If you are training Transformers, please check the name of your Value in attention blocks and manually add them to 'self.wv_names' of Adam-mini. You can do this by   adding an additional lines of code: optimizer.wv_names.add('the keywords in the  name of your Value' ).
 =====>>>  Warning by Adam-mini: No attn_proj found. If you are training Transformers, please check the name of your attn_proj in attention blocks and manually add them to 'self.attn_proj_names' of Adam-mini. You   can do this by adding an additional lines of code: optimizer.attn_proj_names.add('the keywords in the  name of your attn_proj' ).
 =====>>>  Warning by Adam-mini: No MLP found. If you are training Transformers, please check the name of your MLP in attention blocks and manually add them to 'self.mlp_names' of Adam-mini. You can do this by      adding an additional lines of code: optimizer.attn_proj_names.add('the keywords in the  name of your MLP' ).
 =====>>>  Warning by Adam-mini: you are using default PyTorch partition for Adam-mini. It can cause training instability on large-scale Transformers.

@zyushun
Owner

zyushun commented Oct 20, 2024

Hi @buttercutter, yes, we have released v1.1.0 of Adam-mini, which saves memory for the embedding & output layers.
Please add the following lines after creating the optimizer.

optimizer.embd_names.add('embedding') #add the keyword of your embedding layer
optimizer.output_names.add('head') #add the keyword of your output layer

Note that we still assume the embedding & output layers are matrices rather than long vectors, so it may raise an error if your codebase automatically reshapes the embedding & output parameters into vectors. If that is the case, you can put the embedding and output layers into adam_block_names and use Adam for them (same as before, and there will be no significant memory drop for your small model).

optimizer.adam_block_names.add('embedding') #add the keyword of your embedding layer
optimizer.adam_block_names.add('head') #add the keyword of your output layer

We will try to support the case where "weight matrices are stretched as long vectors" in future versions, perhaps in v.1.1.1.

@buttercutter

Hi @zyushun , sorry for overwhelming you with a lot of technical questions.

I added the following naming schemes to help Adam-mini locate the layers, but according to the warning log it is still unable to find them.

optimizer_ebm.embd_names.add('embedding') # add the keyword of the embedding layer
optimizer_ebm.output_names.add('denoise_head') # output layer of EBM model is not using projection layer
optimizer_denoiser.embd_names.add('embedding') # add the keyword of the embedding layer
optimizer_denoiser.output_names.add('projection') # projection layer is using weight-tying with embedding layer

optimizer_ebm.mlp_names = {"self_attn"}
optimizer_denoiser.mlp_names = {"self_attn"}

optimizer_ebm.mlp_names.add("attn")
optimizer_ebm.mlp_names.add("linear")
optimizer_denoiser.mlp_names.add("attn")
optimizer_denoiser.mlp_names.add("linear")

optimizer_denoiser.wqk_names.add("self_attn")  # For query, key, and value combined
optimizer_denoiser.wqk_names.add("multihead_attn")
Adam-mini found the param block with name: embedding.weight torch.Size([30522, 32])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_weight torch.Size([96, 32])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_bias torch.Size([96])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.bias torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.bias torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.bias torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.weight torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.bias torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.weight torch.Size([32])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_weight torch.Size([96, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_bias torch.Size([96])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_weight torch.Size([96, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_bias torch.Size([96])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.weight torch.Size([32, 32])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.weight torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.weight torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.bias torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.weight torch.Size([32])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.bias torch.Size([32])
Adam-mini found the param block with name: norm.weight torch.Size([32])
Adam-mini found the param block with name: norm.bias torch.Size([32])
Adam-mini found the param block with name: projection.bias torch.Size([30522])
Adam-mini found the param block with name: denoise_head.0.bias torch.Size([32])
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original0 torch.Size([32, 1, 1])
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original1 torch.Size([32, 32, 3])
Adam-mini found the param block with name: embedding.weight torch.Size([30522, 128])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.0.self_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.0.linear2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.weight torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.weight torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.0.norm2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.1.self_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.linear1.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.1.linear1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.linear2.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_encoder.layers.1.linear2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.norm1.weight torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.norm1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.norm2.weight torch.Size([128])
Adam-mini found the param block with name: transformer_encoder.layers.1.norm2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.self_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.multihead_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.0.linear2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.0.norm3.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.self_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.in_proj_weight torch.Size([384, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.in_proj_bias torch.Size([384])
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.out_proj.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.multihead_attn.out_proj.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.linear1.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.linear1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.linear2.weight torch.Size([128, 128])
Adam-mini found the param block with name: transformer_decoder.layers.1.linear2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm1.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm1.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm2.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm2.bias torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm3.weight torch.Size([128])
Adam-mini found the param block with name: transformer_decoder.layers.1.norm3.bias torch.Size([128])
Adam-mini found the param block with name: norm.weight torch.Size([128])
Adam-mini found the param block with name: norm.bias torch.Size([128])
Adam-mini found the param block with name: projection.bias torch.Size([30522])
Adam-mini found the param block with name: denoise_head.0.bias torch.Size([128])
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original0 torch.Size([128, 1, 1])
Adam-mini found the param block with name: denoise_head.0.parametrizations.weight.original1 torch.Size([128, 128, 3])
Adam-mini found 0 embedding layers, 0 output layers; 0 Querys and Keys;  0 Values;  0 attn_proj;  0 MLPs;
=====>>> Warning by Adam-mini: No embedding layer found. If you are training Transformers, please check the name of your embedding layer and manually add them to 'self.embd_names' of Adam-mini. You can do this by adding an additional line of code: optimizer.embd_names.add('the keywords in the name of your embedding layer'). 
=====>>> Warning by Adam-mini: No output layer found. If you are training Transformers (without weight-tying), please check the name of your output layer and manually add them to 'self.output_names' of Adam-mini. You can do this by adding an additional line of code: optimizer.output_names.add('the keywords in the  name of your output layer').  Please ignore this warning if you are using weight-tying.
=====>>>  Warning by Adam-mini: No Query or Key found. If you are training Transformers, please check the name of your Query and Key in attention blocks and manually add them to 'self.wqk_names' of Adam-mini. You can do this by adding two additional lines of code: optimizer.wqk_names.add('the keywords in the  name of your Query' ); optimizer.wqk_names.add('the keywords in the  name of your Key'). 
=====>>>  Warning by Adam-mini: No Value found. If you are training Transformers, please check the name of your Value in attention blocks and manually add them to 'self.wv_names' of Adam-mini. You can do this by adding an additional lines of code: optimizer.wv_names.add('the keywords in the  name of your Value' ). 
=====>>>  Warning by Adam-mini: No attn_proj found. If you are training Transformers, please check the name of your attn_proj in attention blocks and manually add them to 'self.attn_proj_names' of Adam-mini. You can do this by adding an additional lines of code: optimizer.attn_proj_names.add('the keywords in the  name of your attn_proj' ). 
=====>>>  Warning by Adam-mini: No MLP found. If you are training Transformers, please check the name of your MLP in attention blocks and manually add them to 'self.mlp_names' of Adam-mini. You can do this by adding an additional lines of code: optimizer.attn_proj_names.add('the keywords in the  name of your MLP' ). 
=====>>>  Warning by Adam-mini: you are using default PyTorch partition for Adam-mini. It can cause training instability on large-scale Transformers.

@Sun2018421

Sun2018421 commented Oct 31, 2024

[quotes @buttercutter's comment above in full, including the param-block log and the "No XXX found" warnings]

I get the same warnings (No XXX found) when I try to run run_gpt2.sh (gpt2_small).

@Sun2018421

Sun2018421 commented Nov 5, 2024

I think the issue comes from self.named_parameters in the Adam-mini class. Because it is a generator, it has no elements left after the first loop over it in the init function, so the warnings appear when count_block iterates over the generator a second time. The problem was solved when I changed the code to
"self.named_parameters = list(named_parameters)"
"for param_name, param in self.named_parameters:"
in the init function. I hope this is helpful.
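
For reference, a standalone illustration of the generator behaviour described above (plain PyTorch, not Adam-mini's code):

import torch.nn as nn

model = nn.Linear(4, 4)

gen = model.named_parameters()            # a generator
first_pass = [name for name, _ in gen]    # consumes it -> ['weight', 'bias']
second_pass = [name for name, _ in gen]   # already exhausted -> []

# Materializing it once keeps it reusable for both the partition loop in
# the init function and the later count_block pass:
params = list(model.named_parameters())
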
Hi @zyushun, I am not sure whether there is a problem with my configuration, but the warning is resolved for now. I hope you can help me confirm whether my scheme conforms to your design.

@zyushun
Owner

zyushun commented Nov 5, 2024

@Sun2018421 Hi! Thanks for the update.

Sorry for the late response; I am traveling at the moment. I will get back to your question as soon as I am settled.

Yushun

@Sun2018421

@zyushun Thank you very much for your reply, and I wish you a pleasant trip :)

@zyushun
Owner

zyushun commented Nov 29, 2024

Hi @Sun2018421, we have updated Adam-mini to v1.1.1 and this issue is fixed. Please pip uninstall and then install adam-mini again.

Thanks a lot for mentioning this issue! We have acknowledged your help in the README.

@buttercutter

buttercutter commented Dec 3, 2024

@zyushun Thanks for getting it to v1.1.1

I am still getting the following runtime error with this latest version. It seems that the previous error pops up again, even though the mlp_names lines were removed?

   File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/adam_mini/adam_mini.py", line 303, in step
     state["vmean"] = torch.zeros_like(state["m"][0:state["neuron_per_gpu"], 0:1],
                                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 IndexError: too many indices for tensor of dimension 1

@nighting0le01

@zyushun @buttercutter I get something like:

  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/adam_mini/adam_mini.py", line 303, in step
     state["vmean"] = torch.zeros_like(state["m"][0:state["neuron_per_gpu"], 0:1],
                                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 IndexError: too many indices for tensor of dimension 1

Any solution?

@nighting0le01

I think with FSDP and tensor parallelism you will get:

  File "/opt/homebrew/anaconda3/lib/python3.11/site-packages/adam_mini/adam_mini.py", line 303, in step
     state["vmean"] = torch.zeros_like(state["m"][0:state["neuron_per_gpu"], 0:1],
                                       ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 IndexError: too many indices for tensor of dimension 1

@nighting0le01

[quotes @zyushun's earlier reply above about weight matrices being stretched into long vectors and removing the optimizer.mlp_names lines]

@buttercutter @zyushun @awgu If that is the case, how can it work with torchtitan tensor parallelism and FSDP2?
