Changed is_pipe_parallel setting to fix pipeline-parallel inference #866

Merged: 4 commits, Apr 21, 2023


curt-tigges
Contributor

Fix for #854

@curt-tigges curt-tigges requested a review from a team as a code owner March 31, 2023 15:24
@crazyofapple
Contributor

/gpt-neox/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GPT2ModelPipe:
Missing key(s) in state_dict: "0.word_embeddings.weight", "2.input_layernorm.scale", "2.attention.query_key_value.weight", "2.attention.query_key_value.bias", "2.attention.rotary_emb.inv_freq", "2.attention.dense.weight", "2.attention.dense.bias", "2.post_attention_layernorm.scale", "2.mlp.dense_h_to_4h.weight", "2.mlp.dense_h_to_4h.bias", "2.mlp.dense_4h_to_h.weight", "2.mlp.dense_4h_to_h.bias", "3.input_layernorm.scale", "3.attention.query_key_value.weight", "3.attention.query_key_value.bias", "3.attention.rotary_emb.inv_freq", "3.attention.dense.weight", "3.attention.dense.bias", "3.post_attention_layernorm.scale", "3.mlp.dense_h_to_4h.weight", "3.mlp.dense_h_to_4h.bias", "3.mlp.dense_4h_to_h.weight", "3.mlp.dense_4h_to_h.bias", "4.input_layernorm.scale", "4.attention.query_key_value.weight", "4.attention.query_key_value.bias", "4.attention.rotary_emb.inv_freq", "4.attention.dense.weight", "4.attention.dense.bias", "4.post_attention_layernorm.scale", "4.mlp.dense_h_to_4h.weight", "4.mlp.dense_h_to_4h.bias", "4.mlp.dense_4h_to_h.weight", "4.mlp.dense_4h_to_h.bias", "5.input_layernorm.scale", "5.attention.query_key_value.weight", "5.attention.query_key_value.bias", "5.attention.rotary_emb.inv_freq", "5.attention.dense.weight", "5.attention.dense.bias", "5.post_attention_layernorm.scale", "5.mlp.dense_h_to_4h.weight", "5.mlp.dense_h_to_4h.bias", "5.mlp.dense_4h_to_h.weight", "5.mlp.dense_4h_to_h.bias", "6.input_layernorm.scale", "6.attention.query_key_value.weight", "6.attention.query_key_value.bias", "6.attention.rotary_emb.inv_freq", "6.attention.dense.weight", "6.attention.dense.bias", "6.post_attention_layernorm.scale", "6.mlp.dense_h_to_4h.weight", "6.mlp.dense_h_to_4h.bias", "6.mlp.dense_4h_to_h.weight", "6.mlp.dense_4h_to_h.bias", "7.input_layernorm.scale", "7.attention.query_key_value.weight", "7.attention.query_key_value.bias", "7.attention.rotary_emb.inv_freq", "7.attention.dense.weight", "7.attention.dense.bias", "7.post_attention_layernorm.scale", 
"7.mlp.dense_h_to_4h.weight", "7.mlp.dense_h_to_4h.bias", "7.mlp.dense_4h_to_h.weight", "7.mlp.dense_4h_to_h.bias", "8.input_layernorm.scale", "8.attention.query_key_value.weight", "8.attention.query_key_value.bias", "8.attention.rotary_emb.inv_freq", "8.attention.dense.weight", "8.attention.dense.bias", "8.post_attention_layernorm.scale", "8.mlp.dense_h_to_4h.weight", "8.mlp.dense_h_to_4h.bias", "8.mlp.dense_4h_to_h.weight", "8.mlp.dense_4h_to_h.bias", "9.input_layernorm.scale", "9.attention.query_key_value.weight", "9.attention.query_key_value.bias", "9.attention.rotary_emb.inv_freq", "9.attention.dense.weight", "9.attention.dense.bias", "9.post_attention_layernorm.scale", "9.mlp.dense_h_to_4h.weight", "9.mlp.dense_h_to_4h.bias", "9.mlp.dense_4h_to_h.weight", "9.mlp.dense_4h_to_h.bias", "10.input_layernorm.scale", "10.attention.query_key_value.weight", "10.attention.query_key_value.bias", "10.attention.rotary_emb.inv_freq", "10.attention.dense.weight", "10.attention.dense.bias", "10.post_attention_layernorm.scale", "10.mlp.dense_h_to_4h.weight", "10.mlp.dense_h_to_4h.bias", "10.mlp.dense_4h_to_h.weight", "10.mlp.dense_4h_to_h.bias", "11.input_layernorm.scale", "11.attention.query_key_value.weight", "11.attention.query_key_value.bias", "11.attention.rotary_emb.inv_freq", "11.attention.dense.weight", "11.attention.dense.bias", "11.post_attention_layernorm.scale", "11.mlp.dense_h_to_4h.weight", "11.mlp.dense_h_to_4h.bias", "11.mlp.dense_4h_to_h.weight", "11.mlp.dense_4h_to_h.bias", "12.input_layernorm.scale", "12.attention.query_key_value.weight", "12.attention.query_key_value.bias", "12.attention.rotary_emb.inv_freq", "12.attention.dense.weight", "12.attention.dense.bias", "12.post_attention_layernorm.scale", "12.mlp.dense_h_to_4h.weight", "12.mlp.dense_h_to_4h.bias", "12.mlp.dense_4h_to_h.weight", "12.mlp.dense_4h_to_h.bias", "13.input_layernorm.scale", "13.attention.query_key_value.weight", "13.attention.query_key_value.bias", "13.attention.rotary_emb.inv_freq", 
"13.attention.dense.weight", "13.attention.dense.bias", "13.post_attention_layernorm.scale", "13.mlp.dense_h_to_4h.weight", "13.mlp.dense_h_to_4h.bias", "13.mlp.dense_4h_to_h.weight", "13.mlp.dense_4h_to_h.bias", "14.input_layernorm.scale", "14.attention.query_key_value.weight", "14.attention.query_key_value.bias", "14.attention.rotary_emb.inv_freq", "14.attention.dense.weight", "14.attention.dense.bias", "14.post_attention_layernorm.scale", "14.mlp.dense_h_to_4h.weight", "14.mlp.dense_h_to_4h.bias", "14.mlp.dense_4h_to_h.weight", "14.mlp.dense_4h_to_h.bias", "15.input_layernorm.scale", "15.attention.query_key_value.weight", "15.attention.query_key_value.bias", "15.attention.rotary_emb.inv_freq", "15.attention.dense.weight", "15.attention.dense.bias", "15.post_attention_layernorm.scale", "15.mlp.dense_h_to_4h.weight", "15.mlp.dense_h_to_4h.bias", "15.mlp.dense_4h_to_h.weight", "15.mlp.dense_4h_to_h.bias", "16.input_layernorm.scale", "16.attention.query_key_value.weight", "16.attention.query_key_value.bias", "16.attention.rotary_emb.inv_freq", "16.attention.dense.weight", "16.attention.dense.bias", "16.post_attention_layernorm.scale", "16.mlp.dense_h_to_4h.weight", "16.mlp.dense_h_to_4h.bias", "16.mlp.dense_4h_to_h.weight", "16.mlp.dense_4h_to_h.bias", "17.input_layernorm.scale", "17.attention.query_key_value.weight", "17.attention.query_key_value.bias", "17.attention.rotary_emb.inv_freq", "17.attention.dense.weight", "17.attention.dense.bias", "17.post_attention_layernorm.scale", "17.mlp.dense_h_to_4h.weight", "17.mlp.dense_h_to_4h.bias", "17.mlp.dense_4h_to_h.weight", "17.mlp.dense_4h_to_h.bias", "18.input_layernorm.scale", "18.attention.query_key_value.weight", "18.attention.query_key_value.bias", "18.attention.rotary_emb.inv_freq", "18.attention.dense.weight", "18.attention.dense.bias", "18.post_attention_layernorm.scale", "18.mlp.dense_h_to_4h.weight", "18.mlp.dense_h_to_4h.bias", "18.mlp.dense_4h_to_h.weight", "18.mlp.dense_4h_to_h.bias", 
"19.input_layernorm.scale", "19.attention.query_key_value.weight", "19.attention.query_key_value.bias", "19.attention.rotary_emb.inv_freq", "19.attention.dense.weight", "19.attention.dense.bias", "19.post_attention_layernorm.scale", "19.mlp.dense_h_to_4h.weight", "19.mlp.dense_h_to_4h.bias", "19.mlp.dense_4h_to_h.weight", "19.mlp.dense_4h_to_h.bias", "20.input_layernorm.scale", "20.attention.query_key_value.weight", "20.attention.query_key_value.bias", "20.attention.rotary_emb.inv_freq", "20.attention.dense.weight", "20.attention.dense.bias", "20.post_attention_layernorm.scale", "20.mlp.dense_h_to_4h.weight", "20.mlp.dense_h_to_4h.bias", "20.mlp.dense_4h_to_h.weight", "20.mlp.dense_4h_to_h.bias", "21.input_layernorm.scale", "21.attention.query_key_value.weight", "21.attention.query_key_value.bias", "21.attention.rotary_emb.inv_freq", "21.attention.dense.weight", "21.attention.dense.bias", "21.post_attention_layernorm.scale", "21.mlp.dense_h_to_4h.weight", "21.mlp.dense_h_to_4h.bias", "21.mlp.dense_4h_to_h.weight", "21.mlp.dense_4h_to_h.bias", "22.input_layernorm.scale", "22.attention.query_key_value.weight", "22.attention.query_key_value.bias", "22.attention.rotary_emb.inv_freq", "22.attention.dense.weight", "22.attention.dense.bias", "22.post_attention_layernorm.scale", "22.mlp.dense_h_to_4h.weight", "22.mlp.dense_h_to_4h.bias", "22.mlp.dense_4h_to_h.weight", "22.mlp.dense_4h_to_h.bias", "23.input_layernorm.scale", "23.attention.query_key_value.weight", "23.attention.query_key_value.bias", "23.attention.rotary_emb.inv_freq", "23.attention.dense.weight", "23.attention.dense.bias", "23.post_attention_layernorm.scale", "23.mlp.dense_h_to_4h.weight", "23.mlp.dense_h_to_4h.bias", "23.mlp.dense_4h_to_h.weight", "23.mlp.dense_4h_to_h.bias", "24.input_layernorm.scale", "24.attention.query_key_value.weight", "24.attention.query_key_value.bias", "24.attention.rotary_emb.inv_freq", "24.attention.dense.weight", "24.attention.dense.bias", "24.post_attention_layernorm.scale", 
"24.mlp.dense_h_to_4h.weight", "24.mlp.dense_h_to_4h.bias", "24.mlp.dense_4h_to_h.weight", "24.mlp.dense_4h_to_h.bias", "25.input_layernorm.scale", "25.attention.query_key_value.weight", "25.attention.query_key_value.bias", "25.attention.rotary_emb.inv_freq", "25.attention.dense.weight", "25.attention.dense.bias", "25.post_attention_layernorm.scale", "25.mlp.dense_h_to_4h.weight", "25.mlp.dense_h_to_4h.bias", "25.mlp.dense_4h_to_h.weight", "25.mlp.dense_4h_to_h.bias", "26.input_layernorm.scale", "26.attention.query_key_value.weight", "26.attention.query_key_value.bias", "26.attention.rotary_emb.inv_freq", "26.attention.dense.weight", "26.attention.dense.bias", "26.post_attention_layernorm.scale", "26.mlp.dense_h_to_4h.weight", "26.mlp.dense_h_to_4h.bias", "26.mlp.dense_4h_to_h.weight", "26.mlp.dense_4h_to_h.bias", "27.input_layernorm.scale", "27.attention.query_key_value.weight", "27.attention.query_key_value.bias", "27.attention.rotary_emb.inv_freq", "27.attention.dense.weight", "27.attention.dense.bias", "27.post_attention_layernorm.scale", "27.mlp.dense_h_to_4h.weight", "27.mlp.dense_h_to_4h.bias", "27.mlp.dense_4h_to_h.weight", "27.mlp.dense_4h_to_h.bias", "28.input_layernorm.scale", "28.attention.query_key_value.weight", "28.attention.query_key_value.bias", "28.attention.rotary_emb.inv_freq", "28.attention.dense.weight", "28.attention.dense.bias", "28.post_attention_layernorm.scale", "28.mlp.dense_h_to_4h.weight", "28.mlp.dense_h_to_4h.bias", "28.mlp.dense_4h_to_h.weight", "28.mlp.dense_4h_to_h.bias", "29.input_layernorm.scale", "29.attention.query_key_value.weight", "29.attention.query_key_value.bias", "29.attention.rotary_emb.inv_freq", "29.attention.dense.weight", "29.attention.dense.bias", "29.post_attention_layernorm.scale", "29.mlp.dense_h_to_4h.weight", "29.mlp.dense_h_to_4h.bias", "29.mlp.dense_4h_to_h.weight", "29.mlp.dense_4h_to_h.bias", "30.input_layernorm.scale", "30.attention.query_key_value.weight", "30.attention.query_key_value.bias", 
"30.attention.rotary_emb.inv_freq", "30.attention.dense.weight", "30.attention.dense.bias", "30.post_attention_layernorm.scale", "30.mlp.dense_h_to_4h.weight", "30.mlp.dense_h_to_4h.bias", "30.mlp.dense_4h_to_h.weight", "30.mlp.dense_4h_to_h.bias", "31.input_layernorm.scale", "31.attention.query_key_value.weight", "31.attention.query_key_value.bias", "31.attention.rotary_emb.inv_freq", "31.attention.dense.weight", "31.attention.dense.bias", "31.post_attention_layernorm.scale", "31.mlp.dense_h_to_4h.weight", "31.mlp.dense_h_to_4h.bias", "31.mlp.dense_4h_to_h.weight", "31.mlp.dense_4h_to_h.bias", "32.input_layernorm.scale", "32.attention.query_key_value.weight", "32.attention.query_key_value.bias", "32.attention.rotary_emb.inv_freq", "32.attention.dense.weight", "32.attention.dense.bias", "32.post_attention_layernorm.scale", "32.mlp.dense_h_to_4h.weight", "32.mlp.dense_h_to_4h.bias", "32.mlp.dense_4h_to_h.weight", "32.mlp.dense_4h_to_h.bias", "33.input_layernorm.scale", "33.attention.query_key_value.weight", "33.attention.query_key_value.bias", "33.attention.rotary_emb.inv_freq", "33.attention.dense.weight", "33.attention.dense.bias", "33.post_attention_layernorm.scale", "33.mlp.dense_h_to_4h.weight", "33.mlp.dense_h_to_4h.bias", "33.mlp.dense_4h_to_h.weight", "33.mlp.dense_4h_to_h.bias", "35.norm.scale", "36.final_linear.weight".
Unexpected key(s) in state_dict: "sequential.0.word_embeddings.weight", "sequential.2.input_layernorm.scale", "sequential.2.attention.query_key_value.weight", "sequential.2.attention.query_key_value.bias", "sequential.2.attention.rotary_emb.inv_freq", "sequential.2.attention.dense.weight", "sequential.2.attention.dense.bias", "sequential.2.post_attention_layernorm.scale", "sequential.2.mlp.dense_h_to_4h.weight", "sequential.2.mlp.dense_h_to_4h.bias", "sequential.2.mlp.dense_4h_to_h.weight", "sequential.2.mlp.dense_4h_to_h.bias", "sequential.3.input_layernorm.scale", "sequential.3.attention.query_key_value.weight", "sequential.3.attention.query_key_value.bias", "sequential.3.attention.rotary_emb.inv_freq", "sequential.3.attention.dense.weight", "sequential.3.attention.dense.bias", "sequential.3.post_attention_layernorm.scale", "sequential.3.mlp.dense_h_to_4h.weight", "sequential.3.mlp.dense_h_to_4h.bias", "sequential.3.mlp.dense_4h_to_h.weight", "sequential.3.mlp.dense_4h_to_h.bias", "sequential.4.input_layernorm.scale", "sequential.4.attention.query_key_value.weight", "sequential.4.attention.query_key_value.bias", "sequential.4.attention.rotary_emb.inv_freq", "sequential.4.attention.dense.weight", "sequential.4.attention.dense.bias", "sequential.4.post_attention_layernorm.scale", "sequential.4.mlp.dense_h_to_4h.weight", "sequential.4.mlp.dense_h_to_4h.bias", "sequential.4.mlp.dense_4h_to_h.weight", "sequential.4.mlp.dense_4h_to_h.bias", "sequential.5.input_layernorm.scale", "sequential.5.attention.query_key_value.weight", "sequential.5.attention.query_key_value.bias", "sequential.5.attention.rotary_emb.inv_freq", "sequential.5.attention.dense.weight", "sequential.5.attention.dense.bias", "sequential.5.post_attention_layernorm.scale", "sequential.5.mlp.dense_h_to_4h.weight", "sequential.5.mlp.dense_h_to_4h.bias", "sequential.5.mlp.dense_4h_to_h.weight", "sequential.5.mlp.dense_4h_to_h.bias", "sequential.6.input_layernorm.scale", 
"sequential.6.attention.query_key_value.weight", "sequential.6.attention.query_key_value.bias", "sequential.6.attention.rotary_emb.inv_freq", "sequential.6.attention.dense.weight", "sequential.6.attention.dense.bias", "sequential.6.post_attention_layernorm.scale", "sequential.6.mlp.dense_h_to_4h.weight", "sequential.6.mlp.dense_h_to_4h.bias", "sequential.6.mlp.dense_4h_to_h.weight", "sequential.6.mlp.dense_4h_to_h.bias", "sequential.7.input_layernorm.scale", "sequential.7.attention.query_key_value.weight", "sequential.7.attention.query_key_value.bias", "sequential.7.attention.rotary_emb.inv_freq", "sequential.7.attention.dense.weight", "sequential.7.attention.dense.bias", "sequential.7.post_attention_layernorm.scale", "sequential.7.mlp.dense_h_to_4h.weight", "sequential.7.mlp.dense_h_to_4h.bias", "sequential.7.mlp.dense_4h_to_h.weight", "sequential.7.mlp.dense_4h_to_h.bias", "sequential.8.input_layernorm.scale", "sequential.8.attention.query_key_value.weight", "sequential.8.attention.query_key_value.bias", "sequential.8.attention.rotary_emb.inv_freq", "sequential.8.attention.dense.weight", "sequential.8.attention.dense.bias", "sequential.8.post_attention_layernorm.scale", "sequential.8.mlp.dense_h_to_4h.weight", "sequential.8.mlp.dense_h_to_4h.bias", "sequential.8.mlp.dense_4h_to_h.weight", "sequential.8.mlp.dense_4h_to_h.bias", "sequential.9.input_layernorm.scale", "sequential.9.attention.query_key_value.weight", "sequential.9.attention.query_key_value.bias", "sequential.9.attention.rotary_emb.inv_freq", "sequential.9.attention.dense.weight", "sequential.9.attention.dense.bias", "sequential.9.post_attention_layernorm.scale", "sequential.9.mlp.dense_h_to_4h.weight", "sequential.9.mlp.dense_h_to_4h.bias", "sequential.9.mlp.dense_4h_to_h.weight", "sequential.9.mlp.dense_4h_to_h.bias", "sequential.10.input_layernorm.scale", "sequential.10.attention.query_key_value.weight", "sequential.10.attention.query_key_value.bias", "sequential.10.attention.rotary_emb.inv_freq", 
"sequential.10.attention.dense.weight", "sequential.10.attention.dense.bias", "sequential.10.post_attention_layernorm.scale", "sequential.10.mlp.dense_h_to_4h.weight", "sequential.10.mlp.dense_h_to_4h.bias", "sequential.10.mlp.dense_4h_to_h.weight", "sequential.10.mlp.dense_4h_to_h.bias", "sequential.11.input_layernorm.scale", "sequential.11.attention.query_key_value.weight", "sequential.11.attention.query_key_value.bias", "sequential.11.attention.rotary_emb.inv_freq", "sequential.11.attention.dense.weight", "sequential.11.attention.dense.bias", "sequential.11.post_attention_layernorm.scale", "sequential.11.mlp.dense_h_to_4h.weight", "sequential.11.mlp.dense_h_to_4h.bias", "sequential.11.mlp.dense_4h_to_h.weight", "sequential.11.mlp.dense_4h_to_h.bias", "sequential.12.input_layernorm.scale", "sequential.12.attention.query_key_value.weight", "sequential.12.attention.query_key_value.bias", "sequential.12.attention.rotary_emb.inv_freq", "sequential.12.attention.dense.weight", "sequential.12.attention.dense.bias", "sequential.12.post_attention_layernorm.scale", "sequential.12.mlp.dense_h_to_4h.weight", "sequential.12.mlp.dense_h_to_4h.bias", "sequential.12.mlp.dense_4h_to_h.weight", "sequential.12.mlp.dense_4h_to_h.bias", "sequential.13.input_layernorm.scale", "sequential.13.attention.query_key_value.weight", "sequential.13.attention.query_key_value.bias", "sequential.13.attention.rotary_emb.inv_freq", "sequential.13.attention.dense.weight", "sequential.13.attention.dense.bias", "sequential.13.post_attention_layernorm.scale", "sequential.13.mlp.dense_h_to_4h.weight", "sequential.13.mlp.dense_h_to_4h.bias", "sequential.13.mlp.dense_4h_to_h.weight", "sequential.13.mlp.dense_4h_to_h.bias", "sequential.14.input_layernorm.scale", "sequential.14.attention.query_key_value.weight", "sequential.14.attention.query_key_value.bias", "sequential.14.attention.rotary_emb.inv_freq", "sequential.14.attention.dense.weight", "sequential.14.attention.dense.bias", 
"sequential.14.post_attention_layernorm.scale", "sequential.14.mlp.dense_h_to_4h.weight", "sequential.14.mlp.dense_h_to_4h.bias", "sequential.14.mlp.dense_4h_to_h.weight", "sequential.14.mlp.dense_4h_to_h.bias", "sequential.15.input_layernorm.scale", "sequential.15.attention.query_key_value.weight", "sequential.15.attention.query_key_value.bias", "sequential.15.attention.rotary_emb.inv_freq", "sequential.15.attention.dense.weight", "sequential.15.attention.dense.bias", "sequential.15.post_attention_layernorm.scale", "sequential.15.mlp.dense_h_to_4h.weight", "sequential.15.mlp.dense_h_to_4h.bias", "sequential.15.mlp.dense_4h_to_h.weight", "sequential.15.mlp.dense_4h_to_h.bias", "sequential.16.input_layernorm.scale", "sequential.16.attention.query_key_value.weight", "sequential.16.attention.query_key_value.bias", "sequential.16.attention.rotary_emb.inv_freq", "sequential.16.attention.dense.weight", "sequential.16.attention.dense.bias", "sequential.16.post_attention_layernorm.scale", "sequential.16.mlp.dense_h_to_4h.weight", "sequential.16.mlp.dense_h_to_4h.bias", "sequential.16.mlp.dense_4h_to_h.weight", "sequential.16.mlp.dense_4h_to_h.bias", "sequential.17.input_layernorm.scale", "sequential.17.attention.query_key_value.weight", "sequential.17.attention.query_key_value.bias", "sequential.17.attention.rotary_emb.inv_freq", "sequential.17.attention.dense.weight", "sequential.17.attention.dense.bias", "sequential.17.post_attention_layernorm.scale", "sequential.17.mlp.dense_h_to_4h.weight", "sequential.17.mlp.dense_h_to_4h.bias", "sequential.17.mlp.dense_4h_to_h.weight", "sequential.17.mlp.dense_4h_to_h.bias", "sequential.18.input_layernorm.scale", "sequential.18.attention.query_key_value.weight", "sequential.18.attention.query_key_value.bias", "sequential.18.attention.rotary_emb.inv_freq", "sequential.18.attention.dense.weight", "sequential.18.attention.dense.bias", "sequential.18.post_attention_layernorm.scale", "sequential.18.mlp.dense_h_to_4h.weight", 
"sequential.18.mlp.dense_h_to_4h.bias", "sequential.18.mlp.dense_4h_to_h.weight", "sequential.18.mlp.dense_4h_to_h.bias", "sequential.19.input_layernorm.scale", "sequential.19.attention.query_key_value.weight", "sequential.19.attention.query_key_value.bias", "sequential.19.attention.rotary_emb.inv_freq", "sequential.19.attention.dense.weight", "sequential.19.attention.dense.bias", "sequential.19.post_attention_layernorm.scale", "sequential.19.mlp.dense_h_to_4h.weight", "sequential.19.mlp.dense_h_to_4h.bias", "sequential.19.mlp.dense_4h_to_h.weight", "sequential.19.mlp.dense_4h_to_h.bias", "sequential.20.input_layernorm.scale", "sequential.20.attention.query_key_value.weight", "sequential.20.attention.query_key_value.bias", "sequential.20.attention.rotary_emb.inv_freq", "sequential.20.attention.dense.weight", "sequential.20.attention.dense.bias", "sequential.20.post_attention_layernorm.scale", "sequential.20.mlp.dense_h_to_4h.weight", "sequential.20.mlp.dense_h_to_4h.bias", "sequential.20.mlp.dense_4h_to_h.weight", "sequential.20.mlp.dense_4h_to_h.bias", "sequential.21.input_layernorm.scale", "sequential.21.attention.query_key_value.weight", "sequential.21.attention.query_key_value.bias", "sequential.21.attention.rotary_emb.inv_freq", "sequential.21.attention.dense.weight", "sequential.21.attention.dense.bias", "sequential.21.post_attention_layernorm.scale", "sequential.21.mlp.dense_h_to_4h.weight", "sequential.21.mlp.dense_h_to_4h.bias", "sequential.21.mlp.dense_4h_to_h.weight", "sequential.21.mlp.dense_4h_to_h.bias", "sequential.22.input_layernorm.scale", "sequential.22.attention.query_key_value.weight", "sequential.22.attention.query_key_value.bias", "sequential.22.attention.rotary_emb.inv_freq", "sequential.22.attention.dense.weight", "sequential.22.attention.dense.bias", "sequential.22.post_attention_layernorm.scale", "sequential.22.mlp.dense_h_to_4h.weight", "sequential.22.mlp.dense_h_to_4h.bias", "sequential.22.mlp.dense_4h_to_h.weight", 
"sequential.22.mlp.dense_4h_to_h.bias", "sequential.23.input_layernorm.scale", "sequential.23.attention.query_key_value.weight", "sequential.23.attention.query_key_value.bias", "sequential.23.attention.rotary_emb.inv_freq", "sequential.23.attention.dense.weight", "sequential.23.attention.dense.bias", "sequential.23.post_attention_layernorm.scale", "sequential.23.mlp.dense_h_to_4h.weight", "sequential.23.mlp.dense_h_to_4h.bias", "sequential.23.mlp.dense_4h_to_h.weight", "sequential.23.mlp.dense_4h_to_h.bias", "sequential.24.input_layernorm.scale", "sequential.24.attention.query_key_value.weight", "sequential.24.attention.query_key_value.bias", "sequential.24.attention.rotary_emb.inv_freq", "sequential.24.attention.dense.weight", "sequential.24.attention.dense.bias", "sequential.24.post_attention_layernorm.scale", "sequential.24.mlp.dense_h_to_4h.weight", "sequential.24.mlp.dense_h_to_4h.bias", "sequential.24.mlp.dense_4h_to_h.weight", "sequential.24.mlp.dense_4h_to_h.bias", "sequential.25.input_layernorm.scale", "sequential.25.attention.query_key_value.weight", "sequential.25.attention.query_key_value.bias", "sequential.25.attention.rotary_emb.inv_freq", "sequential.25.attention.dense.weight", "sequential.25.attention.dense.bias", "sequential.25.post_attention_layernorm.scale", "sequential.25.mlp.dense_h_to_4h.weight", "sequential.25.mlp.dense_h_to_4h.bias", "sequential.25.mlp.dense_4h_to_h.weight", "sequential.25.mlp.dense_4h_to_h.bias", "sequential.26.input_layernorm.scale", "sequential.26.attention.query_key_value.weight", "sequential.26.attention.query_key_value.bias", "sequential.26.attention.rotary_emb.inv_freq", "sequential.26.attention.dense.weight", "sequential.26.attention.dense.bias", "sequential.26.post_attention_layernorm.scale", "sequential.26.mlp.dense_h_to_4h.weight", "sequential.26.mlp.dense_h_to_4h.bias", "sequential.26.mlp.dense_4h_to_h.weight", "sequential.26.mlp.dense_4h_to_h.bias", "sequential.27.input_layernorm.scale", 
"sequential.27.attention.query_key_value.weight", "sequential.27.attention.query_key_value.bias", "sequential.27.attention.rotary_emb.inv_freq", "sequential.27.attention.dense.weight", "sequential.27.attention.dense.bias", "sequential.27.post_attention_layernorm.scale", "sequential.27.mlp.dense_h_to_4h.weight", "sequential.27.mlp.dense_h_to_4h.bias", "sequential.27.mlp.dense_4h_to_h.weight", "sequential.27.mlp.dense_4h_to_h.bias", "sequential.28.input_layernorm.scale", "sequential.28.attention.query_key_value.weight", "sequential.28.attention.query_key_value.bias", "sequential.28.attention.rotary_emb.inv_freq", "sequential.28.attention.dense.weight", "sequential.28.attention.dense.bias", "sequential.28.post_attention_layernorm.scale", "sequential.28.mlp.dense_h_to_4h.weight", "sequential.28.mlp.dense_h_to_4h.bias", "sequential.28.mlp.dense_4h_to_h.weight", "sequential.28.mlp.dense_4h_to_h.bias", "sequential.29.input_layernorm.scale", "sequential.29.attention.query_key_value.weight", "sequential.29.attention.query_key_value.bias", "sequential.29.attention.rotary_emb.inv_freq", "sequential.29.attention.dense.weight", "sequential.29.attention.dense.bias", "sequential.29.post_attention_layernorm.scale", "sequential.29.mlp.dense_h_to_4h.weight", "sequential.29.mlp.dense_h_to_4h.bias", "sequential.29.mlp.dense_4h_to_h.weight", "sequential.29.mlp.dense_4h_to_h.bias", "sequential.30.input_layernorm.scale", "sequential.30.attention.query_key_value.weight", "sequential.30.attention.query_key_value.bias", "sequential.30.attention.rotary_emb.inv_freq", "sequential.30.attention.dense.weight", "sequential.30.attention.dense.bias", "sequential.30.post_attention_layernorm.scale", "sequential.30.mlp.dense_h_to_4h.weight", "sequential.30.mlp.dense_h_to_4h.bias", "sequential.30.mlp.dense_4h_to_h.weight", "sequential.30.mlp.dense_4h_to_h.bias", "sequential.31.input_layernorm.scale", "sequential.31.attention.query_key_value.weight", "sequential.31.attention.query_key_value.bias", 
"sequential.31.attention.rotary_emb.inv_freq", "sequential.31.attention.dense.weight", "sequential.31.attention.dense.bias", "sequential.31.post_attention_layernorm.scale", "sequential.31.mlp.dense_h_to_4h.weight", "sequential.31.mlp.dense_h_to_4h.bias", "sequential.31.mlp.dense_4h_to_h.weight", "sequential.31.mlp.dense_4h_to_h.bias", "sequential.32.input_layernorm.scale", "sequential.32.attention.query_key_value.weight", "sequential.32.attention.query_key_value.bias", "sequential.32.attention.rotary_emb.inv_freq", "sequential.32.attention.dense.weight", "sequential.32.attention.dense.bias", "sequential.32.post_attention_layernorm.scale", "sequential.32.mlp.dense_h_to_4h.weight", "sequential.32.mlp.dense_h_to_4h.bias", "sequential.32.mlp.dense_4h_to_h.weight", "sequential.32.mlp.dense_4h_to_h.bias", "sequential.33.input_layernorm.scale", "sequential.33.attention.query_key_value.weight", "sequential.33.attention.query_key_value.bias", "sequential.33.attention.rotary_emb.inv_freq", "sequential.33.attention.dense.weight", "sequential.33.attention.dense.bias", "sequential.33.post_attention_layernorm.scale", "sequential.33.mlp.dense_h_to_4h.weight", "sequential.33.mlp.dense_h_to_4h.bias", "sequential.33.mlp.dense_4h_to_h.weight", "sequential.33.mlp.dense_4h_to_h.bias", "sequential.35.norm.scale", "sequential.36.final_linear.weight".

@StellaAthena StellaAthena self-assigned this Apr 17, 2023
@StellaAthena StellaAthena added the bug Something isn't working label Apr 17, 2023
@curt-tigges
Contributor Author

I have read the CLA Document and I hereby sign the CLA

@Quentin-Anthony
Member

@crazyofapple: You're seeing this error because you saved the checkpoint before this PR (when the check was self.pipe_parallel_size >= 2, which produced a sequential model and checkpoint) and are now loading it with this PR applied (where the check is self.pipe_parallel_size >= 1). The loader then tries to convert the sequential checkpoint to a GPT2ModelPipe and fails.

If you need to load those model weights intact, you'll have to leave this commit out. Otherwise, delete that old checkpoint and update to this commit.
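
To make the mismatch concrete, here is a minimal sketch in plain Python. It assumes only what the comment and the traceback above show: the threshold moved from >= 2 to >= 1, and the saved keys carry a "sequential." prefix that the pipeline model does not expect. The function names (`is_pipe_parallel_before`, `is_pipe_parallel_after`, `diff_keys`, `strip_prefix`) and the toy checkpoint are hypothetical and are not GPT-NeoX code. Stripping the prefix only illustrates the naming difference; it is untested as a migration path, so the advice above (skip the commit or regenerate the checkpoint) still stands.

```python
from collections import OrderedDict


def is_pipe_parallel_before(pipe_parallel_size: int) -> bool:
    """Check as described for the code before this PR (hypothetical helper)."""
    return pipe_parallel_size >= 2


def is_pipe_parallel_after(pipe_parallel_size: int) -> bool:
    """Check as described for the code with this PR (hypothetical helper)."""
    return pipe_parallel_size >= 1


def diff_keys(expected_keys, state_dict):
    """Return (missing, unexpected) key lists, like the RuntimeError reports."""
    expected = set(expected_keys)
    found = set(state_dict)
    return sorted(expected - found), sorted(found - expected)


def strip_prefix(state_dict, prefix="sequential."):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Toy stand-ins: two of the key names from the traceback, with dummy values.
saved = OrderedDict(
    [
        ("sequential.0.word_embeddings.weight", "w0"),
        ("sequential.35.norm.scale", "s35"),
    ]
)
expected = ["0.word_embeddings.weight", "35.norm.scale"]

# With pipe_parallel_size == 1 the two checks disagree about the model type.
print(is_pipe_parallel_before(1), is_pipe_parallel_after(1))

# The saved keys do not match what the pipeline model expects...
missing, unexpected = diff_keys(expected, saved)
print("missing:", missing)
print("unexpected:", unexpected)

# ...and the only difference in naming is the "sequential." prefix.
print(diff_keys(expected, strip_prefix(saved)))
```

A size of 1 is the case that changes behavior: the old check builds a sequential model, the new one builds a pipeline model, which is why a checkpoint saved under the old check no longer matches.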

@Quentin-Anthony Quentin-Anthony merged commit 1faff79 into main Apr 21, 2023
@Quentin-Anthony Quentin-Anthony deleted the curt/parallel-inference branch April 21, 2023 17:21
bzantium pushed a commit that referenced this pull request Apr 26, 2023
* add flash_attn_kvpacked

* fix formatting

* accept changes from main & resolve conflicts

* Error

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* errors

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* feat(ci): add pip caching to CI

* Set training attribute appropriately

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Split up FlashAttention methods

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Comment out clear_cache

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Just remove clear_cache

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Fix pre-commit formatting

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Changed is_pipe_parallel setting to fix pipeline-parallel inference (#866)

* Changed is_pipe_parallel setting to fix pipeline-parallel inference

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

---------

Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>

* feat: improve typing

* Added DeeperSpeed to requirements.txt

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* Update train.py

update train.py 
1. black formatter.
2. remove unnecessary import
3. add more arguments

* Update utils.py

Black formatting
Add logic required to expand "~"

* Update train.py

removed num_proc
temporarily disabled emoji
added continuing subword prefix option (does not work well with Bytelevel)

* Update utils.py

improve reader error handling

* Update train.py

add whitespace related handling.
add whitespace argument expose
reconstruct pre_tokenizer_list
add more whitespace to check tokenizer invertibility

* Update train.py

* Update utils.py

remove unnecessary print

* Update train.py

set dropout default to None
import path related code.
Change normalizer
change buffer_tokens
change whitespace reservation handling

* Update train.py

Clear whitespace_reservation TODO
add single_whitespace argument (might be necessary for invertibility)

* Create .gitignore

add gitignore file to ignore artifacts

* Update train.py

add directory parsing error checks
add more metrics
(tokenizer reconstructions, unicode fallback portion)

* Update preprocess.py

path handling changes
black formatting

* Update train.py

change from GPT2TokenizerFast to PreTrainedTokenizerFast class

* Update train.py

enhanced test string

* Update utils.py

add logic to handle jsonl, txt input
add logic to handle folder with jsonl,txt or arrow dataset

* Update train.py

add byte_fallback option expose
(incompatible with current transformer wrapper)
change dataset_loading with new util.py
add dataset shuffling option

* Update utils.py

fix error in loading sequence

* Update train.py

fix whitespace preservation logic

* Update train.py

simplify data loading logic.
remove unnecessary special tokens

* Update train.py

remove emoji related code

* Update train.py

add whitespace processing regex
r"\s{16,}"

* update tokenizer

add whitespace pretokenizer
(only processes looong whitespaces)

* Update train.py

* Update train.py

add camel case regex

* Update train.py

separate camel_case regex

* Update train.py

* Update train.py

---------

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Satpal Singh Rathore <satpal.code@gmail.com>
Co-authored-by: Dashiell Stander <dstander@protonmail.com>
Co-authored-by: Saurav Maheshkar <sauravvmaheshkar@gmail.com>
Co-authored-by: Stella Biderman <stellabiderman@gmail.com>
Co-authored-by: Curt Tigges <ct@curttigges.com>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
Labels: bug (Something isn't working)
4 participants