85%+ of the llama model could be redundant #989
I'm just skimming through the paper quickly, and I'm no AI expert whatsoever, but I do think I see one problem: the datasets they use to determine redundancy focus on English only, while LLaMA supports a lot more languages than just English, including programming languages, which would be pruned away completely with their datasets. I don't think this is necessarily a show-stopping problem; it just means that in order to use this technique for LLaMA we'd need datasets specifically suited to LLaMA. If you were to use their original dataset, LLaMA would become an English-only LLM.
The Penn Treebank development set.
See also SparseGPT/LLaMA: https://github.com/lachlansneff/sparsellama, https://arxiv.org/abs/2301.00774
I didn't dig into it yet, just my two cents: LLaMA uses SwiGLU, BERT uses GELU, and others use ReLU. Next is the already mentioned factor: if you use a limited training set, then you basically lobotomize all areas you didn't train on at all, you damage the areas you didn't train on enough, and you remove a lot of the nuances the model learned, making it more "pragmatic" and less "creative".
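To make the activation-function point concrete, here is a minimal sketch (plain PyTorch, with made-up dimensions, not the real model sizes) contrasting a LLaMA-style SwiGLU feed-forward block with a classic ReLU block: with ReLU, unused hidden units produce exact zeros that redundancy analyses can latch onto, whereas SwiGLU gate outputs are almost never exactly zero, so criteria developed on BERT-style models don't transfer directly.

```python
# Minimal sketch: SwiGLU (LLaMA-style) vs. ReLU feed-forward blocks.
# Dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """LLaMA-style FFN: silu(x W_gate) * (x W_up), then W_down."""
    def __init__(self, dim=64, hidden=172):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class ReLUFFN(nn.Module):
    """Classic transformer FFN: relu(x W1) W2 -- activations are often exactly zero."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(F.relu(self.w1(x)))

x = torch.randn(8, 64)
h_swiglu = F.silu(SwiGLUFFN().w_gate(x))   # hidden activations behind the SwiGLU gate
h_relu = F.relu(ReLUFFN().w1(x))           # hidden activations behind ReLU
print("exactly-zero hidden units (SwiGLU gate):", (h_swiglu == 0).float().mean().item())
print("exactly-zero hidden units (ReLU):       ", (h_relu == 0).float().mean().item())
```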
This might be true, but it would still be beneficial for specific use cases (e.g. English only). The paper states for LLaMA:
Just to mention: other Latin-script languages are very similar to English (same Latin word roots), and the Cyrillic languages allow the model to be used as a simple translation engine (and make it more interesting for other countries and more people in general). I believe you can't just remove them without any damage to the model.
I wonder if pruning can act as a kind of finetuning where the resulting model is much smaller. For example, what if one pruned using the instruct data people are finetuning on? Or using the output distributions of a larger model, as in knowledge distillation? In the latter case the model could possibly increase in strength rather than decrease. Today I found there is structured LLaMA pruning code at https://github.com/horseee/LLaMA-Pruning and https://github.com/VainF/Torch-Pruning. Note that high-quality pruning and quantization finetune the model during the optimization to reduce the impact on performance; the above approach does not appear to do that.
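For reference, here is a hedged sketch of structured magnitude pruning using PyTorch's built-in `torch.nn.utils.prune` utilities as a stand-in for what the linked repos do; the layer sizes and the 20% ratio are illustrative, and note that this only zeroes rows rather than physically shrinking the tensors:

```python
# Rough sketch of structured magnitude pruning with PyTorch's pruning utilities.
# Layer sizes and the pruning ratio are made up for illustration.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

ffn = nn.Sequential(nn.Linear(512, 1376), nn.SiLU(), nn.Linear(1376, 512))

# Zero out whole rows (output neurons) of the first projection, ranked by L2 norm.
prune.ln_structured(ffn[0], name="weight", amount=0.2, n=2, dim=0)

# Bake the mask into the weight tensor (the rows stay zeroed, not removed).
prune.remove(ffn[0], "weight")

x = torch.randn(4, 512)
print(ffn(x).shape)  # forward pass still works; 20% of the rows are now zero
```

Actually shrinking the matrices (and the model file) would require slicing out the zeroed rows and the matching columns of the next projection, which is what the structured-pruning repos above automate.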
Has anyone pruned Alpaca/Vicuna and uploaded it somewhere?
There is also the knowledge distillation technique. I would like to see someone distill the 65B model into a 7B model.
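For context, the core of logit distillation is just a temperature-softened KL loss between the teacher's and student's output distributions; a minimal sketch (placeholder tensors, not actual 65B/7B models):

```python
# Minimal sketch of the standard logit-distillation loss.
# "Teacher" and "student" logits here are random placeholders.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage: a batch of 4 positions over a 32k-entry vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```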
What's the sparsification news? (Or was this issue closed inaccurately?)
This could be used to prune (lobotomize) the model and remove unused "knowledge" to reduce the model size and required RAM and to improve inference performance. Neuron usage statistics could be collected for a given set of use cases (e.g. leave the model running in production for some months and then keep only the knowledge that was actually used).
When additional knowledge is required (when some sleeping neurons get activated), the model should load and connect those weight partitions. Unused weight partitions could be removed from memory by disconnecting the unused neurons, similar to how the OS manages its cache.
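A hedged sketch of what collecting such per-neuron usage statistics might look like, using PyTorch forward hooks on a toy stand-in model; the activity threshold and the 1% cutoff are arbitrary choices, and nothing like this exists in llama.cpp today:

```python
# Sketch: count how often each hidden unit "fires" over a stream of inputs.
# Thresholds, model, and traffic are placeholders for illustration.
import torch
import torch.nn as nn

usage_counts = {}   # hook name -> per-neuron firing counts
token_count = 0

def make_hook(name):
    def hook(module, inputs, output):
        global token_count
        fired = (output.abs() > 1e-3).float()             # "active" above a small threshold
        flat = fired.reshape(-1, fired.shape[-1])
        usage_counts[name] = usage_counts.get(name, 0) + flat.sum(dim=0)
        token_count += flat.shape[0]
    return hook

# Toy stand-in for a model; in practice you would hook the FFN layers of LLaMA.
model = nn.Sequential(nn.Linear(64, 256), nn.SiLU(), nn.Linear(256, 64))
model[1].register_forward_hook(make_hook("ffn.act"))

with torch.no_grad():
    for _ in range(100):                                  # stand-in for production traffic
        model(torch.randn(8, 64))

rarely_used = (usage_counts["ffn.act"] / token_count) < 0.01
print("candidate neurons to drop or unload:", int(rarely_used.sum()))
```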
I see lazy loading has already been implemented in #613. However, I believe this implementation still loads and processes weights that may not contribute to the final inference result. Hence, we should distinguish between "used weights" and "relevant weights." The challenge here would be to dynamically detect which weight partitions will be relevant for the inference process. One relatively simple initial approach would be:
A more complex approach would involve identifying multiple mappings of relevant weight partitions and dynamically detecting which weights will be required by the subsequent layers. In other words, the model weights would be grouped into "knowledge topics" that are loaded and used only when required.
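To illustrate the general idea (not #613's actual mechanism), here is a rough sketch of a partition cache that memory-maps per-group weight files on demand and evicts the least recently used ones; the file layout, eviction policy, and the notion of a "partition id" are all invented for illustration:

```python
# Sketch of on-demand weight-partition loading, assuming weights were pre-split
# into per-group .npy files. Everything here is a hypothetical layout, not how
# llama.cpp stores or maps its weights.
import os
import numpy as np

class PartitionCache:
    """Load weight partitions lazily and evict the least recently used ones."""
    def __init__(self, directory, max_resident=8):
        self.directory = directory
        self.max_resident = max_resident
        self.resident = {}            # partition id -> np.ndarray (memory-mapped)
        self.last_used = {}           # partition id -> access counter
        self.clock = 0

    def get(self, pid):
        self.clock += 1
        if pid not in self.resident:
            if len(self.resident) >= self.max_resident:
                victim = min(self.last_used, key=self.last_used.get)
                del self.resident[victim], self.last_used[victim]
            path = os.path.join(self.directory, f"partition_{pid}.npy")
            self.resident[pid] = np.load(path, mmap_mode="r")   # map, don't copy
        self.last_used[pid] = self.clock
        return self.resident[pid]

# Usage idea: a layer asks cache.get(pid) only for partitions whose neurons the
# relevance check (however it is defined) decided are needed for this input.
```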
Turns out that most LLM parameters are redundant, see https://aclanthology.org/2020.emnlp-main.398.pdf.
They run the experiments with BERT and XLNet, and code for the pruning is provided.
There's apparently lots of room for improvement, since LLaMA is very similar to those models. If someone's interested, this could be a nice thing to try 😄
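As a rough illustration of the kind of analysis the paper describes, here is a toy sketch that flags pairs of neurons whose activations are highly correlated across a sample of inputs as redundancy candidates; the 0.95 threshold and the random data are placeholders, not values from the paper or its released code:

```python
# Toy sketch: correlation-based neuron redundancy analysis.
# Highly correlated neurons are candidates for merging or pruning.
import torch

def redundant_neuron_pairs(activations, threshold=0.95):
    """activations: (num_samples, num_neurons) matrix of recorded activations."""
    a = activations - activations.mean(dim=0, keepdim=True)
    a = a / (a.norm(dim=0, keepdim=True) + 1e-8)
    corr = a.T @ a                                    # (num_neurons, num_neurons)
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if corr[i, j].abs() > threshold]

# Toy usage: 1000 samples of 64 "neurons", with neuron 1 a noisy copy of neuron 0.
acts = torch.randn(1000, 64)
acts[:, 1] = acts[:, 0] + 0.01 * torch.randn(1000)
print(redundant_neuron_pairs(acts)[:5])               # should include the pair (0, 1)
```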