85%+ of the llama model could be redundant #989

Closed
teaalltr opened this issue Apr 14, 2023 · 12 comments

Comments

@teaalltr

Turns out that most LLM parameters are redundant, see https://aclanthology.org/2020.emnlp-main.398.pdf.
They run the experiment with BERT and XLNet. Code for the pruning is provided.
There's apparently a lot of room for improvement, since LLaMA is quite similar to those models. If someone's interested, that could be a nice thing to try 😄
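For context, the paper's redundancy analysis looks at how strongly neuron activations correlate with one another over a probe corpus: highly correlated neurons are treated as redundant and only one representative per cluster is kept. Below is a minimal sketch of that idea, not the authors' released implementation; the random data and the 0.7 threshold are purely illustrative.

```python
# Minimal sketch of correlation-based neuron redundancy analysis, in the spirit of
# Dalvi et al. (2020). Not the paper's released code; `activations` stands in for
# real hidden states and the threshold is illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def redundant_neuron_clusters(activations: np.ndarray, threshold: float = 0.7):
    """Group neurons whose activations are highly correlated over a probe corpus.

    activations: (n_tokens, n_neurons) matrix of hidden states from one layer.
    Returns a list of clusters (lists of neuron indices); keeping one
    representative per cluster is the basic pruning idea.
    """
    corr = np.corrcoef(activations, rowvar=False)      # neuron-by-neuron correlation
    dist = 1.0 - np.abs(corr)                          # highly correlated neurons are "close"
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed distances for scipy
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
    clusters = {}
    for neuron, label in enumerate(labels):
        clusters.setdefault(label, []).append(neuron)
    return list(clusters.values())

# Toy example with random data (real usage would collect activations from the model):
acts = np.random.randn(1000, 64)
print(len(redundant_neuron_clusters(acts)))
```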

@Azeirah
Contributor

Azeirah commented Apr 15, 2023

https://github.com/fdalvi/analyzing-redundancy-in-pretrained-transformer-models

@Azeirah
Contributor

Azeirah commented Apr 15, 2023

I'm just skimming through the paper quickly, and I'm no AI expert whatsoever, but I do think I see one problem: the datasets they use to determine redundancy focus on English only, while LLaMA supports a lot more languages than just English, including programming languages, which would be pruned away completely with their datasets.

I don't think this is necessarily a show-stopping problem; it just means that in order to use this technique for LLaMA we'd need more datasets specifically suited to LLaMA. If you were to use their original dataset, LLaMA would become an English-only LLM.

3.1 To analyze the general redundancy in pre-trained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks – Sequence Labeling and Sequence Classification tasks. For the sequence labeling tasks, we study core linguistic tasks, i) part-of-speech (POS) tagging using the Penn Treebank, ii) CCG super tagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017) and iv) syntactic chunking using the CoNLL 2000 shared task dataset (Sang and Buchholz, 2000).

The Penn Treebank development set

Building a large annotated corpus of English: the Penn Treebank

@jon-chuang
Contributor

See also SparseGPT/LLaMA: https://github.com/lachlansneff/sparsellama, https://arxiv.org/abs/2301.00774
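For anyone who wants a feel for what sparsifying a weight matrix means in practice, here is a deliberately simple magnitude-pruning baseline in PyTorch. This is not SparseGPT itself (which uses approximate second-order information and a layer-wise reconstruction step); it only zeroes the smallest-magnitude weights of a single linear layer.

```python
# Deliberately simple magnitude-pruning baseline. This is NOT SparseGPT; it only
# zeroes the smallest-magnitude entries of one linear layer's weight matrix.
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the `sparsity` fraction of smallest-magnitude entries, in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    weight[weight.abs() <= threshold] = 0.0
    return weight

layer = torch.nn.Linear(1024, 1024, bias=False)
magnitude_prune_(layer.weight.data, sparsity=0.5)
print((layer.weight == 0).float().mean())  # ~0.5
```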

@cmp-nct
Contributor

cmp-nct commented Apr 16, 2023

I didn't dig into it yet, so just my 10 cents: LLaMA uses SwiGLU, BERT uses GELU, and others use ReLU.
SwiGLU seems like a very heavy activation to me, and it also increases the perceived neuronal density of the network.

Next is the already mentioned factor: if you prune using a limited training set, you basically lobotomize all the areas you didn't cover at all, damage the areas you didn't cover enough, and remove a lot of the nuances the model learned, making it more "pragmatic" and less "creative".
I agree that this type of optimization is interesting, but it comes with non-trivial consequences and complications.
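For reference, the structural difference being pointed at here: a BERT-style FFN uses two projections with GELU, while LLaMA's SwiGLU FFN uses three projections with a SiLU gate, so each block carries an extra weight matrix. A small PyTorch sketch; the dimensions are scaled down for illustration (LLaMA-7B actually uses d_model=4096 and d_ff=11008).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluFFN(nn.Module):
    """BERT-style feed-forward block: two matrices, GELU activation."""
    def __init__(self, d_model=512, d_ff=1376):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUFFN(nn.Module):
    """LLaMA-style feed-forward block: three matrices, SiLU-gated product."""
    def __init__(self, d_model=512, d_ff=1376):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: elementwise product of a SiLU-gated projection and a plain projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(1, 512)
print(GeluFFN()(x).shape, SwiGLUFFN()(x).shape)  # same output shape, ~1.5x the FFN params for SwiGLU
```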

@Green-Sky
Collaborator

@Azeirah

the datasets they use to determine redundancy focus on English only, LLaMA supports a lot more languages than just English,

this might be true, but it would still be beneficial for specific use cases (e.g. English only).

the LLaMA paper states:

... performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model.

languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.

@netsvetaev

netsvetaev commented Apr 17, 2023

Just to mention: the other Latin-script languages are very similar to English (shared Latin word roots), and the Cyrillic-script languages allow the model to be used as a simple translation engine (and make it more interesting for other countries and for more people in general).

I believe you can't just remove them without damaging the model.

@xloem
Contributor

xloem commented Apr 24, 2023

I wonder if pruning could act as a form of finetuning where the resulting model is much smaller. For example, what if one pruned using the instruct data people are finetuning on? Or using the output distributions of a larger model, as in knowledge distillation? In the latter case the model could possibly even increase in strength rather than decrease.

Today I found there is structured LLaMA pruning code at https://github.com/horseee/LLaMA-Pruning and https://github.com/VainF/Torch-Pruning .

Note that high-quality pruning and quantization methods finetune the model during the optimization to reduce the impact on performance. The approaches above do not appear to do that.

@Alumniminium

Has anyone pruned alpaca/vicuna and uploaded it somewhere?

@ivanstepanovftw
Collaborator

There is also the knowledge distillation technique. I would like to see someone distill the 65B model into a 7B one.
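For reference, the core of standard knowledge distillation (Hinton et al., 2015) is a KL-divergence loss between temperature-softened teacher and student distributions. A minimal PyTorch sketch of that loss follows; the tensor shapes are toy values, not a working 65B-to-7B recipe.

```python
# Minimal sketch of the standard knowledge-distillation loss (Hinton et al., 2015):
# KL divergence between temperature-softened teacher and student distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on softened logits, scaled by T^2 as usual."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy example: 4 token positions over a 32k-entry vocabulary.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```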

@xloem
Contributor

xloem commented Jul 28, 2023

what’s the sparsification news? (or was this issue closed inaccurately?)

@kripper

kripper commented Aug 27, 2023

  1. @ggerganov have you considered keeping neuron activation statistics?

This could be used to prune (lobotomize) the model and remove unused "knowledge", reducing the model size and the required RAM and improving inference performance.

The neuron usage statistics could be collected for a given set of use cases (e.g. leave the model running in production for some months and then keep only the knowledge that was actually used); a sketch of the idea follows after this list.

  2. Another interesting approach, one that doesn't require lobotomizing the model, would be to lazy-load the model weights dynamically by partitions:

When additional knowledge is required (i.e. when some sleeping neurons get activated), the model would load and connect the corresponding weight partitions.

Unused weight partitions could be evicted from memory by disconnecting those unused neurons, similar to how the OS manages its cache.
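A minimal Python/PyTorch sketch of the statistics-gathering idea from point 1, using forward hooks on a Hugging Face LLaMA checkpoint. This is an illustration only, not an existing llama.cpp feature; the model path is a placeholder, and treating a positive gate pre-activation as "fired" is a simplification, since SiLU is nonzero everywhere.

```python
# Illustration only (not an existing llama.cpp feature): collect per-neuron
# activation statistics with forward hooks on a Hugging Face LLaMA checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llama-7b"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activation_counts = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output: (batch, seq_len, d_ff) gate pre-activations of this block's MLP.
        fired = (output > 0).sum(dim=(0, 1)).float().cpu()
        activation_counts[name] = activation_counts.get(name, 0) + fired
    return hook

# Attach a hook to the gate projection of every transformer block's feed-forward.
for i, layer in enumerate(model.model.layers):
    layer.mlp.gate_proj.register_forward_hook(make_hook(f"layer{i}.gate_proj"))

# Run whatever "production traffic" should be profiled; a single sentence here.
with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

# Neurons that (almost) never fire across a large probe set are pruning candidates.
never_fired = {k: int((v == 0).sum()) for k, v in activation_counts.items()}
print(list(never_fired.items())[:2])
```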

@kripper

kripper commented Aug 27, 2023

I see lazy loading has already been implemented in #613.

However, I believe this implementation is still loading and processing weights that may not contribute to the final inference result.

Hence, we should distinguish between "used weights" and "relevant weights."

The challenge here would be to dynamically detect which weight partitions will be relevant for the inference process.

One relatively simple initial approach would be:

  1. Keep statistics of relevant weights for a set of use cases, e.g., for English prompts only.
  2. Configure the model to load and use only those relevant weights using a map file.

A more complex approach would involve identifying multiple mappings of relevant weight partitions and dynamically detecting which weights will be required by the subsequent layers. In other words, the model weights would be grouped by "knowledge topics" that are loaded and used only when required.
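Following on from that, a hedged sketch of turning the activation statistics from the earlier hook example into a per-layer map of "relevant" neurons that could be stored alongside the model. The JSON file format here is purely illustrative and not an existing llama.cpp/GGUF feature; a loader could use such a map to slice gate_proj/up_proj rows and down_proj columns for a given workload.

```python
# Purely illustrative (not an existing llama.cpp/GGUF feature): turn the
# `activation_counts` gathered by the hook sketch above into a per-layer map of
# the most frequently firing neurons, saved next to the model as a map file.
import json
import torch

def build_keep_map(activation_counts, keep_fraction: float = 0.15):
    """For each profiled layer, keep the indices of the most frequently firing neurons."""
    keep_map = {}
    for name, counts in activation_counts.items():
        k = max(1, int(counts.numel() * keep_fraction))
        top = torch.topk(counts, k).indices.sort().values
        keep_map[name] = top.tolist()
    return keep_map

keep_map = build_keep_map(activation_counts, keep_fraction=0.15)
with open("relevant_neurons.json", "w") as f:
    json.dump(keep_map, f)
```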
