85%+ of the llama model could be redundant #989

Closed
teaalltr opened this issue Apr 14, 2023 · 12 comments

Comments

@teaalltr

Turns out that most LLM parameters are redundant, see https://aclanthology.org/2020.emnlp-main.398.pdf.
They run the experiment with BERT and XLNet. Code for the pruning is provided.
There's apparently a lot of room for improvement, since LLaMA is quite similar to those models. If someone's interested, that could be a nice thing to try 😄
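For context, the paper's redundancy analysis looks at how strongly neuron activations correlate with one another over a probe corpus: highly correlated neurons are treated as redundant and only one representative per cluster is kept. Below is a minimal sketch of that idea, not the authors' released implementation; the random data and the 0.7 threshold are purely illustrative.

```python
# Minimal sketch of correlation-based neuron redundancy analysis, in the spirit of
# Dalvi et al. (2020). Not the paper's released code; `activations` stands in for
# real hidden states and the threshold is illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def redundant_neuron_clusters(activations: np.ndarray, threshold: float = 0.7):
    """Group neurons whose activations are highly correlated over a probe corpus.

    activations: (n_tokens, n_neurons) matrix of hidden states from one layer.
    Returns a list of clusters (lists of neuron indices); keeping one
    representative per cluster is the basic pruning idea.
    """
    corr = np.corrcoef(activations, rowvar=False)      # neuron-by-neuron correlation
    dist = 1.0 - np.abs(corr)                          # highly correlated neurons are "close"
    condensed = dist[np.triu_indices_from(dist, k=1)]  # condensed distances for scipy
    Z = linkage(condensed, method="average")
    labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
    clusters = {}
    for neuron, label in enumerate(labels):
        clusters.setdefault(label, []).append(neuron)
    return list(clusters.values())

# Toy example with random data (real usage would collect activations from the model):
acts = np.random.randn(1000, 64)
print(len(redundant_neuron_clusters(acts)))
```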

@Azeirah
Contributor

Azeirah commented Apr 15, 2023

https://github.com/fdalvi/analyzing-redundancy-in-pretrained-transformer-models

@Azeirah
Contributor

Azeirah commented Apr 15, 2023

I'm just skimming through the paper quickly, and I'm no AI expert whatsoever, but I do think I see one problem: the datasets they use to determine redundancy focus on English only, while LLaMA supports a lot more languages than just English, including programming languages, which would be pruned away completely with their datasets.

I don't think this is necessarily a show-stopping problem; it just means that in order to use this technique for LLaMA we'd need more datasets specifically suited to LLaMA. If you were to use their original dataset, LLaMA would become an English-only LLM.

3.1 To analyze the general redundancy in pre-trained models, we use the Penn Treebank development set (Marcus et al., 1993), which consists of roughly 44,000 tokens. For task-specific analysis, we use two broad categories of downstream tasks – Sequence Labeling and Sequence Classification tasks. For the sequence labeling tasks, we study core linguistic tasks, i) part-of-speech (POS) tagging using the Penn Treebank, ii) CCG super tagging using CCGBank (Hockenmaier, 2006), iii) semantic tagging (SEM) using Parallel Meaning Bank data (Abzianidze and Bos, 2017) and iv) syntactic chunking using the CoNLL 2000 shared task dataset (Sang and Buchholz, 2000).

The Penn Treebank development set

Building a large annotated corpus of English: the Penn Treebank

@jon-chuang
Contributor

See also SparseGPT/LLaMA: https://github.com/lachlansneff/sparsellama, https://arxiv.org/abs/2301.00774
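For anyone who wants a feel for what sparsifying a weight matrix means in practice, here is a deliberately simple magnitude-pruning baseline in PyTorch. This is not SparseGPT itself (which uses approximate second-order information and a layer-wise reconstruction step); it only zeroes the smallest-magnitude weights of a single linear layer.

```python
# Deliberately simple magnitude-pruning baseline. This is NOT SparseGPT; it only
# zeroes the smallest-magnitude entries of one linear layer's weight matrix.
import torch

def magnitude_prune_(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the `sparsity` fraction of smallest-magnitude entries, in place."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    weight[weight.abs() <= threshold] = 0.0
    return weight

layer = torch.nn.Linear(1024, 1024, bias=False)
magnitude_prune_(layer.weight.data, sparsity=0.5)
print((layer.weight == 0).float().mean())  # ~0.5
```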

@cmp-nct
Contributor

cmp-nct commented Apr 16, 2023

I didn't dig into it yet, so just my 10 cents: LLaMA uses SwiGLU, BERT uses GELU, and others use ReLU.
SwiGLU seems like a very heavy activation to me, and it also increases the perceived neuronal density of the network.

Next is the already mentioned factor: if you prune using a limited training set, you basically lobotomize all the areas you didn't cover at all, damage the areas you didn't cover enough, and remove a lot of the nuances the model learned, making it more "pragmatic" and less "creative".
I agree that this type of optimization is interesting, but it comes with non-trivial consequences and complications.
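For reference, the structural difference being pointed at here: a BERT-style FFN uses two projections with GELU, while LLaMA's SwiGLU FFN uses three projections with a SiLU gate, so each block carries an extra weight matrix. A small PyTorch sketch; the dimensions are scaled down for illustration (LLaMA-7B actually uses d_model=4096 and d_ff=11008).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluFFN(nn.Module):
    """BERT-style feed-forward block: two matrices, GELU activation."""
    def __init__(self, d_model=512, d_ff=1376):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))

class SwiGLUFFN(nn.Module):
    """LLaMA-style feed-forward block: three matrices, SiLU-gated product."""
    def __init__(self, d_model=512, d_ff=1376):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # SwiGLU: elementwise product of a SiLU-gated projection and a plain projection.
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(1, 512)
print(GeluFFN()(x).shape, SwiGLUFFN()(x).shape)  # same output shape, ~1.5x the FFN params for SwiGLU
```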

@Green-Sky
Collaborator

@Azeirah

the datasets they use to determine redundancy focus on English only, LLaMA supports a lot more languages than just English,

this might be true, but it would still be beneficial for specific use cases (e.g. English only).

the LLaMA paper states:

... performs language identification with a fastText linear classifier to remove non-English pages and filters low quality content with an n-gram language model.

languages, which use either the Latin or Cyrillic scripts: bg, ca, cs, da, de, en, es, fr, hr, hu, it, nl, pl, pt, ro, ru, sl, sr, sv, uk.

@netsvetaev

netsvetaev commented Apr 17, 2023

Just to mention: the other Latin-script languages are very similar to English (shared Latin word roots), and the Cyrillic-script languages allow the model to be used as a simple translation engine (and make it more interesting for other countries and for more people in general).

I believe you can't just remove them without damaging the model.

@xloem
Contributor

xloem commented Apr 24, 2023

I wonder if pruning could act as a form of finetuning where the resulting model is much smaller. For example, what if one pruned using the instruct data people are finetuning on? Or using the output distributions of a larger model, as in knowledge distillation? In the latter case the model could possibly even increase in strength rather than decrease.

Today I found there is structured LLaMA pruning code at https://github.com/horseee/LLaMA-Pruning and https://github.com/VainF/Torch-Pruning .

Note that high-quality pruning and quantization methods finetune the model during the optimization to reduce the impact on performance. The approaches above do not appear to do that.

@Alumniminium

Has anyone pruned alpaca/vicuna and uploaded it somewhere?

@ivanstepanovftw
Collaborator

There is also the knowledge distillation technique. I would like to see someone distill the 65B model into a 7B one.
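For reference, the core of standard knowledge distillation (Hinton et al., 2015) is a KL-divergence loss between temperature-softened teacher and student distributions. A minimal PyTorch sketch of that loss follows; the tensor shapes are toy values, not a working 65B-to-7B recipe.

```python
# Minimal sketch of the standard knowledge-distillation loss (Hinton et al., 2015):
# KL divergence between temperature-softened teacher and student distributions.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on softened logits, scaled by T^2 as usual."""
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

# Toy example: 4 token positions over a 32k-entry vocabulary.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = distillation_loss(student, teacher)
loss.backward()
print(loss.item())
```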

@xloem
Contributor

xloem commented Jul 28, 2023

what’s the sparsification news? (or was this issue closed inaccurately?)

@kripper

kripper commented Aug 27, 2023

  1. @ggerganov have you considered keeping neuron activation statistics?

This could be used to prune (lobotomize) the model and remove unused "knowledge", reducing the model size and the required RAM and improving inference performance.

The neuron usage statistics could be collected for a given set of use cases (e.g. leave the model running in production for some months and then keep only the knowledge that was actually used); a sketch of the idea follows after this list.

  2. Another interesting approach, one that doesn't require lobotomizing the model, would be to lazy-load the model weights dynamically by partitions:

When additional knowledge is required (i.e. when some sleeping neurons get activated), the model would load and connect the corresponding weight partitions.

Unused weight partitions could be evicted from memory by disconnecting those unused neurons, similar to how the OS manages its cache.
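A minimal Python/PyTorch sketch of the statistics-gathering idea from point 1, using forward hooks on a Hugging Face LLaMA checkpoint. This is an illustration only, not an existing llama.cpp feature; the model path is a placeholder, and treating a positive gate pre-activation as "fired" is a simplification, since SiLU is nonzero everywhere.

```python
# Illustration only (not an existing llama.cpp feature): collect per-neuron
# activation statistics with forward hooks on a Hugging Face LLaMA checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/llama-7b"  # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activation_counts = {}

def make_hook(name):
    def hook(module, inputs, output):
        # output: (batch, seq_len, d_ff) gate pre-activations of this block's MLP.
        fired = (output > 0).sum(dim=(0, 1)).float().cpu()
        activation_counts[name] = activation_counts.get(name, 0) + fired
    return hook

# Attach a hook to the gate projection of every transformer block's feed-forward.
for i, layer in enumerate(model.model.layers):
    layer.mlp.gate_proj.register_forward_hook(make_hook(f"layer{i}.gate_proj"))

# Run whatever "production traffic" should be profiled; a single sentence here.
with torch.no_grad():
    batch = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    model(**batch)

# Neurons that (almost) never fire across a large probe set are pruning candidates.
never_fired = {k: int((v == 0).sum()) for k, v in activation_counts.items()}
print(list(never_fired.items())[:2])
```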

@kripper

kripper commented Aug 27, 2023

I see lazy loading has already been implemented in #613.

However, I believe this implementation is still loading and processing weights that may not contribute to the final inference result.

Hence, we should distinguish between "used weights" and "relevant weights."

The challenge here would be to dynamically detect which weight partitions will be relevant for the inference process.

One relatively simple initial approach would be:

  1. Keep statistics of relevant weights for a set of use cases, e.g., for English prompts only.
  2. Configure the model to load and use only those relevant weights using a map file.

A more complex approach would involve identifying multiple mappings of relevant weight partitions and dynamically detecting which weights will be required by the subsequent layers. In other words, the model weights would be grouped by "knowledge topics" that are loaded and used only when required.
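Following on from that, a hedged sketch of turning the activation statistics from the earlier hook example into a per-layer map of "relevant" neurons that could be stored alongside the model. The JSON file format here is purely illustrative and not an existing llama.cpp/GGUF feature; a loader could use such a map to slice gate_proj/up_proj rows and down_proj columns for a given workload.

```python
# Purely illustrative (not an existing llama.cpp/GGUF feature): turn the
# `activation_counts` gathered by the hook sketch above into a per-layer map of
# the most frequently firing neurons, saved next to the model as a map file.
import json
import torch

def build_keep_map(activation_counts, keep_fraction: float = 0.15):
    """For each profiled layer, keep the indices of the most frequently firing neurons."""
    keep_map = {}
    for name, counts in activation_counts.items():
        k = max(1, int(counts.numel() * keep_fraction))
        top = torch.topk(counts, k).indices.sort().values
        keep_map[name] = top.tolist()
    return keep_map

keep_map = build_keep_map(activation_counts, keep_fraction=0.15)
with open("relevant_neurons.json", "w") as f:
    json.dump(keep_map, f)
```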
