Add exllama GPTQ CUDA kernel support #553

fxmarty · 2023-07-05T16:19:49Z

Examples:

text-generation-launcher --model-id Narsil/starcoder-gptq --port 8080 --num-shard 2 --quantize gptq
GPTQ_GROUPSIZE=128 GPTQ_BITS=4 text-generation-launcher --model-id TheBloke/WizardLM-7B-uncensored-GPTQ --port 8080 --num-shard 2 --quantize gptq
pytest integration-tests/models/test_flash_llama_gptq.py -s
pytest integration-tests/models/test_flash_starcoder_gptq.py -s

This PR adds to TGI the mixed precision int4/fp16 kernels from the excellent exllama repo, that from my benchmark is much better than the implementations available in autogptq & gptq-for-llama.

On batch size 1, for starcoder with starcoder & GPTQ-4bit-no-actorder, we get a x2.1 speedup on the prefill over GPTQ-triton, and x1.8 speedup on the decode over GPTQ-triton. I'll have a look at the peak memory.

I verified locally that logits match.

Note that exllama implementation can not be used with act-order & tp rank>=2 for row tensor parallel linear, because exllama reorders weights ahead of runtime, requiring to reorder the activation as well (which are split on several GPUs for row parallel + TP rank>=2). In this specific case, we default to the trition implementation (that is much slower because reordering is done one the scales/zero points, and each weight row need to have its own specific scale/zero point).

Exllama implementation is specifically for n_bits = 4. Thus, for the other cases we fall back on the triton kernel.

Results on starcoder are as follow (TP rank = 2, A100, before vllm):

Parameter	Value
Model	Narsil/starcoder-gptq
Sequence Length	512
Decode Length	512
N Runs	10
Warmups	1
Temperature	None
Top K	None
Top P	None
Typical P	None
Repetition Penalty	None
Watermark	false
Do Sample	false

GPTQ (current):

Step	Batch Size	Average	Lowest	Highest	p50	p90	p99
Prefill	1	139.00 ms	138.31 ms	139.93 ms	139.06 ms	139.93 ms	139.93 ms
	4	522.17 ms	521.71 ms	522.57 ms	522.16 ms	522.57 ms	522.57 ms
	8	1016.23 ms	1015.78 ms	1016.87 ms	1016.10 ms	1016.87 ms	1016.87 ms
Decode (token)	1	36.19 ms	36.07 ms	36.54 ms	36.11 ms	36.54 ms	36.54 ms
	4	36.65 ms	36.57 ms	36.74 ms	36.64 ms	36.74 ms	36.74 ms
	8	36.72 ms	36.55 ms	36.94 ms	36.72 ms	36.94 ms	36.94 ms
Decode (total)	1	18491.31 ms	18433.56 ms	18671.24 ms	18451.02 ms	18671.24 ms	18671.24 ms
	4	18728.41 ms	18689.53 ms	18776.16 ms	18724.24 ms	18776.16 ms	18776.16 ms
	8	18762.84 ms	18678.29 ms	18875.27 ms	18763.06 ms	18875.27 ms	18875.27 ms

Step	Batch Size	Average	Lowest	Highest
Prefill	1	7.19 tokens/secs	7.15 tokens/secs	7.23 tokens/secs
	4	7.66 tokens/secs	7.65 tokens/secs	7.67 tokens/secs
	8	7.87 tokens/secs	7.87 tokens/secs	7.88 tokens/secs
Decode	1	27.64 tokens/secs	27.37 tokens/secs	27.72 tokens/secs
	4	109.14 tokens/secs	108.86 tokens/secs	109.37 tokens/secs
	8	217.88 tokens/secs	216.58 tokens/secs	218.86 tokens/secs

GPTQ-CUDA (exllama):

Step	Batch Size	Average	Lowest	Highest	p50	p90	p99
Prefill	1	65.49 ms	64.76 ms	65.81 ms	65.58 ms	65.81 ms	65.81 ms
	4	190.36 ms	189.25 ms	194.22 ms	190.25 ms	194.22 ms	194.22 ms
	8	350.45 ms	349.56 ms	353.83 ms	350.07 ms	353.83 ms	353.83 ms
Decode (token)	1	19.69 ms	18.90 ms	21.20 ms	18.99 ms	21.20 ms	21.20 ms
	4	30.51 ms	30.42 ms	30.64 ms	30.49 ms	30.58 ms	30.58 ms
	8	34.73 ms	34.56 ms	34.80 ms	34.76 ms	34.80 ms	34.80 ms
Decode (total)	1	10061.35 ms	9659.25 ms	10835.27 ms	9705.76 ms	10835.27 ms	10835.27 ms
	4	15592.28 ms	15547.03 ms	15659.89 ms	15582.29 ms	15626.84 ms	15626.84 ms
	8	17749.09 ms	17661.60 ms	17781.76 ms	17760.34 ms	17781.76 ms	17781.76 ms

Step	Batch Size	Average	Lowest	Highest
Prefill	1	15.27 tokens/secs	15.20 tokens/secs	15.44 tokens/secs
	4	21.01 tokens/secs	20.59 tokens/secs	21.14 tokens/secs
	8	22.83 tokens/secs	22.61 tokens/secs	22.89 tokens/secs
Decode	1	50.91 tokens/secs	47.16 tokens/secs	52.90 tokens/secs
	4	131.09 tokens/secs	130.52 tokens/secs	131.47 tokens/secs
	8	230.32 tokens/secs	229.90 tokens/secs	231.46 tokens/secs

Before submitting

Doc ==> There's no doc currently right?
Tests ==> Done for starcoder and llama

…eneration-inference into gptq-cuda-kernels

Narsil

Neat numbers !

I feel like gptq and gptq-cuda is not necessary here.

IIUC, both can run on the same weights (as you didn't change the conversion script).
Therefore, we could simply use exllama kernel whenever available (when g_idx is increasing).

That should simplify the codebase a lot.

Also nothing should be modified in model files, everything should be very agnostic to it, especially since the weights are exactly the same on disk.

This could also be explained in the gptq script (act-order, if True, more precise models, but slower inference because different kernels, if False, lower precision, but ffaster inference).

I still fail to understand why we cannot reorder on load to use exllama for act-order (since we can reslice at will in the original tensors, we could probably de-entagle g_idx again.

It's a lot more work certainly.

Narsil · 2023-07-06T18:50:18Z

server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py

+        # Buffers need to be persistent to avoid any bug.
+        self.buffers = {}
+        if config.quantize == "gptq-cuda":
+            max_dq_buffer_size = 0
+            for name, submodule in self.named_modules():
+                if isinstance(submodule, (TensorParallelColumnLinear, TensorParallelRowLinear)) and isinstance(submodule.linear, Ex4bitLinear):
+                    max_dq_buffer_size = max(max_dq_buffer_size, submodule.linear.qweight.numel() * 8)
+
+            intermediate_size = config.n_inner
+            max_seq_len = 2048  # TODO: we should be able to set it
+
+            self.buffers["temp_state"] = torch.zeros((max_seq_len, intermediate_size), dtype=torch.float16, device=weights.device)
+            self.buffers["temp_dq"] = torch.zeros((1, max_dq_buffer_size), dtype=torch.float16, device=weights.device)
+
+            prepare_buffers(weights.device, self.buffers["temp_state"], self.buffers["temp_dq"])
+
+            # TODO: ability to set them
+            matmul_recons_thd = 8
+            matmul_fused_remap = False
+            matmul_no_half2 = False
+            set_tuning_params(matmul_recons_thd, matmul_fused_remap, matmul_no_half2)
+
+            torch.cuda.empty_cache()
+


I think this should go directly in the loading part (within weights).
That ways it's truly agnostic to models.

I moved it to Model init. This requires to have access to model.config which is currently not defined though. There is model.transformer.config, or model.gpt_neox.config or model.model.config depending on the architecture. Is it intended that the config is not registered at the top level? @OlivierDehaene @Narsil

The thing is that the weights = Weights(...) call is in each model definition, and we need to have loaded all weights to determine the shapes of the buffers. Also, the buffers need to be persistent, while I think this weights object is not.

The buffers intend to be shared, no ?

So why not just have a single location for these buffers, use the pointer on every layer, and increase the size every time max_dq_buffer_size = max(max_dq_buffer_size, submodule.linear.qweight.numel() * 8) is larger ?

The issue with this (and any post loading treatment) is that you're now dealing with updating every single model file, any time of those line hits. This is what we had before and it was painful to maintain.

This seems to be used as globals let's just use them as globals. (They are temporary buffers IIUC preallocated to avoid reallocating them all the time)

server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py

server/text_generation_server/utils/layers.py

server/text_generation_server/utils/weights.py

server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py

fxmarty · 2023-07-12T18:45:20Z

For some reason the logprob slightly differ between different runs, there's a source of randomicity I've still yet not identified.

Edit: comes from the atomicAdd of the kernel - this is fine.

I'll add llama support in this PR too.

Narsil · 2023-07-14T15:37:17Z

atomicAdd + randomicity

This is very suspicious, really ? Isn't the purpose of atomicAdd to remove randomness by forcing access order ? :)
It may well be acceptable though :)

fxmarty · 2023-07-17T08:15:41Z

@Narsil I don't believe it is suspicious: https://forums.developer.nvidia.com/t/get-different-results-for-every-running-with-atomicadd/229649/2

See turboderp/exllama#153

Narsil · 2023-07-18T09:09:39Z

Ahhh that level of randomness ! :) I see, yeah totally legit source of "randomness".

fxmarty · 2023-07-19T18:34:55Z

Some tests up to date with main on llama 2 70b

@OlivierDehaene

Just trying to get the integration tests to pass. # What does this PR do?   Fixes # (issue) ## Before submitting - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/transformers/blob/main/CONTRIBUTING.md#start-contributing-pull-requests), Pull Request section? - [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/transformers/tree/main/docs), and [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? ## Who can review? Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.  --------- Co-authored-by: Felix Marty <9808326+fxmarty@users.noreply.github.com>

Narsil · 2023-07-21T09:00:25Z

Closing as superseeded by #666

fxmarty added 3 commits July 5, 2023 15:43

add exllama gptq kernel

ee7ba48

add attribution

c858d79

Merge branch 'main' into gptq-cuda-kernels

0ff8219

fxmarty requested a review from Narsil July 5, 2023 16:37

fxmarty added 3 commits July 5, 2023 16:42

some more cleanup

2272b3a

Merge branch 'gptq-cuda-kernels' of https://github.com/fxmarty/text-g…

620ed7d

…eneration-inference into gptq-cuda-kernels

try-catch to load the cuda extension, quite ugly practice tbh

a6e3874

Narsil reviewed Jul 6, 2023

View reviewed changes

server/text_generation_server/models/custom_modeling/flash_santacoder_modeling.py Outdated Show resolved Hide resolved

fxmarty added 7 commits July 12, 2023 15:43

have a single gptq quantization type

4462854

move exllama buffer init to the top level

67a46b7

cleanup

67d6876

support bits different than 4

f90c61a

tests

8645fd3

Merge branch 'main' into gptq-cuda-kernels

faa5b52

fix test

38c2be5

fxmarty requested a review from Narsil July 12, 2023 18:44

fxmarty added 2 commits July 13, 2023 10:38

fix tests

2ae65b4

support all, test llama

0036084

fxmarty requested a review from OlivierDehaene July 13, 2023 15:43

fxmarty added 2 commits July 13, 2023 17:45

Merge branch 'main' into gptq-cuda-kernels

9401e10

fix the usual merge mess

74e6d6e

Narsil mentioned this pull request Jul 14, 2023

GPTQ Formats that work (and don't) #601

Closed

fxmarty added 2 commits July 19, 2023 16:58

Merge branch 'main' into gptq-cuda-kernels

edfbfdf

fix per-column quantization

6bf7090

Narsil added 3 commits July 20, 2023 17:38

Refactored a bit.

0860394

Small polish.

8cf7c89

Give escape hatch to not use exllama kernels even if available.

7faef69

Narsil previously approved these changes Jul 20, 2023

View reviewed changes

Fixing GTPQ device santacoder.

900ac49

Narsil dismissed their stale review via 900ac49 July 20, 2023 19:08

Narsil added 4 commits July 20, 2023 19:56

Fix config.

12191b7

Add kernel target.

c6e702f

Separate build process.

3ec3add

Update starcoder_gptq

40be532

Narsil closed this Jul 21, 2023

fxmarty mentioned this pull request Jul 21, 2023

Add GPTQ Quantization huggingface/optimum#1216

Merged

5 tasks

fxmarty mentioned this pull request Jul 31, 2023

Add exllama q4 kernel AutoGPTQ/AutoGPTQ#219

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add exllama GPTQ CUDA kernel support #553

Add exllama GPTQ CUDA kernel support #553

fxmarty commented Jul 5, 2023 •

edited

Loading

Narsil left a comment

Narsil Jul 6, 2023

fxmarty Jul 12, 2023

Narsil Jul 12, 2023

fxmarty commented Jul 12, 2023 •

edited

Loading

Narsil commented Jul 14, 2023

fxmarty commented Jul 17, 2023

Narsil commented Jul 18, 2023

fxmarty commented Jul 19, 2023

Narsil commented Jul 21, 2023

Add exllama GPTQ CUDA kernel support #553

Add exllama GPTQ CUDA kernel support #553

Conversation

fxmarty commented Jul 5, 2023 • edited Loading

Before submitting

Narsil left a comment

Choose a reason for hiding this comment

Narsil Jul 6, 2023

Choose a reason for hiding this comment

fxmarty Jul 12, 2023

Choose a reason for hiding this comment

Narsil Jul 12, 2023

Choose a reason for hiding this comment

fxmarty commented Jul 12, 2023 • edited Loading

Narsil commented Jul 14, 2023

fxmarty commented Jul 17, 2023

Narsil commented Jul 18, 2023

fxmarty commented Jul 19, 2023

Narsil commented Jul 21, 2023

fxmarty commented Jul 5, 2023 •

edited

Loading

fxmarty commented Jul 12, 2023 •

edited

Loading