Per Layer Streaming Quantization #655

msaroufim · 2024-08-12T00:35:20Z

A tradeoff users have often complained, most recently @aredden about is either they

quantize on CPU and then push the model to GPU -> Slow quantization but VRAM efficient
Push to model to GPU and then quantize on GPU -> Fast quantization but needs lots of VRAM

Instead we could have a utility that sends one layer at a time to the gpu, quantizes it and then sends in a new layer synchronously. Granted this workflow seems to interact in a clunky way with torch.compile where we don't compile things layer wise and generally expect the model to be on the device where its compiled

The text was updated successfully, but these errors were encountered:

aredden · 2024-08-12T01:27:40Z

This would be great, since currently if I want to run a large model, and cannot load it onto my device without running into an OOM- I have to individually push weights to device, compile their modules, and then quantize them. Though yeah, it would be great if there were a way to iteratively push, compile and then quantize, since otherwise I miss out on possible optimizations.

gau-nernst · 2024-08-12T02:02:24Z

Quantization on CPU can be significantly faster if we use torch.compile(). However, from previous discussions, we don't want to do that as it is harder to debug #315 (comment).

We can explore compiling the function that performs quantization. Does it work? If it works, does it actually save time, due to first compile overhead, as well as potential re-compiles?

gau-nernst · 2024-08-13T03:48:18Z

Was looking through Llama model code in torchao and came across nn.Module._register_load_state_dict_pre_hook(). I rmb last time I had some ideas utilizing this for memory-efficient quantization too. Something like (haven't tested):

with torch.device("meta"):
    model = ...
    quantize_(model, ...)
model.to_empty(device="cuda")  # materialize quantized model on CUDA

def hook(state_dict, prefix):
    # move original weight to CUDA and quantize it here
    ...

handles = []
for m in model.modules():
    if isinstance(m, nn.Linear):
        handles.append(m._register_load_state_dict_pre_hook(hook))

# weight is quantized on load
model.load_state_dict(state_dict)

# remove hooks
for handle in handles:
    handle.remove()

gau-nernst · 2024-08-13T12:23:37Z

On a second thought, it doesn't need to be that complicated.. Because we are already iterating over each module when doing quantization, we can simply add a device argument, and move the matching module (i.e. nn.Linear) to the desired device (i.e. CUDA) before applying the quantization function.

ao/torchao/quantization/quant_api.py

Lines 171 to 173 in e7fc0ed

    
           if filter_fn(model, cur_fqn[:-1]): 
        
               model = replacement_fn(model) 
        
               return model

This means that the full original model is in CPU RAM. The convoluted solution in my previous reply can be used with mmap to further reduce RAM usage (i.e. don't materialize full model in CPU RAM).

msaroufim added the good first issue Good for newcomers label Aug 12, 2024

gau-nernst mentioned this issue Aug 17, 2024

Add option to move param to device before quantization #699

Merged

msaroufim closed this as completed in #699 Aug 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per Layer Streaming Quantization #655

Per Layer Streaming Quantization #655

msaroufim commented Aug 12, 2024 •

edited

Loading

aredden commented Aug 12, 2024

gau-nernst commented Aug 12, 2024

gau-nernst commented Aug 13, 2024 •

edited

Loading

gau-nernst commented Aug 13, 2024

Per Layer Streaming Quantization #655

Per Layer Streaming Quantization #655

Comments

msaroufim commented Aug 12, 2024 • edited Loading

aredden commented Aug 12, 2024

gau-nernst commented Aug 12, 2024

gau-nernst commented Aug 13, 2024 • edited Loading

gau-nernst commented Aug 13, 2024

msaroufim commented Aug 12, 2024 •

edited

Loading

gau-nernst commented Aug 13, 2024 •

edited

Loading