
Proposed changes to reduce VRAM usage. Potentially quantize larger models on consumer hardware. #269

Open
sigmareaver opened this issue Jun 25, 2023 · 3 comments


sigmareaver commented Jun 25, 2023

Hello everyone,

Recently I noticed a lack of 4-bit quantized versions of Google/flan-ul2 on HF, so I set out to quantize the model on my 4090.

I struggled with this for a bit due to my own oversights, but while debugging I found a section of code that runs quickly on the CPU, yet requires a significant memory allocation when run on the GPU. I therefore propose the following changes inside t5_sequential() in the t5.py script file:

        del layer
        del gptq 
        gc.collect()
        torch.cuda.empty_cache()

        inps, outs = outs, inps
        
    # do this part on CPU, because GPU runs out of memory
    dev = 'cpu'

    model.encoder.final_layer_norm = model.encoder.final_layer_norm.to(dev)
    model.encoder.dropout = model.encoder.dropout.to(dev)
    
    encoder_hidden_states = model.encoder.final_layer_norm(inps.cpu())
    encoder_hidden_states = model.encoder.dropout(encoder_hidden_states)
    
    model.encoder.final_layer_norm = model.encoder.final_layer_norm.cpu()
    model.encoder.dropout = model.encoder.dropout.cpu()

    dev = 'cuda:0'
    encoder_hidden_states = encoder_hidden_states.to(dev)
    inps = inps.to(dev)
    # end of CPU section

I'm not actually certain whether gc.collect() is strictly necessary, but it doesn't hurt. The real memory savings come from running model.encoder.final_layer_norm and model.encoder.dropout on the CPU. This, combined with --n_samples 256 and PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512, enabled me to quantize a 20B model on my 4090, which I otherwise would not have been able to do, at least not with that sample count.
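
For anyone reproducing this: PYTORCH_CUDA_ALLOC_CONF needs to be set before the first CUDA allocation, so either export it in the shell before launching t5.py or set it at the very top of the script. A rough sketch (not the exact code from my run):

    # Rough sketch: set the allocator option before torch ever touches the GPU.
    # Equivalent to exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 in the shell.
    import os
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

    import torch  # importing torch after setting the variable guarantees the allocator sees it
    print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])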

Note: using gc.collect() requires adding import gc somewhere in the file (preferably near the top).
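
To make the idea concrete outside of t5.py, the pattern boils down to something like the toy example below. A plain LayerNorm stands in for model.encoder.final_layer_norm and model.encoder.dropout; it's only an illustration, not the actual t5.py code.

    # Toy illustration of the pattern: run a memory-hungry module on the CPU,
    # then move only the result back to the GPU.
    import gc

    import torch
    import torch.nn as nn

    def apply_on_cpu(module: nn.Module, x: torch.Tensor) -> torch.Tensor:
        """Run `module` on a CPU copy of `x` and return the output on x's original device."""
        orig_device = x.device
        module = module.cpu()
        with torch.no_grad():
            out = module(x.cpu())
        gc.collect()              # drop dangling Python references first
        torch.cuda.empty_cache()  # then hand cached GPU blocks back to the driver
        return out.to(orig_device)

    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    norm = nn.LayerNorm(4096)
    inps = torch.randn(4, 128, 4096, device=device)
    hidden = apply_on_cpu(norm, inps)
    print(hidden.device, hidden.shape)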

@xXwatermelon

neat


sigjhl commented Jul 12, 2023

Hey, thank you for your flan-ul2 quant! I really appreciate it. It's a very useful model, but unfortunately overlooked amid the flood of llamas.

Seeing your work, I tried to quantize flan-t5-xxl, but the resulting file outputs nonsense. Also, while quantizing, the error reported per layer seems large (in the 10s to 100s), though I'm not familiar with the normal range. I tried to monkey-patch here and there, but as a programming novice I'm at my wit's end.

Can you please point me in the right direction, or share your modifications to the t5 branch?

@sigmareaver (Author)

Aside from the change described in this issue, I've made no other changes. I don't really know why your quantized model isn't working, but according to the config.json files, flan-ul2 is bfloat16 while flan-t5-xxl is float32. I'm not sure whether that makes a difference, but it may. I also don't know which flan-t5-xxl files you downloaded; it looks like the repo offers several formats. I'd recommend getting only the .bin files and double-checking their SHA256 hashes against what is listed on HF. Sorry I couldn't be more help.
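
If it helps, this is roughly what I mean by checking the hashes; the filename below is just a placeholder for whichever shard you downloaded:

    # Rough sketch of the SHA256 check: hash a downloaded .bin shard and compare
    # the result against the checksum shown on the file's HF page.
    import hashlib

    def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256_of("pytorch_model-00001-of-00005.bin"))  # placeholder shard name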
