
VLM Support via GPTQ Hooks and Data Pipelines #914

Open · wants to merge 312 commits into main

Conversation

@kylesayrs (Collaborator) commented Nov 13, 2024

Purpose

  • Enable oneshot quantization of vision-language models

[Images: VLM banner; Llama-3.2-Vision model graph (Graphviz)]

Related Issues

Prerequisites

Changes

VLM Support

  • Add multimodal examples in examples/multimodal_vision
  • Modify custom_offload_device_map to support models whose classes do not follow the XForCausalLM naming pattern
  • Add custom data collators for VLM models in src/llmcompressor/transformers/utils/data_collator.py
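As a rough illustration of what a VLM data collator in data_collator.py might do, here is a dependency-free sketch. The function name, argument shapes, and use of plain lists are illustrative assumptions, not the actual implementation (which would stack torch tensors):

```python
# Hypothetical sketch of a VLM data collator -- names and structure are
# illustrative, not llmcompressor's actual code. Plain lists stand in
# for torch tensors to keep the sketch dependency-free.

def vlm_data_collator(features, pad_token_id=0):
    """Collate text tokens (padded to a common length) alongside
    per-sample vision inputs, which are kept grouped as-is."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * pad)
        attention_mask.append([1] * len(f["input_ids"]) + [0] * pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # vision inputs often have model-specific shapes, so they are
        # grouped rather than padded here
        "pixel_values": [f["pixel_values"] for f in features],
    }

batch = vlm_data_collator([
    {"input_ids": [1, 2, 3], "pixel_values": "img0"},
    {"input_ids": [4, 5], "pixel_values": "img1"},
])
```

The key point the sketch captures is that text inputs are padded to a rectangular batch while vision inputs are passed through grouped, since their shapes vary per model.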

GPTQModifier

  • Implement hooks-based compression in GPTQModifier
    • This replaces layer-compressor, which made many assumptions about model architecture
    • This also enables finer-grained sequential compression such as true_sequential
    • Functions previously implemented in gptq_wrapper.py are now implemented in gptq_quantize.py
  • Implement offload_hessians parameter in GPTQModifier
  • Implement data-pipelines-based calibration in GPTQModifier
    • First, an attempt is made to trace the model and run the sequential pipeline
    • If that fails, assumptions are made about the model architecture and the layer_sequential pipeline is attempted
      • This preserves backwards compatibility with all previously supported models
    • If that also fails, the basic pipeline is used, which is guaranteed to run but may require enabling offload_hessians
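The fallback order above can be sketched as a simple try-in-order loop. The function and pipeline names here are stand-ins, not llmcompressor's API:

```python
# Illustrative sketch of the pipeline fallback order described above.
# run_calibration and the pipeline callables are hypothetical.

def run_calibration(model, dataloader, pipelines):
    """Try pipelines in order: sequential -> layer_sequential -> basic.
    The basic pipeline is assumed to always succeed."""
    errors = {}
    for name in ("sequential", "layer_sequential", "basic"):
        try:
            return name, pipelines[name](model, dataloader)
        except Exception as err:  # e.g. a tracing failure
            errors[name] = err
    raise RuntimeError(f"all pipelines failed: {errors}")

def _fail(message):
    raise RuntimeError(message)

# Example: tracing fails, layer-sequential assumptions fail, basic runs
pipelines = {
    "sequential": lambda m, d: _fail("model is untraceable"),
    "layer_sequential": lambda m, d: _fail("unexpected architecture"),
    "basic": lambda m, d: "calibrated",
}
used, result = run_calibration(None, None, pipelines)
```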

Data Pipelines

  • Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
  • Basic Pipeline
    • Performs standard forward passes through the model with provided dataloader
    • Used as fallback, as well as in the future for basic calibration passes
  • Layer Sequential Pipeline
    • Refactor of LayerCompressor as a straightforward data pipeline
    • Uses IntermediatesCache to handle activation offloading
  • Sequential Pipeline
    • Uses torch.fx graph tracing to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
    • Implements BFS algorithm to assign nodes to partitions
      • An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
    • Each partition (Subgraph) is compiled as an executable python function with the proper inputs and outputs
    • Uses IntermediatesCache to handle activation offloading
  • Implement IntermediatesCache, which automatically handles offloading and onloading of activations across batches
    • This class is capable of offloading many non-standard activation types, such as tuples and dataclasses like BaseModelOutputWithPast
    • For convenience, the class also handles applying padding masks
    • The class is tested in tests/llmcompressor/pipelines/test_cache.py
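A minimal sketch of the IntermediatesCache idea follows. The real class offloads torch tensors to CPU; here "offloading" is simulated by recursively walking values, just to show the per-batch fetch/update interface and the recursion into nested containers (tuples, dataclasses). The method names mirror the description above but are assumptions about the interface:

```python
# Simplified, hypothetical sketch of an IntermediatesCache-style store.
import dataclasses

class IntermediatesCache:
    def __init__(self):
        self._store = {}  # batch_index -> {name: value}

    def _offload(self, value):
        # Recurse into tuples and dataclasses, mirroring how the real
        # cache must walk non-standard activation types. A torch
        # implementation would move tensors to CPU at the leaves.
        if isinstance(value, tuple):
            return tuple(self._offload(v) for v in value)
        if dataclasses.is_dataclass(value):
            return dataclasses.replace(value, **{
                f.name: self._offload(getattr(value, f.name))
                for f in dataclasses.fields(value)
            })
        return value

    def update(self, batch_index, values):
        self._store.setdefault(batch_index, {}).update(
            {k: self._offload(v) for k, v in values.items()})

    def fetch(self, batch_index, names):
        return {n: self._store[batch_index][n] for n in names}

cache = IntermediatesCache()
cache.update(0, {"hidden_states": (1, 2), "position_ids": [0, 1]})
fetched = cache.fetch(0, ["hidden_states"])
```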

Tracing

  • In order to support sequential quantization across the wide variety of multimodal model architectures, some model definitions have to be altered to support tracing
    • If the calibration dataset is text-only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
    • For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimally?) increases memory usage, which leaves the door open for future support for quantizing modules in the vision tower
  • Add traceable model definitions for llava, mistral, mllama, and glm
  • All copyright licenses allow for alteration and redistribution; the line # vllm-project: no copyright was added in the same style as text_generation.py
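For readers unfamiliar with why model definitions need altering: symbolic tracers such as torch.fx cannot follow Python control flow that branches on tensor values, so a traceable definition typically replaces such branches with value-level arithmetic. The sketch below uses plain floats in place of tensors so it runs anywhere; the transformation shown is illustrative, not a specific change made in this PR:

```python
# Hedged illustration of the kind of rewrite a traceable model
# definition makes. Plain floats stand in for tensors.

def forward_untraceable(x, bias):
    # Data-dependent branch: a symbolic tracer sees only a Proxy here
    # and cannot decide which path to record.
    if x > 0:
        return x + bias
    return x

def forward_traceable(x, bias):
    # Same result expressed as arithmetic on the value itself, which a
    # tracer can record as ordinary ops. In torch this would be
    # something like (x > 0).to(x.dtype) used as a mask.
    mask = float(x > 0)
    return x + bias * mask
```

Both functions compute the same thing, but only the second produces a static computation graph under tracing.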

Miscellaneous

  • Slight performance improvement to apply_pad_mask_to_batch
  • Support inhomogeneous GPUs in custom_offload_device_map
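For context on the pad-mask utility mentioned above, here is an illustrative sketch of what applying a pad mask to a batch entails. The function name is borrowed from the PR, but the list-based implementation below is a stand-in for the real tensor version:

```python
# Illustrative sketch only -- the real apply_pad_mask_to_batch operates
# on torch tensors; plain lists are used here for clarity.

def apply_pad_mask_to_batch(input_ids, attention_mask, pad_value=0):
    """Zero out token ids at padded positions so padding cannot leak
    into calibration statistics."""
    return [
        [tok if keep else pad_value for tok, keep in zip(row, mask_row)]
        for row, mask_row in zip(input_ids, attention_mask)
    ]

masked = apply_pad_mask_to_batch(
    [[5, 6, 7], [8, 9, 99]],
    [[1, 1, 1], [1, 1, 0]],
)
```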

Future Work / Follow-ups

Evaluations

| Model | Dataset | Runtime | Winogrande |
|---|---|---|---|
| Llama-3-8B | ultrachat | 43m, 2xA4000 | 0.7545 |
| Llama-3-70B | ultrachat | 303m, 1xH100 | 0.8216 |
| Mixtral | ultrachat | | |
| openbmb/MiniCPM3-4B | ultrachat | 63m, 1xA100 | 0.6701 |
| Qwen2-VL-2B-Instruct | ultrachat | 12m, 2xA4000 | 0.6188 |
| Qwen2-VL-2B-Instruct | flickr | 24m, 2xA4000 | 0.6093 |
| Llama-3.2-11B-Vision-Instruct | flickr | 75m, 1xA100 | 0.7837 |
| Pixtral-12B-2409 | flickr | 52m, 1xA100 | 0.7924 |
| llava-1.5-7b-hf | flickr | 15m, 1xH100 | 0.7214 |
| Phi-3-vision-128k-instruct | flickr | | |

```
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32

lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```

Testing

@kylesayrs kylesayrs changed the base branch from main to kylesayrs/cleanup-custom-dataset December 23, 2024 20:46
Base automatically changed from kylesayrs/cleanup-custom-dataset to main December 24, 2024 01:59
@dsikka (Collaborator) commented Dec 28, 2024

Is this ready for review? If so, can we update the PR description, address the failing checks, and share testing results?

@kylesayrs changed the title from "VLM Support via GPTQ Hooks and Sequential Data Pipeline" to "VLM Support via GPTQ Hooks and Data Pipelines" Dec 31, 2024
@kylesayrs (Collaborator, Author) commented Jan 2, 2025

@dsikka The PR description has been updated and the PR is ready for review. I will continue to add model evaluations to the description as they are completed.

@kylesayrs kylesayrs requested review from mgoin and dsikka January 2, 2025 04:00