
VLM Support via GPTQ Hooks and Data Pipelines #914

Open · wants to merge 312 commits into main

Conversation

@kylesayrs (Collaborator) commented Nov 13, 2024

Purpose

  • Enable oneshot quantization of vision-language models

[Images: VLM banner; Llama-3.2-Vision model graph (Graphviz)]

Related Issues

Prerequisites

Changes

VLM Support

  • Add multimodal examples in examples/multimodal_vision
  • Modify custom_offload_device_map to support models whose classes do not follow the XForCausalLM naming pattern
  • Add custom data collators for VLM models in src/llmcompressor/transformers/utils/data_collator.py
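As a rough illustration of what a VLM data collator in data_collator.py might do, here is a dependency-free sketch. The function name, argument shapes, and use of plain lists are illustrative assumptions, not the actual implementation (which would stack torch tensors):

```python
# Hypothetical sketch of a VLM data collator -- names and structure are
# illustrative, not llmcompressor's actual code. Plain lists stand in
# for torch tensors to keep the sketch dependency-free.

def vlm_data_collator(features, pad_token_id=0):
    """Collate text tokens (padded to a common length) alongside
    per-sample vision inputs, which are kept grouped as-is."""
    max_len = max(len(f["input_ids"]) for f in features)
    input_ids, attention_mask = [], []
    for f in features:
        pad = max_len - len(f["input_ids"])
        input_ids.append(f["input_ids"] + [pad_token_id] * pad)
        attention_mask.append([1] * len(f["input_ids"]) + [0] * pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        # vision inputs often have model-specific shapes, so they are
        # grouped rather than padded here
        "pixel_values": [f["pixel_values"] for f in features],
    }

batch = vlm_data_collator([
    {"input_ids": [1, 2, 3], "pixel_values": "img0"},
    {"input_ids": [4, 5], "pixel_values": "img1"},
])
```

The key point the sketch captures is that text inputs are padded to a rectangular batch while vision inputs are passed through grouped, since their shapes vary per model.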

GPTQModifier

  • Implement hooks-based compression in GPTQModifier
    • This replaces layer-compressor, which made many assumptions about model architecture
    • This also enables finer-grained sequential compression such as true_sequential
    • Functions previously implemented in gptq_wrapper.py are now implemented in gptq_quantize.py
  • Implement offload_hessians parameter in GPTQModifier
  • Implement data-pipelines-based calibration in GPTQModifier
    • First, an attempt is made to trace the model and run the sequential pipeline
    • If that fails, assumptions are made about the model architecture and the layer_sequential pipeline is attempted
      • This preserves backwards compatibility with all previously supported models
    • If that also fails, the basic pipeline is used, which is guaranteed to run but may require enabling offload_hessians
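The fallback order above can be sketched as a simple try-in-order loop. The function and pipeline names here are stand-ins, not llmcompressor's API:

```python
# Illustrative sketch of the pipeline fallback order described above.
# run_calibration and the pipeline callables are hypothetical.

def run_calibration(model, dataloader, pipelines):
    """Try pipelines in order: sequential -> layer_sequential -> basic.
    The basic pipeline is assumed to always succeed."""
    errors = {}
    for name in ("sequential", "layer_sequential", "basic"):
        try:
            return name, pipelines[name](model, dataloader)
        except Exception as err:  # e.g. a tracing failure
            errors[name] = err
    raise RuntimeError(f"all pipelines failed: {errors}")

def _fail(message):
    raise RuntimeError(message)

# Example: tracing fails, layer-sequential assumptions fail, basic runs
pipelines = {
    "sequential": lambda m, d: _fail("model is untraceable"),
    "layer_sequential": lambda m, d: _fail("unexpected architecture"),
    "basic": lambda m, d: "calibrated",
}
used, result = run_calibration(None, None, pipelines)
```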

Data Pipelines

  • Implement the basic skeletons of data pipelines, which are subject to change when data pipelines are pulled out of modifiers
  • Basic Pipeline
    • Performs standard forward passes through the model with provided dataloader
    • Used as fallback, as well as in the future for basic calibration passes
  • Layer Sequential Pipeline
    • Refactor of LayerCompressor as a straightforward data pipeline
    • Uses IntermediatesCache to handle activation offloading
  • Sequential Pipeline
    • Uses torch.fx graph tracing to determine where sequential targets (layers) exist in the graph and what their inputs and outputs are
    • Implements BFS algorithm to assign nodes to partitions
      • An ideal implementation consolidates partition indices to assign each node to the latest possible partition, delaying execution. The current implementation addresses the most common case (node.op == get_attr)
    • Each partition (Subgraph) is compiled as an executable python function with the proper inputs and outputs
    • Uses IntermediatesCache to handle activation offloading
  • Implement IntermediatesCache, which automatically handles offloading and onloading of activations across batches
    • This class is capable of offloading many non-standard activation types, such as tuples and dataclasses like BaseModelOutputWithPast
    • For convenience, the class also handles applying padding masks
    • The class is tested in tests/llmcompressor/pipelines/test_cache.py
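A minimal sketch of the IntermediatesCache idea follows. The real class offloads torch tensors to CPU; here "offloading" is simulated by recursively walking values, just to show the per-batch fetch/update interface and the recursion into nested containers (tuples, dataclasses). The method names mirror the description above but are assumptions about the interface:

```python
# Simplified, hypothetical sketch of an IntermediatesCache-style store.
import dataclasses

class IntermediatesCache:
    def __init__(self):
        self._store = {}  # batch_index -> {name: value}

    def _offload(self, value):
        # Recurse into tuples and dataclasses, mirroring how the real
        # cache must walk non-standard activation types. A torch
        # implementation would move tensors to CPU at the leaves.
        if isinstance(value, tuple):
            return tuple(self._offload(v) for v in value)
        if dataclasses.is_dataclass(value):
            return dataclasses.replace(value, **{
                f.name: self._offload(getattr(value, f.name))
                for f in dataclasses.fields(value)
            })
        return value

    def update(self, batch_index, values):
        self._store.setdefault(batch_index, {}).update(
            {k: self._offload(v) for k, v in values.items()})

    def fetch(self, batch_index, names):
        return {n: self._store[batch_index][n] for n in names}

cache = IntermediatesCache()
cache.update(0, {"hidden_states": (1, 2), "position_ids": [0, 1]})
fetched = cache.fetch(0, ["hidden_states"])
```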

Tracing

  • In order to support sequential quantization across the wide variety of multimodal model architectures, some model definitions have to be altered to support tracing
    • If the calibration dataset is text-only, most LLMs and VLMs are traceable without additional work. Multimodal calibration datasets are more likely to require additional work to make the model traceable
    • For many VLMs (but not all), the vision tower is not traceable without significant work. However, this only affects sequential error propagation and (minimally?) increases memory usage, which leaves the door open for future support for quantizing modules in the vision tower
  • Add traceable model definitions for llava, mistral, mllama, and glm
  • All copyright licenses allow for alteration and redistribution; the line # vllm-project: no copyright was added in the same style as text_generation.py
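For readers unfamiliar with why model definitions need altering: symbolic tracers such as torch.fx cannot follow Python control flow that branches on tensor values, so a traceable definition typically replaces such branches with value-level arithmetic. The sketch below uses plain floats in place of tensors so it runs anywhere; the transformation shown is illustrative, not a specific change made in this PR:

```python
# Hedged illustration of the kind of rewrite a traceable model
# definition makes. Plain floats stand in for tensors.

def forward_untraceable(x, bias):
    # Data-dependent branch: a symbolic tracer sees only a Proxy here
    # and cannot decide which path to record.
    if x > 0:
        return x + bias
    return x

def forward_traceable(x, bias):
    # Same result expressed as arithmetic on the value itself, which a
    # tracer can record as ordinary ops. In torch this would be
    # something like (x > 0).to(x.dtype) used as a mask.
    mask = float(x > 0)
    return x + bias * mask
```

Both functions compute the same thing, but only the second produces a static computation graph under tracing.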

Miscellaneous

  • Slight performance improvement to apply_pad_mask_to_batch
  • Support inhomogeneous GPUs in custom_offload_device_map
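For context on the pad-mask utility mentioned above, here is an illustrative sketch of what applying a pad mask to a batch entails. The function name is borrowed from the PR, but the list-based implementation below is a stand-in for the real tensor version:

```python
# Illustrative sketch only -- the real apply_pad_mask_to_batch operates
# on torch tensors; plain lists are used here for clarity.

def apply_pad_mask_to_batch(input_ids, attention_mask, pad_value=0):
    """Zero out token ids at padded positions so padding cannot leak
    into calibration statistics."""
    return [
        [tok if keep else pad_value for tok, keep in zip(row, mask_row)]
        for row, mask_row in zip(input_ids, attention_mask)
    ]

masked = apply_pad_mask_to_batch(
    [[5, 6, 7], [8, 9, 99]],
    [[1, 1, 1], [1, 1, 0]],
)
```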

Future Work / Follow-ups

Evaluations

| Model | Dataset | Runtime | Winogrande |
|---|---|---|---|
| Llama-3-8B | ultrachat | 43m, 2xA4000 | 0.7545 |
| Llama-3-70B | ultrachat | 303m, 1xH100 | 0.8216 |
| Mixtral | ultrachat | | |
| openbmb/MiniCPM3-4B | ultrachat | 63m, 1xA100 | 0.6701 |
| Qwen2-VL-2B-Instruct | ultrachat | 12m, 2xA4000 | 0.6188 |
| Qwen2-VL-2B-Instruct | flickr | 24m, 2xA4000 | 0.6093 |
| Llama-3.2-11B-Vision-Instruct | flickr | 75m, 1xA100 | 0.7837 |
| Pixtral-12B-2409 | flickr | 52m, 1xA100 | 0.7924 |
| llava-1.5-7b-hf | flickr | 15m, 1xH100 | 0.7214 |
| Phi-3-vision-128k-instruct | flickr | | |

```
lm_eval --model vllm --model_args pretrained="path/to/model",dtype=auto,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True --tasks winogrande --num_fewshot 5 --batch_size 32

lm_eval --model vllm --model_args pretrained="path/to/model",dtype=bfloat16,max_model_len=4096,tensor_parallel_size=1,gpu_memory_utilization=0.8,enforce_eager=True,add_bos_token=True,max_num_seqs=1 --tasks winogrande --num_fewshot 5 --batch_size 1
```

Testing

@kylesayrs kylesayrs changed the base branch from main to kylesayrs/cleanup-custom-dataset December 23, 2024 20:46
Base automatically changed from kylesayrs/cleanup-custom-dataset to main December 24, 2024 01:59
@dsikka (Collaborator) commented Dec 28, 2024

Is this ready for review? If so, can we update the PR description, address the failing checks, and share testing results?

@kylesayrs changed the title from "VLM Support via GPTQ Hooks and Sequential Data Pipeline" to "VLM Support via GPTQ Hooks and Data Pipelines" Dec 31, 2024
@kylesayrs (Collaborator, Author) commented Jan 2, 2025

@dsikka The PR description has been updated and the PR is ready for review. I will continue to add model evaluations to the description as they are completed.

@kylesayrs kylesayrs requested review from mgoin and dsikka January 2, 2025 04:00