Commit

Merge remote-tracking branch 'origin/main' into CSY/new-ci
# Conflicts:
#	.github/workflows/unit_tests.yml
CSY-ModelCloud committed Dec 24, 2024
2 parents 34e4ff2 + 4197cd8 commit 5d0fc04
Showing 9 changed files with 51 additions and 34 deletions.
12 changes: 7 additions & 5 deletions README.md
@@ -9,18 +9,20 @@
</p>

## News
* 12/23/2024 [1.5.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.0): Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
* 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization VRAM usage.
* 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): macOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey-patch `patch_vllm()` and `patch_hf()` APIs added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely during quantization (see the sketch below the news list). Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated saving in the `Marlin` weight format since `Marlin` supports auto conversion of the `gptq` format to `Marlin` at runtime.

* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Fixed Transformers compat for model sharding caused by an API deprecation in HF. Removed the hard `triton` dependency; the Triton kernel now optionally depends on the `triton` package.
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced `gputil` + `psutil` depends with `device-smi`. Fixed model unit tests.
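
For readers unfamiliar with the `dynamic` matching described in the 1.4.0 note above, a minimal sketch follows. The regex keys, override values, and paths are illustrative placeholders and not part of this commit; the `QuantizeConfig(dynamic=...)` shape is assumed from the notes above.

```
from gptqmodel import GPTQModel, QuantizeConfig

# `dynamic` maps module-matching regexes to per-module overrides.
# `+:` (the default) applies the overrides to matched modules;
# `-:` excludes matched modules from quantization entirely.
dynamic = {
    r"+:model\.layers\.[0-9]\..*": {"bits": 8, "group_size": 32},  # layers 0-9 at higher precision
    r"-:model\.layers\..*\.mlp\.gate_proj": {},                    # skip gate projections
}

quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)

# Placeholder model path and a tiny placeholder calibration list; use a real
# calibration dataset (hundreds of rows) for meaningful results.
model = GPTQModel.load("path/to/base-model", quant_config)
model.quantize(["gptqmodel is an easy-to-use model quantization library."])
model.save("path/to/quantized-model")
```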

<details>

<summary>Archived News:</summary>
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced `gputil` + `psutil` depends with `device-smi`. Fixed model unit tests.

* 11/11/2024 🚀 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.

* 10/29/2024 🚀 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
@@ -78,7 +80,7 @@ Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on-p
* 🚀 40% faster `packing` stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).

## Quality: GPTQModel 4bit can match BF16:
🤗 [ModelCloud quantized ultra-high recovery vortex-series models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
🤗 [ModelCloud quantized Vortex models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)

![image](https://github.com/user-attachments/assets/7b2db012-b8af-4d19-a25d-7023cef19220)

@@ -183,8 +185,8 @@ GPTQModel inference is integrated into both [lm-eval](https://github.com/Eleuthe
We highly recommend avoiding `ppl` and instead using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.

```
# gptqmodel is integrated into lm-eval >= v0.4.6
pip install lm-eval>=0.4.6
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7
```
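
Below is a minimal sketch of driving `lm-eval` from Python against a quantized checkpoint. The model path and task are placeholders, and the flow is an assumption based on the integration note above rather than the README's own example.

```
import lm_eval

# "hf" selects lm-eval's Hugging Face loader; the assumption here is that,
# with gptqmodel installed, GPTQ-format checkpoints load through the
# Transformers integration.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/quantized-model,dtype=auto",  # placeholder path
    tasks=["arc_challenge"],
)
print(results["results"])
```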

6 changes: 3 additions & 3 deletions gptqmodel/models/base.py
@@ -101,7 +101,7 @@ def __init__(
qlinear_kernel: nn.Module = None,
load_quantized_model: bool = False,
trust_remote_code: bool = False,
model_id_or_path: str = None,
model_local_path: str = None,
):
super().__init__()

@@ -114,7 +114,7 @@ def __init__(
# compat: state to assist in checkpoint_format gptq(v1) to gptq_v2 conversion
self.qlinear_kernel = qlinear_kernel
self.trust_remote_code = trust_remote_code
self.model_id_or_path = model_id_or_path
self.model_local_path = model_local_path
# stores all per-layer quant stats such as avg loss and processing time
self.quant_log = []

@@ -774,7 +774,7 @@ def save(
):
extra_json_file_names = ["preprocessor_config.json", "chat_template.json"]
for name in extra_json_file_names:
json_path = os.path.join(self.model_id_or_path, name)
json_path = os.path.join(self.model_local_path, name)
if os.path.exists(json_path):
os.makedirs(save_dir, exist_ok=True)

4 changes: 2 additions & 2 deletions gptqmodel/models/definitions/qwen2_vl.py
@@ -88,10 +88,10 @@ def prepare_dataset(
import json

if tokenizer is None:
tokenizer = AutoTokenizer.from_pretrained(self.model_id_or_path)
tokenizer = AutoTokenizer.from_pretrained(self.model_local_path)

with tempfile.TemporaryDirectory() as tmp_dir:
chat_template_file = os.path.join(self.model_id_or_path, "chat_template.json")
chat_template_file = os.path.join(self.model_local_path, "chat_template.json")
if os.path.exists(chat_template_file):
shutil.copyfile(chat_template_file, os.path.join(tmp_dir, "chat_template.json"))
tokenizer.save_pretrained(tmp_dir)
45 changes: 30 additions & 15 deletions gptqmodel/models/loader.py
@@ -1,5 +1,6 @@
from __future__ import annotations

import os
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional, Union

@@ -36,6 +37,7 @@
verify_sharded_model_hashes,
)
from ._const import DEVICE, SUPPORTED_MODELS, normalize_device
from huggingface_hub import snapshot_download


logger = setup_logger()
@@ -73,17 +75,25 @@ def compare_versions(installed_version, required_version, operator):
raise ValueError(f"Unsupported operator: {operator}")


def check_versions(model_id_or_path: str, requirements: List[str]):
def check_versions(model_class, requirements: List[str]):
if requirements is None:
return
for req in requirements:
pkg, operator, version_required = parse_requirement(req)
try:
installed_version = version(pkg)
if not compare_versions(installed_version, version_required, operator):
raise ValueError(f"{model_id_or_path} requires version {req}, but current {pkg} version is {installed_version} ")
raise ValueError(f"{model_class} requires version {req}, but current {pkg} version is {installed_version} ")
except PackageNotFoundError:
raise ValueError(f"{model_id_or_path} requires version {req}, but {pkg} not installed.")
raise ValueError(f"{model_class} requires version {req}, but {pkg} not installed.")


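# Resolve the model reference to a local path: existing directories are
# returned as-is; anything else is treated as a Hub repo id and fetched via
# huggingface_hub.snapshot_download(), which returns the local cache path.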
def get_model_local_path(pretrained_model_id_or_path, **kwargs):
is_local = os.path.isdir(pretrained_model_id_or_path)
if is_local:
return pretrained_model_id_or_path
else:
return snapshot_download(pretrained_model_id_or_path, **kwargs)


def ModelLoader(cls):
Expand All @@ -106,7 +116,9 @@ def from_pretrained(
f"{pretrained_model_id_or_path} requires trust_remote_code=True. Please set trust_remote_code=True to load this model."
)

check_versions(pretrained_model_id_or_path, cls.require_pkgs_version)
check_versions(cls, cls.require_pkgs_version)

model_local_path = get_model_local_path(pretrained_model_id_or_path, **model_init_kwargs)

def skip(*args, **kwargs):
pass
@@ -117,7 +129,7 @@ def skip(*args, **kwargs):

model_init_kwargs["trust_remote_code"] = trust_remote_code

config = AutoConfig.from_pretrained(pretrained_model_id_or_path, **model_init_kwargs)
config = AutoConfig.from_pretrained(model_local_path, **model_init_kwargs)

if torch_dtype is None or torch_dtype == "auto":
torch_dtype = auto_dtype_from_config(config)
Expand All @@ -130,7 +142,7 @@ def skip(*args, **kwargs):
if config.model_type not in SUPPORTED_MODELS:
raise TypeError(f"{config.model_type} isn't supported yet.")

model = cls.loader.from_pretrained(pretrained_model_id_or_path, **model_init_kwargs)
model = cls.loader.from_pretrained(model_local_path, **model_init_kwargs)

model_config = model.config.to_dict()
seq_len_keys = ["max_position_embeddings", "seq_length", "n_positions", "multimodal_max_length"]
@@ -149,7 +161,7 @@ def skip(*args, **kwargs):
quantized=False,
quantize_config=quantize_config,
trust_remote_code=trust_remote_code,
model_id_or_path=pretrained_model_id_or_path
model_local_path=model_local_path,
)

cls.from_pretrained = from_pretrained
@@ -189,7 +201,9 @@ def from_quantized(
f"{model_id_or_path} requires trust_remote_code=True. Please set trust_remote_code=True to load this model."
)

check_versions(model_id_or_path, cls.require_pkgs_version)
check_versions(cls, cls.require_pkgs_version)

model_local_path = get_model_local_path(model_id_or_path, **kwargs)

# Parameters related to loading from Hugging Face Hub
cache_dir = kwargs.pop("cache_dir", None)
@@ -217,7 +231,7 @@ def from_quantized(

# == step1: prepare configs and file names == #
config: PretrainedConfig = AutoConfig.from_pretrained(
model_id_or_path,
model_local_path,
trust_remote_code=trust_remote_code,
**cached_file_kwargs,
)
@@ -231,7 +245,7 @@ def from_quantized(
if config.model_type not in SUPPORTED_MODELS:
raise TypeError(f"{config.model_type} isn't supported yet.")

quantize_config = QuantizeConfig.from_pretrained(model_id_or_path, **cached_file_kwargs, **kwargs)
quantize_config = QuantizeConfig.from_pretrained(model_local_path, **cached_file_kwargs, **kwargs)

if backend == BACKEND.VLLM or backend == BACKEND.SGLANG:
if quantize_config.format != FORMAT.GPTQ:
@@ -240,7 +254,7 @@ def from_quantized(
from ..utils.vllm import load_model_by_vllm, vllm_generate

model = load_model_by_vllm(
model=model_id_or_path,
model=model_local_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
@@ -253,7 +267,7 @@ def from_quantized(
from ..utils.sglang import load_model_by_sglang, sglang_generate

model, hf_config = load_model_by_sglang(
model=model_id_or_path,
model=model_local_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
@@ -264,6 +278,7 @@ def from_quantized(
quantized=True,
quantize_config=quantize_config,
qlinear_kernel=None,
model_local_path=model_local_path,
)

if quantize_config.format == FORMAT.MARLIN:
@@ -299,11 +314,11 @@ def from_quantized(

extensions = [".safetensors"]

model_id_or_path = str(model_id_or_path)
model_local_path = str(model_local_path)

# Retrieve (and if necessary download) the quantized checkpoint(s).
is_sharded, resolved_archive_file, true_model_basename = get_checkpoints(
model_id_or_path=model_id_or_path,
model_id_or_path=model_local_path,
extensions=extensions,
possible_model_basenames=possible_model_basenames,
**cached_file_kwargs,
@@ -529,7 +544,7 @@ def skip(*args, **kwargs):
qlinear_kernel=qlinear_kernel,
load_quantized_model=True,
trust_remote_code=trust_remote_code,
model_id_or_path=model_id_or_path,
model_local_path=model_local_path,
)

cls.from_quantized = from_quantized
6 changes: 3 additions & 3 deletions gptqmodel/models/writer.py
@@ -84,7 +84,7 @@ def save_quantized(
w.writerows([[entry.get(QUANT_LOG_LAYER), entry.get(QUANT_LOG_MODULE), entry.get(QUANT_LOG_LOSS),
entry.get(QUANT_LOG_DAMP), entry.get(QUANT_LOG_TIME)] for entry in self.quant_log])

pre_quantized_size_mb = get_model_files_size(self.model_id_or_path)
pre_quantized_size_mb = get_model_files_size(self.model_local_path)
pre_quantized_size_gb = pre_quantized_size_mb / 1024

quantizers = [f"{META_QUANTIZER_GPTQMODEL}:{__version__}"]
@@ -171,7 +171,7 @@ def save_quantized(
else:
model = self.get_model_with_quantize(
quantize_config=quantize_config,
model_id_or_path=self.model_id_or_path,
model_id_or_path=self.model_local_path,
)

model.to(CPU)
@@ -311,7 +311,7 @@ def save_quantized(

# need to copy .py files for model/tokenizers not yet merged to HF transformers
if self.trust_remote_code:
copy_py_files(save_dir, model_id_or_path=self.model_id_or_path)
copy_py_files(save_dir, model_id_or_path=self.model_local_path)

cls.save_quantized = save_quantized

2 changes: 1 addition & 1 deletion gptqmodel/nn_modules/qlinear/ipex.py
@@ -27,7 +27,7 @@
try:
from intel_extension_for_pytorch.llm.quantization import IPEXWeightOnlyQuantizedLinear
HAS_IPEX = True
except Exception:
except BaseException:
HAS_IPEX = False
IPEX_ERROR_LOG = Exception

2 changes: 1 addition & 1 deletion gptqmodel/version.py
@@ -1 +1 @@
__version__ = "1.4.6-dev"
__version__ = "1.5.0"
2 changes: 1 addition & 1 deletion setup.py
@@ -277,7 +277,7 @@ def run(self):
'ipex': ["intel_extension_for_pytorch>=2.5.0"],
'auto_round': ["auto_round>=0.3"],
'logger': ["clearml", "random_word", "plotly"],
'eval': ["lm_eval>=0.4.6", "evalplus>=0.3.1"],
'eval': ["lm_eval>=0.4.7", "evalplus>=0.3.1"],
'triton': ["triton>=2.0.0"]
},
include_dirs=include_dirs,
6 changes: 3 additions & 3 deletions tests/models/model_test.py
@@ -187,7 +187,7 @@ def lm_eval(self, model, apply_chat_template=False, trust_remote_code=False, del
try:
with tempfile.TemporaryDirectory() as tmp_dir:
if self.USE_VLLM:
model_args = f"pretrained={model.model_id_or_path},dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=1,trust_remote_code={trust_remote_code},max_model_len={self.MODEL_MAX_LEN}"
model_args = f"pretrained={model.model_local_path},dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=1,trust_remote_code={trust_remote_code},max_model_len={self.MODEL_MAX_LEN}"
else:
model_args = ""
results = lm_eval(
@@ -216,8 +216,8 @@ def lm_eval(self, model, apply_chat_template=False, trust_remote_code=False, del
if metric != 'alias' and 'stderr' not in metric
}
print(task_results)
if delete_quantized_model and model.model_id_or_path.startswith("/tmp") and os.path.exists(model.model_id_or_path):
shutil.rmtree(model.model_id_or_path)
if delete_quantized_model and model.model_local_path.startswith("/tmp") and os.path.exists(model.model_local_path):
shutil.rmtree(model.model_local_path)
return task_results
except BaseException as e:
if isinstance(e, torch.OutOfMemoryError):
