Commit

Merge remote-tracking branch 'origin/main' into CSY/new-ci
# Conflicts:
#	.github/workflows/unit_tests.yml
CSY-ModelCloud committed Dec 24, 2024
2 parents 34e4ff2 + 4197cd8 commit 5d0fc04
Showing 9 changed files with 51 additions and 34 deletions.
12 changes: 7 additions & 5 deletions README.md
@@ -9,18 +9,20 @@
</p>

## News
* 12/23/2024 [1.5.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.5.0): Multi-modal (image-to-text) optimized quantization support has been added for Qwen 2-VL and Ovis 1.6-VL. Previous image-to-text model quantizations did not use image calibration data, resulting in less than optimal post-quantization results. Version 1.5.0 is the first release to provide a stable path for multi-modal quantization: only text layers are quantized.
* 12/19/2024 [1.4.5](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.5): Windows 11 support added/validated. Ovis VL model support with image dataset calibration. Fixed `dynamic` loading. Reduced quantization VRAM usage.
* 12/15/2024 [1.4.2](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.2): macOS `gpu` (Metal) and `cpu` (M+) support added/validated for inference and quantization. Cohere 2 model support added.
* 12/13/2024 [1.4.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.1): Added Qwen2-VL model support. `mse` quantization control exposed in `QuantizeConfig`. Monkey-patch `patch_vllm()` and `patch_hf()` APIs added to allow Transformers/Optimum/PEFT and vLLM to correctly load GPTQModel quantized models while upstream PRs are pending.
* 12/10/2024 [1.4.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.4.0) `EvalPlus` harness integration merged upstream. We now support both `lm-eval` and `EvalPlus`. Added pure torch `Torch` kernel. Refactored `Cuda` kernel to be `DynamicCuda` kernel. `Triton` kernel now auto-padded for max model support. `Dynamic` quantization now supports both positive `+:` (default) and negative `-:` matching, which allows matched modules to be skipped entirely during quantization (see the sketch below the news list). Fixed auto-`Marlin` kernel selection. Added auto-kernel fallback for unsupported kernel/module pairs. Lots of internal refactoring and cleanup in preparation for the transformers/optimum/peft upstream PR merge. Deprecated saving in the `Marlin` weight format since `Marlin` supports auto conversion of the `gptq` format to `Marlin` at runtime.

* 11/29/2024 [1.3.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.1) Olmo2 model support. Intel XPU acceleration via IPEX. Fixed Transformers compat for model sharding caused by an API deprecation in HF. Removed the hard `triton` dependency; the Triton kernel now optionally depends on the `triton` package.
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced `gputil` + `psutil` depends with `device-smi`. Fixed model unit tests.
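
For readers unfamiliar with the `dynamic` matching described in the 1.4.0 note above, a minimal sketch follows. The regex keys, override values, and paths are illustrative placeholders and not part of this commit; the `QuantizeConfig(dynamic=...)` shape is assumed from the notes above.

```
from gptqmodel import GPTQModel, QuantizeConfig

# `dynamic` maps module-matching regexes to per-module overrides.
# `+:` (the default) applies the overrides to matched modules;
# `-:` excludes matched modules from quantization entirely.
dynamic = {
    r"+:model\.layers\.[0-9]\..*": {"bits": 8, "group_size": 32},  # layers 0-9 at higher precision
    r"-:model\.layers\..*\.mlp\.gate_proj": {},                    # skip gate projections
}

quant_config = QuantizeConfig(bits=4, group_size=128, dynamic=dynamic)

# Placeholder model path and a tiny placeholder calibration list; use a real
# calibration dataset (hundreds of rows) for meaningful results.
model = GPTQModel.load("path/to/base-model", quant_config)
model.quantize(["gptqmodel is an easy-to-use model quantization library."])
model.save("path/to/quantized-model")
```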

<details>

<summary>Archived News:</summary>
* 11/26/2024 [1.3.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.3.0) Zero-Day Hymba model support. Removed `tqdm` and `rogue` dependencies.
* 11/24/2024 [1.2.3](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.3) HF GLM model support. ClearML logging integration. Replaced `gputil` + `psutil` depends with `device-smi`. Fixed model unit tests.

* 11/11/2024 🚀 [1.2.1](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.2.1) Meta MobileLLM model support added. `lm-eval[gptqmodel]` integration merged upstream. Intel/IPEX cpu inference merged replacing QBits (deprecated). Auto-fix/patch ChatGLM-3/GLM-4 compat with latest transformers. New `.load()` and `.save()` api.

* 10/29/2024 🚀 [1.1.0](https://github.com/ModelCloud/GPTQModel/releases/tag/v1.1.0) IBM Granite model support. Full auto-buildless wheel install from pypi. Reduce max cpu memory usage by >20% during quantization. 100% CI model/feature coverage.
@@ -78,7 +80,7 @@ Public tests/papers and ModelCloud's internal tests have shown that GPTQ is on-p
* 🚀 40% faster `packing` stage in quantization (Llama 3.1 8B). 50% faster PPL calculations (OPT).

## Quality: GPTQModel 4bit can match BF16:
🤗 [ModelCloud quantized ultra-high recovery vortex-series models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
🤗 [ModelCloud quantized Vortex models on HF](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)

![image](https://github.com/user-attachments/assets/7b2db012-b8af-4d19-a25d-7023cef19220)

@@ -183,8 +185,8 @@ GPTQModel inference is integrated into both [lm-eval](https://github.com/Eleuthe
We highly recommend avoiding `ppl` and instead using `lm-eval`/`evalplus` to validate post-quantization model quality. `ppl` should only be used for regression tests and is not a good indicator of model output quality.

```
# gptqmodel is integrated into lm-eval >= v0.4.6
pip install lm-eval>=0.4.6
# gptqmodel is integrated into lm-eval >= v0.4.7
pip install lm-eval>=0.4.7
```
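
Below is a minimal sketch of driving `lm-eval` from Python against a quantized checkpoint. The model path and task are placeholders, and the flow is an assumption based on the integration note above rather than the README's own example.

```
import lm_eval

# "hf" selects lm-eval's Hugging Face loader; the assumption here is that,
# with gptqmodel installed, GPTQ-format checkpoints load through the
# Transformers integration.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/quantized-model,dtype=auto",  # placeholder path
    tasks=["arc_challenge"],
)
print(results["results"])
```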

6 changes: 3 additions & 3 deletions gptqmodel/models/base.py
@@ -101,7 +101,7 @@ def __init__(
qlinear_kernel: nn.Module = None,
load_quantized_model: bool = False,
trust_remote_code: bool = False,
model_id_or_path: str = None,
model_local_path: str = None,
):
super().__init__()

@@ -114,7 +114,7 @@ def __init__(
# compat: state to assist in checkpoint_format gptq(v1) to gptq_v2 conversion
self.qlinear_kernel = qlinear_kernel
self.trust_remote_code = trust_remote_code
self.model_id_or_path = model_id_or_path
self.model_local_path = model_local_path
# stores all per-layer quant stats such as avg loss and processing time
self.quant_log = []

@@ -774,7 +774,7 @@ def save(
):
extra_json_file_names = ["preprocessor_config.json", "chat_template.json"]
for name in extra_json_file_names:
json_path = os.path.join(self.model_id_or_path, name)
json_path = os.path.join(self.model_local_path, name)
if os.path.exists(json_path):
os.makedirs(save_dir, exist_ok=True)

4 changes: 2 additions & 2 deletions gptqmodel/models/definitions/qwen2_vl.py
@@ -88,10 +88,10 @@ def prepare_dataset(
import json

if tokenizer is None:
tokenizer = AutoTokenizer.from_pretrained(self.model_id_or_path)
tokenizer = AutoTokenizer.from_pretrained(self.model_local_path)

with tempfile.TemporaryDirectory() as tmp_dir:
chat_template_file = os.path.join(self.model_id_or_path, "chat_template.json")
chat_template_file = os.path.join(self.model_local_path, "chat_template.json")
if os.path.exists(chat_template_file):
shutil.copyfile(chat_template_file, os.path.join(tmp_dir, "chat_template.json"))
tokenizer.save_pretrained(tmp_dir)
45 changes: 30 additions & 15 deletions gptqmodel/models/loader.py
@@ -1,5 +1,6 @@
from __future__ import annotations

import os
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List, Optional, Union

@@ -36,6 +37,7 @@
verify_sharded_model_hashes,
)
from ._const import DEVICE, SUPPORTED_MODELS, normalize_device
from huggingface_hub import snapshot_download


logger = setup_logger()
@@ -73,17 +75,25 @@ def compare_versions(installed_version, required_version, operator):
raise ValueError(f"Unsupported operator: {operator}")


def check_versions(model_id_or_path: str, requirements: List[str]):
def check_versions(model_class, requirements: List[str]):
if requirements is None:
return
for req in requirements:
pkg, operator, version_required = parse_requirement(req)
try:
installed_version = version(pkg)
if not compare_versions(installed_version, version_required, operator):
raise ValueError(f"{model_id_or_path} requires version {req}, but current {pkg} version is {installed_version} ")
raise ValueError(f"{model_class} requires version {req}, but current {pkg} version is {installed_version} ")
except PackageNotFoundError:
raise ValueError(f"{model_id_or_path} requires version {req}, but {pkg} not installed.")
raise ValueError(f"{model_class} requires version {req}, but {pkg} not installed.")


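# Resolve the model reference to a local path: existing directories are
# returned as-is; anything else is treated as a Hub repo id and fetched via
# huggingface_hub.snapshot_download(), which returns the local cache path.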
def get_model_local_path(pretrained_model_id_or_path, **kwargs):
is_local = os.path.isdir(pretrained_model_id_or_path)
if is_local:
return pretrained_model_id_or_path
else:
return snapshot_download(pretrained_model_id_or_path, **kwargs)


def ModelLoader(cls):
Expand All @@ -106,7 +116,9 @@ def from_pretrained(
f"{pretrained_model_id_or_path} requires trust_remote_code=True. Please set trust_remote_code=True to load this model."
)

check_versions(pretrained_model_id_or_path, cls.require_pkgs_version)
check_versions(cls, cls.require_pkgs_version)

model_local_path = get_model_local_path(pretrained_model_id_or_path, **model_init_kwargs)

def skip(*args, **kwargs):
pass
@@ -117,7 +129,7 @@ def skip(*args, **kwargs):

model_init_kwargs["trust_remote_code"] = trust_remote_code

config = AutoConfig.from_pretrained(pretrained_model_id_or_path, **model_init_kwargs)
config = AutoConfig.from_pretrained(model_local_path, **model_init_kwargs)

if torch_dtype is None or torch_dtype == "auto":
torch_dtype = auto_dtype_from_config(config)
Expand All @@ -130,7 +142,7 @@ def skip(*args, **kwargs):
if config.model_type not in SUPPORTED_MODELS:
raise TypeError(f"{config.model_type} isn't supported yet.")

model = cls.loader.from_pretrained(pretrained_model_id_or_path, **model_init_kwargs)
model = cls.loader.from_pretrained(model_local_path, **model_init_kwargs)

model_config = model.config.to_dict()
seq_len_keys = ["max_position_embeddings", "seq_length", "n_positions", "multimodal_max_length"]
@@ -149,7 +161,7 @@ def skip(*args, **kwargs):
quantized=False,
quantize_config=quantize_config,
trust_remote_code=trust_remote_code,
model_id_or_path=pretrained_model_id_or_path
model_local_path=model_local_path,
)

cls.from_pretrained = from_pretrained
@@ -189,7 +201,9 @@ def from_quantized(
f"{model_id_or_path} requires trust_remote_code=True. Please set trust_remote_code=True to load this model."
)

check_versions(model_id_or_path, cls.require_pkgs_version)
check_versions(cls, cls.require_pkgs_version)

model_local_path = get_model_local_path(model_id_or_path, **kwargs)

# Parameters related to loading from Hugging Face Hub
cache_dir = kwargs.pop("cache_dir", None)
@@ -217,7 +231,7 @@ def from_quantized(

# == step1: prepare configs and file names == #
config: PretrainedConfig = AutoConfig.from_pretrained(
model_id_or_path,
model_local_path,
trust_remote_code=trust_remote_code,
**cached_file_kwargs,
)
@@ -231,7 +245,7 @@ def from_quantized(
if config.model_type not in SUPPORTED_MODELS:
raise TypeError(f"{config.model_type} isn't supported yet.")

quantize_config = QuantizeConfig.from_pretrained(model_id_or_path, **cached_file_kwargs, **kwargs)
quantize_config = QuantizeConfig.from_pretrained(model_local_path, **cached_file_kwargs, **kwargs)

if backend == BACKEND.VLLM or backend == BACKEND.SGLANG:
if quantize_config.format != FORMAT.GPTQ:
@@ -240,7 +254,7 @@ def from_quantized(
from ..utils.vllm import load_model_by_vllm, vllm_generate

model = load_model_by_vllm(
model=model_id_or_path,
model=model_local_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
@@ -253,7 +267,7 @@ def from_quantized(
from ..utils.sglang import load_model_by_sglang, sglang_generate

model, hf_config = load_model_by_sglang(
model=model_id_or_path,
model=model_local_path,
trust_remote_code=trust_remote_code,
**kwargs,
)
@@ -264,6 +278,7 @@ def from_quantized(
quantized=True,
quantize_config=quantize_config,
qlinear_kernel=None,
model_local_path=model_local_path,
)

if quantize_config.format == FORMAT.MARLIN:
@@ -299,11 +314,11 @@ def from_quantized(

extensions = [".safetensors"]

model_id_or_path = str(model_id_or_path)
model_local_path = str(model_local_path)

# Retrieve (and if necessary download) the quantized checkpoint(s).
is_sharded, resolved_archive_file, true_model_basename = get_checkpoints(
model_id_or_path=model_id_or_path,
model_id_or_path=model_local_path,
extensions=extensions,
possible_model_basenames=possible_model_basenames,
**cached_file_kwargs,
@@ -529,7 +544,7 @@ def skip(*args, **kwargs):
qlinear_kernel=qlinear_kernel,
load_quantized_model=True,
trust_remote_code=trust_remote_code,
model_id_or_path=model_id_or_path,
model_local_path=model_local_path,
)

cls.from_quantized = from_quantized
6 changes: 3 additions & 3 deletions gptqmodel/models/writer.py
@@ -84,7 +84,7 @@ def save_quantized(
w.writerows([[entry.get(QUANT_LOG_LAYER), entry.get(QUANT_LOG_MODULE), entry.get(QUANT_LOG_LOSS),
entry.get(QUANT_LOG_DAMP), entry.get(QUANT_LOG_TIME)] for entry in self.quant_log])

pre_quantized_size_mb = get_model_files_size(self.model_id_or_path)
pre_quantized_size_mb = get_model_files_size(self.model_local_path)
pre_quantized_size_gb = pre_quantized_size_mb / 1024

quantizers = [f"{META_QUANTIZER_GPTQMODEL}:{__version__}"]
@@ -171,7 +171,7 @@ def save_quantized(
else:
model = self.get_model_with_quantize(
quantize_config=quantize_config,
model_id_or_path=self.model_id_or_path,
model_id_or_path=self.model_local_path,
)

model.to(CPU)
@@ -311,7 +311,7 @@ def save_quantized(

# need to copy .py files for model/tokenizers not yet merged to HF transformers
if self.trust_remote_code:
copy_py_files(save_dir, model_id_or_path=self.model_id_or_path)
copy_py_files(save_dir, model_id_or_path=self.model_local_path)

cls.save_quantized = save_quantized

2 changes: 1 addition & 1 deletion gptqmodel/nn_modules/qlinear/ipex.py
@@ -27,7 +27,7 @@
try:
from intel_extension_for_pytorch.llm.quantization import IPEXWeightOnlyQuantizedLinear
HAS_IPEX = True
except Exception:
except BaseException:
HAS_IPEX = False
IPEX_ERROR_LOG = Exception

2 changes: 1 addition & 1 deletion gptqmodel/version.py
@@ -1 +1 @@
__version__ = "1.4.6-dev"
__version__ = "1.5.0"
2 changes: 1 addition & 1 deletion setup.py
@@ -277,7 +277,7 @@ def run(self):
'ipex': ["intel_extension_for_pytorch>=2.5.0"],
'auto_round': ["auto_round>=0.3"],
'logger': ["clearml", "random_word", "plotly"],
'eval': ["lm_eval>=0.4.6", "evalplus>=0.3.1"],
'eval': ["lm_eval>=0.4.7", "evalplus>=0.3.1"],
'triton': ["triton>=2.0.0"]
},
include_dirs=include_dirs,
6 changes: 3 additions & 3 deletions tests/models/model_test.py
@@ -187,7 +187,7 @@ def lm_eval(self, model, apply_chat_template=False, trust_remote_code=False, del
try:
with tempfile.TemporaryDirectory() as tmp_dir:
if self.USE_VLLM:
model_args = f"pretrained={model.model_id_or_path},dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=1,trust_remote_code={trust_remote_code},max_model_len={self.MODEL_MAX_LEN}"
model_args = f"pretrained={model.model_local_path},dtype=auto,gpu_memory_utilization=0.8,tensor_parallel_size=1,trust_remote_code={trust_remote_code},max_model_len={self.MODEL_MAX_LEN}"
else:
model_args = ""
results = lm_eval(
@@ -216,8 +216,8 @@ def lm_eval(self, model, apply_chat_template=False, trust_remote_code=False, del
if metric != 'alias' and 'stderr' not in metric
}
print(task_results)
if delete_quantized_model and model.model_id_or_path.startswith("/tmp") and os.path.exists(model.model_id_or_path):
shutil.rmtree(model.model_id_or_path)
if delete_quantized_model and model.model_local_path.startswith("/tmp") and os.path.exists(model.model_local_path):
shutil.rmtree(model.model_local_path)
return task_results
except BaseException as e:
if isinstance(e, torch.OutOfMemoryError):
