Install the following packages; they can also be found in requirements.txt. We are actively working on upstreaming both the transformers and vllm changes required to run Arctic, but for now you will need to use our forks.
deepspeed>=0.14.2
git+https://github.com/Snowflake-Labs/transformers.git@arctic
git+https://github.com/Snowflake-Labs/vllm.git@arctic
huggingface_hub[hf_transfer]
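For instance, you can install everything directly with pip (assuming the forked repos are reachable from your environment; if you have the requirements file checked out, pip install -r requirements.txt does the same thing):

pip install "deepspeed>=0.14.2"
pip install git+https://github.com/Snowflake-Labs/transformers.git@arctic
pip install git+https://github.com/Snowflake-Labs/vllm.git@arctic
pip install "huggingface_hub[hf_transfer]"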
We highly recommend using hf_transfer to download the Arctic weights; it greatly reduces the time you spend waiting for the checkpoint shards to download.
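For example, you can pre-fetch the checkpoint shards with hf_transfer enabled; this is an optional sketch using huggingface_hub's snapshot_download, not a required step:

import os
# hf_transfer is only picked up if this is set before huggingface_hub is imported
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

# download all shards of the instruct checkpoint into the local HF cache;
# a later from_pretrained call then loads from disk instead of the network
snapshot_download(repo_id="Snowflake/snowflake-arctic-instruct")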
Due to the model size, we recommend a single 8xH100 instance from your favorite cloud provider, such as AWS p5.48xlarge or Azure ND96isr_H100_v5.
In this example we use the FP8 quantization that DeepSpeed provides in the backend; you can also use FP6 quantization by specifying q_bits=6 in the ArcticQuantizationConfig. The "150GiB" setting for max_memory is required until DeepSpeed's FP quantization is supported natively as an HFQuantizer, which we are actively working on.
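For instance, the FP6 variant differs from the FP8 setup below only in the q_bits argument (a sketch mirroring the config used in the full example):

from transformers.models.arctic.configuration_arctic import ArcticQuantizationConfig

quant_config_fp8 = ArcticQuantizationConfig(q_bits=8)  # FP8, used in the example below
quant_config_fp6 = ArcticQuantizationConfig(q_bits=6)  # FP6 alternative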
import os
# enable hf_transfer for faster ckpt download
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.arctic.configuration_arctic import ArcticQuantizationConfig
tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-instruct")

# FP8 quantization via DeepSpeed; use q_bits=6 for FP6 instead
quant_config = ArcticQuantizationConfig(q_bits=8)

model = AutoModelForCausalLM.from_pretrained(
    "Snowflake/snowflake-arctic-instruct",
    low_cpu_mem_usage=True,
    device_map="auto",
    ds_quantization_config=quant_config,
    # required until DeepSpeed FP quantization lands as a native HFQuantizer
    max_memory={i: "150GiB" for i in range(8)},
    torch_dtype=torch.bfloat16)

messages = [{"role": "user", "content": "What is 1 + 1?"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(input_ids=input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
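If you want only the generated reply without the echoed prompt, one option (our addition, not part of the original snippet) is to slice off the input tokens before decoding:

# decode only the newly generated tokens, skipping the prompt
reply = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(reply)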