Hugging Face Llama Recipes

🤗🦙Welcome! This repository contains minimal recipes to get started quickly with Llama 3.x models, including Llama 3.1 and Llama 3.2.

To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post (3.1).
To get an overview of Llama 3.2, please visit the Hugging Face announcement blog post (3.2).
For more advanced end-to-end use cases with open ML, please visit the Open Source AI Cookbook.

This repository is a work in progress, so you might see considerable changes in the coming days.

Note

To use Llama 3.x, you need to accept the license and request permission to access the models. Please visit the Hugging Face repos and submit your request. You only need to do this once per collection; you'll get access to all the repos in the collection if your request is approved.

Getting Started

The easiest way to quickly run a Llama 🦙 on your machine is with the 🤗 transformers library. Make sure you have the latest release installed.

$ pip install -U transformers

Let us chat with an instruction-tuned model.

import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

llama_31 = "meta-llama/Llama-3.1-8B-Instruct" # <-- llama 3.1
llama_32 = "meta-llama/Llama-3.2-3B-Instruct" # <-- llama 3.2

prompt = [
    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

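# Load the chosen model in bfloat16 on the selected device.
generator = pipeline(model=llama_32, device=device, torch_dtype=torch.bfloat16)
generation = generator(
    prompt,
    do_sample=False,    # greedy decoding, so the output is deterministic
    temperature=1.0,    # neutral value; has no effect when do_sample=False
    top_p=1,            # neutral value; has no effect when do_sample=False
    max_new_tokens=50,  # cap the length of the assistant reply
)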

print(f"Generation: {generation[0]['generated_text']}")

# Generation:
# [
#   {'role': 'system', 'content': 'You are a helpful assistant, that responds as a pirate.'},
#   {'role': 'user', 'content': "What's Deep Learning?"},
#   {'role': 'assistant', 'content': "Yer lookin' fer a treasure trove o'
#             knowledge on Deep Learnin', eh? Alright then, listen close and
#             I'll tell ye about it.\n\nDeep Learnin' be a type o' machine
#             learnin' that uses neural networks"}
# ]
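
As the sample output above shows, for chat-style input the pipeline returns the whole conversation, with the model's reply as the last message. A small helper can pull out just that reply. This is a minimal sketch: the helper name `last_assistant_reply` is an illustrative assumption, and `sample_generation` only mimics the structure printed above so that the snippet runs without downloading a model.

```python
from typing import Any, Dict, List


def last_assistant_reply(generation: List[Dict[str, Any]]) -> str:
    """Return the text of the final assistant message from a chat pipeline result."""
    messages = generation[0]["generated_text"]
    final = messages[-1]
    if final["role"] != "assistant":
        raise ValueError("the last message is not an assistant reply")
    return final["content"]


# Hand-written stand-in with the same shape as the pipeline output shown above.
sample_generation = [
    {
        "generated_text": [
            {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
            {"role": "user", "content": "What's Deep Learning?"},
            {"role": "assistant", "content": "Yer lookin' fer a treasure trove o' knowledge on Deep Learnin', eh?"},
        ]
    }
]

print(last_assistant_reply(sample_generation))
```

With a real run, `generation` from the snippet above can be passed in place of `sample_generation`.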

Local Inference

Would you like to run inference of the Llama models locally? So do we! The memory requirements depend on the model size and the precision of the weights. Here's a table showing the approximate memory needed for different configurations:

Model Size	Llama Variant	BF16/FP16	FP8	INT4 (AWQ/GPTQ/bnb)
1B	3.2	2.5 GB	1.25 GB	0.75 GB
3B	3.2	6.5 GB	3.2 GB	1.75 GB
8B	3.1	16 GB	8 GB	4 GB
70B	3.1	140 GB	70 GB	35 GB
405B	3.1	810 GB	405 GB	204 GB

Note

These are estimated values and may vary based on specific implementation details and optimizations.
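
Most of the table follows a simple rule of thumb: parameter count times bytes per parameter. The sketch below applies that rule; the function name `estimate_weight_memory_gb` is an illustrative assumption, and the estimate covers the weights only (no activations, KV cache, or quantization overhead), which is presumably why some table entries are slightly higher than the raw product.

```python
def estimate_weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # (params_billions * 1e9 params) * bytes_per_param / 1e9 bytes per GB
    return params_billions * bytes_per_param


precisions = [("BF16/FP16", 16), ("FP8", 8), ("INT4", 4)]

for name, params_billions in [("8B", 8), ("70B", 70), ("405B", 405)]:
    estimates = {
        label: estimate_weight_memory_gb(params_billions, bits)
        for label, bits in precisions
    }
    print(name, estimates)
```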

Working with the capable Llama 3.1 8B models:

Working on the 🐘 big Llama 3.1 405B model:

Model Fine Tuning:

It is often not enough to run inference on the model. Many times, you need to fine-tune the model on some custom dataset. Here are some scripts showing how to fine-tune the models.

Fine tune models on your custom dataset:

Assisted Decoding Techniques

Do you want to use the smaller Llama 3.2 models to speed up text generation of bigger models? These notebooks showcase assisted decoding (speculative decoding), which gives you up to 2x speedups for text generation on Llama 3.1 70B (with greedy decoding).
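
To make the idea concrete, here is a toy sketch of greedy assisted decoding in plain Python. It is not the notebooks' code: `toy_target` and `toy_draft` are hypothetical stand-ins for a large and a small model, each mapping a token sequence to its next token. The draft proposes a few tokens, the target checks them, and the longest agreeing prefix is kept, so the result matches what the target would produce on its own. In transformers, the corresponding feature is enabled by passing `assistant_model` to `generate`.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def assisted_decode(
    target: NextToken,
    draft: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    lookahead: int = 4,
) -> Tuple[List[int], int]:
    """Greedy assisted decoding; returns the tokens and the number of verification rounds."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    verification_rounds = 0
    while len(tokens) < limit:
        # 1. The cheap draft model proposes up to `lookahead` tokens.
        proposal: List[int] = []
        for _ in range(min(lookahead, limit - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal. A real model scores every proposed
        #    position in a single forward pass, so this counts as one round.
        verification_rounds += 1
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)  # always keep the target's own choice
            if guess != expected:
                break  # discard the rest of the proposal after a mismatch
    return tokens, verification_rounds


def toy_target(tokens: List[int]) -> int:
    # Hypothetical "big model": a fixed rule on the last token.
    return (3 * tokens[-1] + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    # Hypothetical "small model": agrees with the target except after token 5.
    return 0 if tokens[-1] == 5 else toy_target(tokens)


baseline = greedy_decode(toy_target, [1], max_new_tokens=12)
assisted, rounds = assisted_decode(toy_target, toy_draft, [1], max_new_tokens=12)
print(assisted == baseline, rounds)
```

The speedup comes from the number of verification rounds being smaller than the number of generated tokens whenever the draft model guesses well.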

Performance Optimization

Let us optimize performance, shall we?

API inference

Are these models too large for you to run at home? Would you like to experiment with Llama 70B? Try out the following examples!

Use the Inference API for PRO users

Llama Guard and Prompt Guard

In addition to the generative models, Meta released two new models: Llama Guard 3 and Prompt Guard. Prompt Guard is a small classifier that detects jailbreaks and prompt injections. Llama Guard 3 is a safeguard model that can classify LLM inputs and generations. Learn how to use them in the following notebooks:

Synthetic Data Generation

With ever-hungry models, the need for synthetic data generation is on the rise. Here we show you how to build your very own synthetic dataset.

Generate synthetic data with distilabel

Llama RAG

Seeking an entry-level RAG pipeline? This notebook guides you through building a very simple streamlined RAG experiment using Llama and Hugging Face.

Simple RAG Pipeline
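
The core loop of a RAG pipeline is: retrieve the most relevant text, then place it in the prompt. The sketch below illustrates that flow with a toy bag-of-words retriever in place of an embedding model, so it runs with the standard library only. It is not the notebook's implementation; the function names and the example documents are illustrative assumptions. The resulting `messages` list uses the same chat format as the Getting Started example and could be passed to that pipeline.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a toy substitute for an embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: List[str], top_k: int = 1) -> List[str]:
    """Rank documents by similarity to the question and keep the best top_k."""
    query = bag_of_words(question)
    ranked = sorted(documents, key=lambda d: cosine(query, bag_of_words(d)), reverse=True)
    return ranked[:top_k]


def build_messages(question: str, documents: List[str], top_k: int = 1) -> List[Dict[str, str]]:
    """Put the retrieved context into the system message of a chat prompt."""
    context = "\n".join(retrieve(question, documents, top_k))
    return [
        {"role": "system", "content": "Answer using only the context below.\n\nContext:\n" + context},
        {"role": "user", "content": question},
    ]


# Example documents (taken from sentences in this README).
docs = [
    "Llama Guard 3 is a safeguard model that can classify LLM inputs and generations.",
    "Prompt Guard is a small classifier that detects jailbreaks and prompt injections.",
    "Assisted decoding uses a smaller model to speed up text generation of a bigger model.",
]

messages = build_messages("What does Prompt Guard detect?", docs)
print(messages[0]["content"])
```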

Text Generation Inference (TGI) & API Inference with Llama Models

The Text Generation Inference (TGI) framework enables efficient and scalable deployment of Llama models. In this notebook we'll learn how to integrate TGI for fast text generation and how to consume already deployed Llama models via the Inference API:

Text Generation Inference (TGI) with Llama Models
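
As a rough illustration of what talking to a TGI server looks like, the sketch below builds a request for TGI's `/generate` endpoint using only the standard library. The server address `http://localhost:8080` and the helper name `build_tgi_request` are assumptions for illustration; the notebook may use a client library instead. The request is only constructed here; sending it requires a running TGI server.

```python
import json
import urllib.request


def build_tgi_request(base_url: str, prompt: str, max_new_tokens: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request for a TGI server's /generate endpoint."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local server address; replace with wherever your TGI instance runs.
request = build_tgi_request("http://localhost:8080", "What's Deep Learning?")
print(request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read())["generated_text"])
```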
