Mastering LLM Techniques: Inference Optimization | NVIDIA Technical Blog #153
Stacking transformer layers to create large models results in better accuracies, few-shot learning capabilities, and even near-human emergent abilities on a wide range of language tasks. These foundation models are expensive to train, and they can be memory- and compute-intensive during inference (a recurring cost). The most popular large language models (LLMs) today can reach tens to hundreds of billions of parameters in size and, depending on the use case, may require ingesting long inputs (or contexts), which can also add expense.
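As a rough illustration of why inference is memory-intensive and why long inputs add expense, the following Python sketch estimates the memory needed just to hold the weights and the per-sequence key-value (KV) cache of a hypothetical 70-billion-parameter model in 16-bit precision. The layer count and hidden size are illustrative assumptions (not figures from this post), and the estimate ignores optimizations such as grouped-query attention or cache quantization.

```python
# Back-of-envelope memory estimate for serving a decoder-only transformer.
# All model dimensions below are illustrative assumptions, roughly matching
# a 70B-parameter configuration; they are not taken from the article.

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights (2 bytes/param for FP16 or BF16)."""
    return num_params * bytes_per_param / 1e9

def kv_cache_memory_gb(num_layers: int, hidden_size: int, seq_len: int,
                       batch_size: int = 1, bytes_per_elem: int = 2) -> float:
    """KV cache size: 2 (one K and one V vector) * layers * hidden_size
    elements per token, scaled by sequence length and batch size."""
    return (2 * num_layers * hidden_size * seq_len * batch_size
            * bytes_per_elem / 1e9)

if __name__ == "__main__":
    params = 70e9              # assumed parameter count
    layers, hidden = 80, 8192  # assumed depth and hidden size
    print(f"weights:  {weight_memory_gb(params):.0f} GB")   # ~140 GB in FP16
    print(f"KV cache: {kv_cache_memory_gb(layers, hidden, seq_len=4096):.0f} "
          f"GB per 4K-token sequence")                       # ~11 GB
```

Even under these simplified assumptions, the weights alone exceed the memory of a single commodity GPU, and the KV cache grows linearly with both context length and batch size, which is why long contexts and high-throughput serving drive up cost.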
This post discusses the most pressing challenges in LLM inference, along with some practical solutions. Readers should have a basic understanding of the transformer architecture and the attention mechanism in general. A grasp of the intricacies of LLM inference is also essential, and we address it in the next section.