<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Quantization in Deep Learning</title>
<style>
body {
font-family: Arial, sans-serif;
line-height: 1.6;
margin: 20px;
}
h1 {
color: #333;
}
h2 {
color: #555;
}
p {
margin-bottom: 15px;
}
</style>
</head>
<body>
<h1>Quantization Techniques in LLMs</h1>
<h2>Overview</h2>
<p>Training deep neural networks typically requires powerful graphics processing units (GPUs) to process large volumes of data and perform the intensive computation involved. In practice, however, assuming unlimited access to such computational resources is rarely realistic, because GPUs are expensive.</p>

<h2>Understanding Quantization</h2>
<p>Quantization in deep learning addresses these challenges by reducing the computational and memory cost of deep learning models. It does so by representing weights and activations with low-precision data types: a 32-bit floating-point value (float32), for example, demands considerably more memory and compute than an 8-bit integer (int8). Lowering the number of bits used to represent each value therefore directly reduces computational expense.</p>
<p>Moreover, mathematical operations, such as matrix multiplications, can be executed more swiftly when employing lower precision data types. Currently, many neural networks are trained using 16-bit floating-point formats, such as fp16 or bfloat16, which are widely supported by deep learning accelerators. Post-training, these networks can be deployed for inference using even lower-precision formats, which can include floating-point, fixed-point, and integer representations.</p>
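<p>To make the float32-to-int8 mapping concrete, the following sketch (an illustrative assumption, not code from any of the papers discussed here) quantizes a tensor with a simple affine scheme: a scale and zero-point are derived from the observed value range, values are rounded to 8-bit integers, and dequantization approximately recovers the originals.</p>
<pre><code>
# Minimal affine (asymmetric) int8 quantization sketch using NumPy.
# Illustrative only; real frameworks add per-channel scales, calibration, etc.
import numpy as np

def quantize_int8(x):
    """Map a float32 array onto int8 using a scale and zero-point."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max != x_min else 1.0
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Approximately recover float32 values from the int8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(q, scale, zp)).max()
print("max reconstruction error:", error)  # small relative to the value range
</code></pre>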

<h2>Benefits of Low-Precision Formats</h2>
<p>Using low-precision formats presents several advantages in performance. Many processors are equipped with high-throughput mathematical pipelines tailored for low-bit formats, expediting computation-intensive tasks like convolutions and matrix multiplications. Additionally, reduced word sizes mitigate memory bandwidth issues, leading to enhanced performance in bandwidth-limited scenarios. Smaller word sizes also lessen memory requirements, improving cache utilization and overall memory system efficiency.</p>

<h2>Int8 Quantization Achievements</h2>
<p>Research on int8 quantization has achieved model accuracy within 1% of the baseline floating-point networks, which is particularly impressive for models that are typically challenging to quantize, including MobileNets and BERT-large. Furthermore, vector quantization plays a significant role in effectively compressing deep convolutional networks. Convolutional neural networks (CNNs) designed for object classification typically contain many layers with large numbers of parameters and are often heavily over-parameterized. This line of work focuses on compressing these parameters while retaining high accuracy, with an emphasis on vector quantization methods for the densely connected layers.</p>
<p>This includes techniques such as parameter binarization, scalar quantization through k-means clustering, and structured quantization using product quantization or residual quantization, all of which contribute to notable improvements in performance.</p>
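<p>As a concrete illustration of scalar quantization via k-means clustering (a simplified sketch under assumed settings, not the exact procedure from the paper), the snippet below clusters the weights of a dense layer into a small codebook and stores only the centroid values plus one small index per weight.</p>
<pre><code>
# Scalar quantization of a weight matrix via k-means (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

weights = np.random.randn(256, 512).astype(np.float32)  # an example dense layer

k = 16  # 16 centroids: each weight index fits in 4 bits
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(weights.reshape(-1, 1))
codebook = km.cluster_centers_.flatten()          # k float values to store
indices = km.labels_.reshape(weights.shape)       # one small integer per weight

reconstructed = codebook[indices]                 # dequantized weights
print("mean absolute error:", np.abs(weights - reconstructed).mean())
</code></pre>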

<h2>Pre-trained Language Models</h2>
<p>Pre-trained Language Models (PLMs) are trained on extensive text data prior to fine-tuning for specific downstream tasks. Leveraging the Transformer architecture and self-attention mechanisms, PLMs significantly enhance performance in semantic language processing. BERT (Bidirectional Encoder Representations from Transformers) serves as a prime example of a PLM, pre-trained on a large text corpus in an unsupervised manner using masked language modeling and next sentence prediction, followed by fine-tuning for various tasks such as question answering, sentiment analysis, and text classification. BERT has become one of the leading PLMs due to its efficiency, scalability, and outstanding performance across multiple applications.</p>

<h2>Large Language Models</h2>
<p>Large Language Models (LLMs) build upon the foundational research of PLMs, extending their capabilities in natural language processing. Scaling up both model size and the volume of training data has proven beneficial for performance on downstream tasks. For instance, GPT-3, a significantly larger PLM with 175 billion parameters, shares a similar architecture and pre-training objectives with standard PLMs. GPT-3's advantage is not only its scale: it also exhibits remarkable capabilities on complex tasks, particularly in creative writing, where it can produce coherent poetry, song lyrics, or fiction aligned with a specified style or theme. In contrast, BERT, with 340 million parameters, is not optimized for creative writing but is adept at completing sentences and predicting missing words.</p>

<h2>Resource Intensity and Quantization Benefits</h2>
<p>The extensive data processing and distributed parallel training that LLMs require, together with the repeated experiments needed to explore different training strategies, make them resource-intensive and costly to develop. Quantization techniques can alleviate these costs by reducing computational and resource demands. By lowering the bit count of each model weight, quantization significantly shrinks the overall model size, yielding LLMs that require less memory, occupy less storage space, are more energy-efficient, and deliver faster inference. These advantages allow LLMs to run on a broader spectrum of devices, including embedded systems and single-GPU configurations.</p>
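<p>A back-of-the-envelope estimate shows why reducing the bit count matters for deployment. The sketch below assumes that weights dominate memory and ignores activations, the KV cache, and quantization metadata; it estimates the weight storage of a 7-billion-parameter model at several precisions.</p>
<pre><code>
# Rough weight-memory estimate for a 7B-parameter LLM at different precisions.
# Ignores activations, KV cache, and per-group quantization metadata.
params = 7e9

for name, bits in [("float32", 32), ("float16/bfloat16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / (1024 ** 3)
    print(f"{name:18}: {gib:5.1f} GiB")
</code></pre>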

<h2>Quantization Techniques</h2>
<p>Two prominent quantization methods employed in LLMs are Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ reduces the size and computational demands of a machine learning model after training, affecting only inference. Research initiatives such as SmoothQuant have introduced PTQ solutions aimed at reducing hardware expenses and democratizing access to LLMs by enabling 8-bit weight and activation quantization (W8A8). SmoothQuant tackles activation outliers by migrating the quantization difficulty from activations to weights offline, through a mathematically equivalent transformation, exploiting the fact that weights are generally easier to quantize than activations.</p>
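<p>The essence of that transformation can be sketched as follows (a simplified illustration of the published idea, with assumed shapes and smoothing strength, not the authors' implementation): each input channel of the activations is divided by a per-channel factor and the matching weight row is multiplied by the same factor, so the matrix product is unchanged while activation outliers shrink.</p>
<pre><code>
# SmoothQuant-style smoothing (simplified sketch of the published idea).
# Per input channel, activation outliers are scaled down and the weight
# absorbs the scale, keeping the matrix product mathematically unchanged.
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: (tokens, c_in) activations, W: (c_in, c_out) weights."""
    act_max = np.abs(X).max(axis=0)            # per-input-channel activation range
    w_max = np.abs(W).max(axis=1)              # per-input-channel weight range
    s = act_max ** alpha / w_max ** (1 - alpha)
    s = np.maximum(s, 1e-5)                    # avoid dividing by zero
    return X / s, W * s[:, None]               # X' W' equals X W

X = np.random.randn(8, 64)
X[:, 3] *= 50.0                                # inject an activation outlier channel
W = np.random.randn(64, 32)
Xs, Ws = smooth(X, W)
print(np.allclose(X @ W, Xs @ Ws))             # True: the output is preserved
print(np.abs(X).max(), np.abs(Xs).max())       # the outlier magnitude is reduced
</code></pre>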
<p>Another research approach, QuIP, applies PTQ to LLMs, enhancing existing quantization algorithms to yield viable results using just two bits per weight. Conversely, QAT focuses on optimizing models for efficient inference by simulating quantization effects during training. Unlike PTQ, QAT integrates the weight conversion process throughout the training phase.</p>
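<p>A minimal sketch of how QAT simulates quantization during training is shown below (a generic PyTorch illustration under assumed settings, not the procedure of any specific paper mentioned here): the forward pass rounds weights to a low-precision grid, while a straight-through estimator lets gradients flow past the rounding so the full-precision weights continue to be updated.</p>
<pre><code>
# Fake quantization with a straight-through estimator (generic QAT sketch).
import torch
import torch.nn as nn

def fake_quantize(w, num_bits=8):
    """Round weights to a symmetric low-precision grid in the forward pass,
    but pass gradients straight through the rounding in the backward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = w.abs().max() / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()   # straight-through estimator

class QATLinear(nn.Module):
    def __init__(self, in_features, out_features, num_bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.num_bits = num_bits

    def forward(self, x):
        w_q = fake_quantize(self.weight, self.num_bits)  # quantization is simulated
        return nn.functional.linear(x, w_q, self.bias)   # full-precision params still train

layer = QATLinear(16, 4)
loss = layer(torch.randn(2, 16)).pow(2).mean()
loss.backward()                     # gradients reach layer.weight despite the rounding
print(layer.weight.grad is not None)
</code></pre>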
<p>Research like Degree-Quant improves the inference times of Graph Neural Networks by leveraging QAT, allowing the use of low-precision integer arithmetic during inference. Models trained with Degree-Quant for INT8 quantization frequently perform comparably to FP32 models, while INT4 models may achieve up to a 26% improvement over baseline models. Additionally, EfficientQAT proposes a practical QAT algorithm that reduces memory consumption throughout LLM training. EfficientQAT employs a two-step approach: block-wise training of all parameters (Block-AP) and end-to-end training of quantization parameters (E2E-QP), minimizing accuracy loss in low-bit scenarios. This method has outperformed previous quantization strategies across a range of models, varying from 7B to 70B parameters at different quantization bit levels.</p>
</body>
</html>
