Merge pull request #274 from YangZhou1997/yang-typo-format-fix
a batch of typo and format fixes
profvjreddi authored Jun 15, 2024
2 parents c41698d + 6707f17 commit 99c8ef8
Showing 6 changed files with 45 additions and 19 deletions.
2 changes: 1 addition & 1 deletion contents/benchmarking/benchmarking.qmd
@@ -267,7 +267,7 @@ The following metrics are often considered important:

4. **Memory Consumption:** The amount of memory the training process uses. Memory consumption can be a limiting factor for training large models or datasets. For example, Google researchers faced significant memory consumption challenges when training BERT. The model has hundreds of millions of parameters, requiring large amounts of memory. The researchers had to develop techniques to reduce memory consumption, such as gradient checkpointing and model parallelism.

5. ** Energy Consumption: ** The energy consumed during training. As machine learning models become more complex, energy consumption has become an important consideration. Training large machine learning models can consume significant energy, leading to a large carbon footprint. For instance, the training of OpenAI's GPT-3 was estimated to have a carbon footprint equivalent to traveling by car for 700,000 kilometers.
5. **Energy Consumption:** The energy consumed during training. As machine learning models become more complex, energy consumption has become an important consideration. Training large machine learning models can consume significant energy, leading to a large carbon footprint. For instance, the training of OpenAI's GPT-3 was estimated to have a carbon footprint equivalent to traveling by car for 700,000 kilometers.

6. **Throughput:** The number of training samples processed per unit time. Higher throughput generally indicates a more efficient training process. The throughput is an important metric to consider when training a recommendation system for an e-commerce platform. A high throughput ensures that the model can process large volumes of user interaction data promptly, which is crucial for maintaining the relevance and accuracy of the recommendations. But it's also important to understand how to balance throughput with latency bounds. Therefore, a latency-bounded throughput constraint is often imposed on service-level agreements for data center application deployments.
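
To make the latency-bounded throughput idea concrete, here is a toy calculation; the batch sizes, per-batch latencies, and the 15 ms bound are assumed values for illustration, not measurements from any real deployment.

```python
# Toy illustration of latency-bounded throughput: batch sizes and per-batch
# latencies below are assumed values, not measurements from a real system.
latency_bound_ms = 15.0  # e.g., an SLA bound on per-batch latency

# (batch_size, measured_latency_ms) pairs for a hypothetical model server.
configs = [(8, 4.0), (32, 9.0), (128, 22.0), (256, 40.0)]

best = None
for batch, latency in configs:
    throughput = batch / (latency / 1000.0)  # samples per second
    ok = latency <= latency_bound_ms
    if ok and (best is None or throughput > best[1]):
        best = (batch, throughput)
    print(f"batch={batch:4d}  latency={latency:5.1f} ms  "
          f"throughput={throughput:8.0f}/s  within bound: {ok}")

print("Best latency-bounded configuration:", best)  # batch=32 here
```

Larger batches raise raw throughput, but once the per-batch latency exceeds the bound, that configuration no longer satisfies the service-level agreement, so the highest-throughput compliant batch size wins.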

4 changes: 2 additions & 2 deletions contents/dl_primer/dl_primer.qmd
@@ -220,13 +220,13 @@ To briefly highlight the differences, @tbl-mlvsdl illustrates the contrasting ch
#### Complexity of the Problem

* **Problem Granularity:** Problems that are simple to moderately complex, which may involve linear or polynomial relationships between variables, often find a better fit with traditional machine learning methods.
  

* **Hierarchical Feature Representation:** Deep learning models excel at tasks that require hierarchical feature representation, such as image and speech recognition. However, not all problems require this complexity, and traditional machine learning algorithms may sometimes offer simpler and equally effective solutions.

#### Hardware and Computational Resources

* **Resource Constraints:** The availability of computational resources often influences the choice between traditional ML and deep learning. The former is generally less resource-intensive and thus preferable in environments with hardware limitations or budget constraints.
  

* **Scalability and Speed:** Traditional machine learning algorithms, like support vector machines (SVM), often allow for faster training times and easier scalability, which is particularly beneficial in projects with tight timelines and growing data volumes.

#### Regulatory Compliance
20 changes: 15 additions & 5 deletions contents/frameworks/frameworks.qmd
@@ -302,7 +302,19 @@ Computational graphs can only be as good as the data they learn from and work on

#### Data Loaders

These pipelines' cores are data loaders, which handle reading examples from storage formats like CSV files or image folders. Reading training examples from sources like files, databases, object storage, etc., is the job of the data loaders. Deep learning models require diverse data formats depending on the application. Among the popular formats is CSV, a versatile, simple format often used for tabular data. TFRecord: TensorFlow's proprietary format, optimized for performance. Parquet: Columnar storage, offering efficient data compression and retrieval. JPEG/PNG: Commonly used for image data. WAV/MP3: Prevalent formats for audio data. For instance, `tf.data` is TensorFlows's dataloading pipeline: <https://www.tensorflow.org/guide/data>.
At the core of these pipelines are data loaders, which handle reading training examples from sources such as CSV files, image folders, databases, and object storage. Deep learning models require diverse data formats depending on the application. Popular formats include:

* CSV, a versatile, simple format often used for tabular data.

* TFRecord: TensorFlow's proprietary format, optimized for performance.

* Parquet: Columnar storage, offering efficient data compression and retrieval.

* JPEG/PNG: Commonly used for image data.

* WAV/MP3: Prevalent formats for audio data.

For instance, `tf.data` is TensorFlow's data loading pipeline: <https://www.tensorflow.org/guide/data>.

Data loaders batch examples to leverage vectorization support in hardware. Batching refers to grouping multiple data points for simultaneous processing, leveraging the vectorized computation capabilities of hardware like GPUs. While typical batch sizes range from 32 to 512 examples, the optimal size often depends on the data's memory footprint and the specific hardware constraints. Advanced loaders can stream virtually unlimited datasets from disk or cloud storage, reading data incrementally rather than loading it fully into memory.
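
As an illustration of such a pipeline, the sketch below uses `tf.data` to read JPEG images, decode and resize them, and then batch and prefetch them; the glob pattern, image size, and batch size are placeholder assumptions rather than values taken from the text above.

```python
import tensorflow as tf

# Hypothetical directory of JPEG images; the path and sizes are placeholders.
files = tf.data.Dataset.list_files("images/*.jpg", shuffle=True)

def load_image(path):
    raw = tf.io.read_file(path)               # read bytes from disk
    img = tf.io.decode_jpeg(raw, channels=3)  # decode JPEG into a tensor
    img = tf.image.resize(img, [224, 224]) / 255.0
    return img

dataset = (
    files.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)  # parallel decode
         .batch(32)                    # group examples for vectorized hardware
         .prefetch(tf.data.AUTOTUNE)   # overlap I/O with training compute
)

for batch in dataset.take(1):
    print(batch.shape)  # (32, 224, 224, 3)
```

`AUTOTUNE` lets the runtime pick the degree of parallelism, and `prefetch` overlaps data preparation with model execution so the accelerator is not left idle waiting on I/O.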

@@ -418,16 +430,14 @@ Today, NVIDIA GPUs dominate training, aided by software libraries like [CUDA](ht

As machine learning models have grown larger over the years, it has become essential for large models to utilize multiple computing nodes in the training process. This process, known as distributed learning, enables training at a much larger scale but also introduces implementation challenges.

We can consider three different ways to spread the work of training machine learning models to multiple computing nodes. Input data partitioning refers to multiple processors running the same model on different input partitions. This is the easiest implementation and is available for many machine learning frameworks. The more challenging distribution of work comes with model parallelism, which refers to multiple computing nodes working on different parts of the model, and pipelined model parallelism, which refers to multiple computing nodes working on different layers of the model on the same input. The latter two mentioned here are active research areas.
We can consider three different ways to spread the work of training machine learning models to multiple computing nodes. Input data partitioning (or data parallelism) refers to multiple processors running the same model on different input partitions. This is the easiest implementation and is available for many machine learning frameworks. The more challenging distribution of work comes with model parallelism, which refers to multiple computing nodes working on different parts of the model, and pipelined model parallelism, which refers to multiple computing nodes working on different layers of the model on the same input. The latter two mentioned here are active research areas.

ML frameworks that support distributed learning include TensorFlow (through its [tf.distribute](https://www.tensorflow.org/api_docs/python/tf/distribute) module), PyTorch (through its [torch.nn.DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html) and [torch.nn.DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) modules), and MXNet (through its [gluon](https://mxnet.apache.org/versions/1.9.1/api/python/docs/api/gluon/index.html) API).
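
As a minimal sketch of the data-parallel case, the example below uses TensorFlow's `tf.distribute.MirroredStrategy` to replicate a small Keras model across the GPUs of a single machine; the model architecture and batch size are illustrative assumptions.

```python
import tensorflow as tf

# Data parallelism: each device holds a replica of the model and
# processes a slice of every input batch.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Variables created inside the strategy scope are mirrored across devices.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Gradients from each replica are averaged every step before updating weights.
(x, y), _ = tf.keras.datasets.mnist.load_data()
x = x.reshape(-1, 784).astype("float32") / 255.0
model.fit(x, y, batch_size=256, epochs=1)
```

Each replica sees a different shard of the batch, and the averaged gradients keep the mirrored variables in sync. Model parallelism and pipelined model parallelism require partitioning the network itself and are not shown here.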

### Model Conversion

Machine learning models have various methods to be represented and used within different frameworks and for different device types. For example, a model can be converted to be compatible with inference frameworks within the mobile device. The default format for TensorFlow models is checkpoint files containing weights and architectures, which are needed to retrain the models. However, models are typically converted to TensorFlow Lite format for mobile deployment. TensorFlow Lite uses a compact flat buffer representation and optimizations for fast inference on mobile hardware, discarding all the unnecessary baggage associated with training metadata, such as checkpoint file structures.

The default format for TensorFlow models is checkpoint files containing weights and architectures. For mobile deployment, models are typically converted to TensorFlow Lite format. TensorFlow Lite uses a compact flat buffer representation and optimizations for fast inference on mobile hardware.

Model optimizations like quantization (see [Optimizations](../optimizations/optimizations.qmd) chapter) can further optimize models for target architectures like mobile. This reduces the precision of weights and activations to `uint8` or `int8` for a smaller footprint and faster execution with supported hardware accelerators. For post-training quantization, TensorFlow's converter handles analysis and conversion automatically.
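
A minimal sketch of that post-training quantization flow is shown below, assuming a hypothetical SavedModel directory and randomly generated calibration inputs; a real workflow would feed representative samples from the training distribution.

```python
import numpy as np
import tensorflow as tf

# Hypothetical SavedModel directory; replace with a real exported model path.
converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/")

# Enable post-training quantization.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# The representative dataset lets the converter calibrate activation ranges
# so both weights and activations can be quantized to int8. Random inputs
# are used here purely as a stand-in for real samples.
def representative_data_gen():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```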

Frameworks like TensorFlow simplify deploying trained models to mobile and embedded IoT devices through easy conversion APIs for TFLite format and quantization. Ready-to-use conversion enables high-performance inference on mobile without a manual optimization burden. Besides TFLite, other common targets include TensorFlow.js for web deployment, TensorFlow Serving for cloud services, and TensorFlow Hub for transfer learning. TensorFlow's conversion utilities handle these scenarios to streamline end-to-end workflows.
@@ -462,7 +472,7 @@ Additional challenges are associated with federated learning. The number of mobi

The heterogeneity of device resources is another hurdle. Devices participating in federated learning can have varying computational power and memory capacity. This diversity makes it challenging to design algorithms that run efficiently across all devices. Privacy and security are also not guaranteed: techniques such as gradient inversion attacks can extract information about the training data from the shared model parameters. Despite these challenges, the many potential benefits continue to make federated learning a popular research area. Open-source frameworks such as [Flower](https://flower.dev/) have been developed to simplify implementing federated learning with various machine learning frameworks.

@fig-federated-learning illustrates an example of federated learning. Consider a model used for medical predictions by diffrent hospitals. Given that medical data is extremely sensitive and must be kept private, it can't be transferred to a centralized server for training. Instead, each hospital would firen-tune/train the base model using its own private data, while only communicating non-sensitive information with the Federated Server, such as the learned parameters.
@fig-federated-learning illustrates an example of federated learning. Consider a model used for medical predictions by different hospitals. Given that medical data is extremely sensitive and must be kept private, it can't be transferred to a centralized server for training. Instead, each hospital would fine-tune/train the base model using its own private data, while only communicating non-sensitive information with the Federated Server, such as the learned parameters.

![A centralized-server approach to federated learning. Credit: [NVIDIA.](https://blogs.nvidia.com/blog/what-is-federated-learning/)](images/png/federated_learning.png){#fig-federated-learning}
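
The sketch below illustrates the server-side aggregation step in this setup with a toy federated-averaging (FedAvg-style) function; the layer shapes and per-hospital dataset sizes are invented for illustration, and frameworks like Flower wrap this parameter exchange in client/server APIs.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg-style aggregation: average each parameter tensor across clients,
    weighted by how many local examples each client trained on."""
    total = sum(client_sizes)
    num_layers = len(client_weights[0])
    averaged = []
    for layer in range(num_layers):
        weighted = np.stack([
            w[layer] * (n / total)
            for w, n in zip(client_weights, client_sizes)
        ])
        averaged.append(weighted.sum(axis=0))
    return averaged

# Three hypothetical "hospitals" that each fine-tuned the model locally.
hospital_updates = [
    [np.random.randn(4, 2), np.random.randn(2)] for _ in range(3)
]
local_dataset_sizes = [1200, 800, 2000]  # assumed patient-record counts

global_weights = federated_average(hospital_updates, local_dataset_sizes)
print([w.shape for w in global_weights])  # [(4, 2), (2,)]
```

Only the learned parameters cross the network; the raw patient data never leaves each hospital.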

16 changes: 16 additions & 0 deletions contents/optimizations/optimizations.bib
@@ -1,5 +1,21 @@
%comment{This file was created with betterbib v5.0.11.}
@inproceedings{yao2021hawq,
title={Hawq-v3: Dyadic neural network quantization},
author={Yao, Zhewei and Dong, Zhen and Zheng, Zhangcheng and Gholami, Amir and Yu, Jiali and Tan, Eric and Wang, Leyuan and Huang, Qijing and Wang, Yida and Mahoney, Michael and others},
booktitle={International Conference on Machine Learning},
pages={11875--11886},
year={2021},
organization={PMLR}
}

@inproceedings{jacob2018quantization,
title={Quantization and training of neural networks for efficient integer-arithmetic-only inference},
author={Jacob, Benoit and Kligys, Skirmantas and Chen, Bo and Zhu, Menglong and Tang, Matthew and Howard, Andrew and Adam, Hartwig and Kalenichenko, Dmitry},
booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
pages={2704--2713},
year={2018}
}

@inproceedings{benmeziane2021hardwareaware,
author = {Benmeziane, Hadjer and El Maghraoui, Kaoutar and Ouarnoughi, Hamza and Niar, Smail and Wistuba, Martin and Wang, Naigang},
4 changes: 2 additions & 2 deletions contents/optimizations/optimizations.qmd
@@ -85,7 +85,7 @@ With **channel** pruning, which is predominantly applied in convolutional neural

Finally, **layer** pruning takes a more aggressive approach by removing entire layers of the network. This significantly reduces the network's depth and thereby its capacity to model complex patterns and hierarchies in the data. This approach necessitates a careful balance to ensure that the model's predictive capability is not unduly compromised.

@fig-channel-layer-pruning demonstrates the difference between channel/filter wise pruning and layer pruning. When we prune a channel, we have to reconfigure the model's architecture in order to adapt to the structural changes. One adjustment is changing the number of input channels in the subsequent layer (here, the third and deepest layer): changing the depths of the filters that are applied to the layer with the pruned channel. On the other hand, pruning an entire layer (removing all the channels in the layer) requires more drastic adjustements. The main one involves modifying the connections between the remaining layers to replace or bypass the pruned layer. In our case, we reconfigured had to connect the first and last layers. In all pruning cases, we have to fine-tune the new structure to adjust the weights.
@fig-channel-layer-pruning demonstrates the difference between channel/filter-wise pruning and layer pruning. When we prune a channel, we have to reconfigure the model's architecture in order to adapt to the structural changes. One adjustment is changing the number of input channels in the subsequent layer (here, the third and deepest layer): changing the depths of the filters that are applied to the layer with the pruned channel. On the other hand, pruning an entire layer (removing all the channels in the layer) requires more drastic adjustments. The main one involves modifying the connections between the remaining layers to replace or bypass the pruned layer. In our case, we reconfigure to connect the first and last layers. In all pruning cases, we have to fine-tune the new structure to adjust the weights.

![Channel vs layer pruning.](images/jpg/modeloptimization_channel_layer_pruning.jpeg){#fig-channel-layer-pruning}
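
As a small illustration of structured channel pruning, the sketch below uses PyTorch's `torch.nn.utils.prune` to zero out half the filters of a convolutional layer; the layer sizes and pruning amount are arbitrary assumptions.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

# Small convolutional stack; shapes are illustrative only.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
)

# Structured pruning along dim=0 zeroes out entire output channels (filters)
# of the first convolution, ranked by their L2 norm.
prune.ln_structured(model[0], name="weight", amount=0.5, n=2, dim=0)

# The mask is stored alongside the original weights; prune.remove makes the
# zeroed channels permanent.
prune.remove(model[0], "weight")

zeroed = (model[0].weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{zeroed} of 16 filters zeroed out")
```

Note that this zeroes the pruned filters in place rather than shrinking the tensors; actually removing the channels, and adjusting the input channels of the subsequent layer as described above, requires rebuilding the architecture before fine-tuning.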

@@ -633,7 +633,7 @@ Of these, channelwise quantization is the current standard used for quantizing c

After determining the type and granularity of the clipping range, practitioners must decide when ranges are determined in their range calibration algorithms. There are two approaches to quantizing activations: static quantization and dynamic quantization.

Static quantization is the most frequently used approach. In this, the clipping range is pre-calculated and static during inference. It does not add any computational overhead, but, consequently, results in lower accuracy as compared to dynamic quantization. A popular method of implementing this is to run a series of calibration inputs to compute the typical range of activations [Quantization and training of neural networks for efficient integer-arithmetic-only inference, Dyadic neural network quantization].
Static quantization is the most frequently used approach. In this, the clipping range is pre-calculated and static during inference. It does not add any computational overhead, but, consequently, results in lower accuracy as compared to dynamic quantization. A popular method of implementing this is to run a series of calibration inputs to compute the typical range of activations [@jacob2018quantization; @yao2021hawq].

Dynamic quantization is an alternative approach which dynamically calculates the range for each activation map during runtime. The approach requires real-time computations which might have a very high overhead. By doing this, dynamic quantization often achieves the highest accuracy as the range is calculated specifically for each input.
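
The contrast between the two can be sketched in a few lines of NumPy: static quantization computes the clipping range once from calibration data, while dynamic quantization recomputes it per input at runtime. The activation values below are random placeholders used only for illustration.

```python
import numpy as np

def calibrate_range(calibration_batches):
    """Static calibration: sweep calibration inputs once, ahead of time,
    to fix a clipping range for the activations."""
    lo = min(batch.min() for batch in calibration_batches)
    hi = max(batch.max() for batch in calibration_batches)
    return float(lo), float(hi)

def affine_quantize(x, lo, hi, bits=8):
    """Map floats in [lo, hi] to unsigned integers via a scale and zero point."""
    qmax = 2 ** bits - 1
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax)
    return q.astype(np.uint8), scale, zero_point

# Placeholder activation maps standing in for calibration inputs.
calib = [np.random.randn(32, 64).astype(np.float32) for _ in range(10)]
lo, hi = calibrate_range(calib)               # static: computed once, reused
x = np.random.randn(32, 64).astype(np.float32)
q_static, scale, zp = affine_quantize(x, lo, hi)

# Dynamic quantization instead recomputes the range per input at runtime,
# trading extra computation for a tighter, input-specific range.
q_dynamic, _, _ = affine_quantize(x, x.min(), x.max())
print(scale, zp, q_static.dtype)
```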
