
add README.md #155

Merged (16 commits) on Apr 25, 2024
Conversation

@jcaip (Contributor) commented Apr 22, 2024

add README.md to sparsity folder

@facebook-github-bot added the CLA Signed label on Apr 22, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed).
@jcaip (Contributor Author) commented Apr 22, 2024

adding images here for hosting lol:
[Three screenshots taken Apr 22, 2024, attached for image hosting.]


# Design

Pruning, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.
Contributor:

Should we rename the folder to pruning?

Contributor Author:

I think sparsity is the more widely used term, so let's keep it called that. But I'll change Pruning -> Sparsity where it makes sense in the README

@cpuhrsch cpuhrsch requested a review from msaroufim April 22, 2024 22:40
@@ -0,0 +1,664 @@
# torchao sparsity

Sparsity is the technique of removing parameters from a neural network in order to reduce its memory overhead or latency. By carefully choosing the elements that are removed, one can achieve significant reductions in memory overhead and latency, while paying a reasonably low or no price in terms of model quality (accuracy / F1).
Contributor:

I'd call pruning the "technique of removing parameters from a neural network in order to reduce its memory overhead or latency"
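To make that distinction concrete, here is a minimal sketch in plain PyTorch (the tensor shapes are arbitrary): zeroing elements by magnitude is the pruning step, and storing the result in a sparse layout is a separate acceleration step.

```python
import torch

w = torch.randn(64, 64)

# Pruning: zero out the ~50% of elements with the smallest magnitude.
mask = w.abs() > w.abs().median()
w_pruned = w * mask

sparsity_level = (w_pruned == 0).float().mean()  # roughly 0.5

# Accelerating is a separate step: e.g. store the pruned weight in a sparse
# layout so downstream kernels can skip the zeros.
w_csr = w_pruned.to_sparse_csr()
```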


Sparsity, like quantization, is an accuracy/performance trade-off, where we care not only about the speedup but also about the accuracy degradation of our architecture optimization technique.

In quantization, the theoretical performance gain is generally determined by the data type that we are quantizing to: quantizing from float32 to float16 yields a theoretical 2x speedup. For pruning/sparsity, the analogous variable is the sparsity level / sparsity pattern. For semi-structured sparsity, the sparsity level is fixed at 50%, so we expect a theoretical 2x improvement. For block-sparse matrices and unstructured sparsity, the speedup is variable and depends on the sparsity level of the tensor.
Contributor:

nit: It's roughly a theoretical speedup of 2x. Or put differently, 2x is a very basic estimate just because of the reduced amount of memory that needs to be processed. In practice it can vary quite a bit. It could even be a lot more, because it allows you to use faster caches, etc.
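To ground the semi-structured case, a minimal sketch loosely based on the `torch.sparse.to_sparse_semi_structured` example in the PyTorch docs; it assumes a recent PyTorch build and an NVIDIA GPU with sparse tensor core support, and the 2x figure is a theoretical ceiling rather than a guarantee.

```python
import torch
from torch.sparse import to_sparse_semi_structured

# A weight with a 2:4 pattern: 2 nonzeros in every group of 4 elements.
A = torch.Tensor([0, 0, 1, 1]).tile((128, 32)).half().cuda()
B = torch.rand(128, 128).half().cuda()

A_sparse = to_sparse_semi_structured(A)

# The compressed tensor stores roughly half the elements plus metadata,
# which is where the ~2x theoretical speedup comes from.
dense_out = torch.mm(A, B)
sparse_out = torch.mm(A_sparse, B)
print(torch.allclose(dense_out, sparse_out, rtol=1e-2, atol=1e-2))
```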



One key difference between sparsity and quantization is in how the accuracy degradation is determined: the accuracy degradation of quantization is determined by the scale and zero_point chosen, whereas in pruning the accuracy degradation is determined by the mask. By carefully choosing the specified elements and retraining the network, pruning can achieve negligible accuracy degradation and in some cases even provide a slight accuracy gain. This is an active area of research with no settled consensus. We expect users will have a target sparsity pattern in mind and will prune to that pattern.
Contributor:

This is a bit biased towards affine quantization and sparsity-aware training specifically for matrix multiplication. There are many other variables that influence accuracy degradation, for example the operation used and the distribution of input values.

The measure, model quality, is the same between sparsity and quantization. Some of the mitigation techniques are the same too (e.g. quantization- or sparsity-aware training). Where it differs, I'd say, is that sparsity explicitly relies on approximating a sum of numbers (hence the focus on zero), whereas in quantization you avoid allocating bits for unused numerical ranges / unnecessary numerical fidelity.

Contributor Author:

I'll add some more context to the end of this section, but for this and the comment above, I want to keep this as newbie-friendly as possible, so I think it's okay to have a relatively flawed / forceful analogy to make a point.

I think explaining things in the most faithful way introduces a lot of jargon, which is kind of overwhelming.
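To illustrate the contrast drawn above (scale/zero_point vs. mask as the source of error) without extra jargon, a minimal sketch that is not taken from the README:

```python
import torch

w = torch.randn(64, 64)

# Quantization: error comes from mapping to int8 via a scale (zero_point = 0 here).
scale = w.abs().max() / 127
w_deq = torch.clamp((w / scale).round(), -128, 127) * scale
quant_err = (w - w_deq).abs().mean()

# Pruning: error comes from the mask (here, dropping the ~50% smallest magnitudes).
mask = w.abs() > w.abs().median()
prune_err = (w - w * mask).abs().mean()

print(f"quantization error: {quant_err:.4f}  pruning error: {prune_err:.4f}")
```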


Given a target sparsity pattern, pruning a model can then be thought of as two separate subproblems:

* How can I find a set of sparse weights that satisfy my target sparsity pattern while minimizing the accuracy degradation of my model?
Contributor:

Right, so this first part is what I'd call pruning.

* How can I accelerate my sparse weights for inference and reduced memory overhead?
Contributor:

And then sparsity can be the task of accelerating pruned weights. It's not always necessary to use a sparse layout or sparse kernel. Sometimes you can prune in ways that obviate these specialized techniques. For example, you can just skip an entire layer.
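A sketch of that last point, with hypothetical layer sizes: structured pruning can drop whole output channels and rebuild a smaller dense layer, so no sparse layout or kernel is involved.

```python
import torch
from torch import nn

linear = nn.Linear(128, 64)

# Keep the 32 output channels whose weight rows have the largest L2 norm.
keep = linear.weight.norm(dim=1).topk(32).indices.sort().values

# Rebuild a smaller dense layer from the kept rows: plain dense compute, just less of it.
pruned = nn.Linear(128, 32)
with torch.no_grad():
    pruned.weight.copy_(linear.weight[keep])
    pruned.bias.copy_(linear.bias[keep])
```

In a real model, whatever consumes this layer's output would also need its input dimension shrunk to match.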

@msaroufim (Member) left a comment:

Really enjoyed reading this. It is missing code samples, but I believe what you intended to write was closer to a survey of sparsity and the parameter space a library should occupy, in which case I believe this does the job well.


FakeSparsity is a parameterization that simulates unstructured sparsity, where each element has its own mask entry. Because of this, we can use it to simulate any sparsity pattern we want.

The user will then train the prepared model using their own custom code, calling `.step()` to update the mask if necessary. Once they’ve found a suitable mask, they call `squash_mask()` to fuse the mask into the weights, creating a dense tensor with 0s in the right spots.
Member:

Not sure I follow this line. What's `.step()`? Also, this seems to indicate that people need to change their training code; if so, how?

Is this line also necessary for people only interested in accelerated inference?

Contributor Author:

I updated with a code sample; that should make this a bit easier to follow.
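For readers following along here rather than in the README, a rough sketch of the prepare → step → squash_mask flow described above; it assumes the `torch.ao.pruning.WeightNormSparsifier` API and a toy model, so see the merged README for the actual sample.

```python
import torch
from torch import nn
from torch.ao.pruning import WeightNormSparsifier

model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 10))

# Ask for a 2:4 (50%) semi-structured pattern on the first linear layer's weight.
sparsifier = WeightNormSparsifier(
    sparsity_level=1.0, sparse_block_shape=(1, 4), zeros_per_block=2
)
sparsifier.prepare(model, config=[{"tensor_fqn": "0.weight"}])

# ... normal training loop goes here; call step() to update the masks ...
sparsifier.step()

# Fuse the masks into the weights: a dense tensor with zeros in the right spots.
sparsifier.squash_mask()
```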

@msaroufim (Member) commented:

So we have a docs page now https://github.com/pytorch/ao/blob/jcaip/sparsity-readme/docs/source/sparsity.rst
pytorch.org/ao

@jcaip (Contributor Author) commented Apr 25, 2024

So we have a docs page now https://github.com/pytorch/ao/blob/jcaip/sparsity-readme/docs/source/sparsity.rst pytorch.org/ao

I think this is conceptually the right long-term home for most of the stuff in the README, but I feel like this information will get lost right now versus putting it in the README.

@msaroufim msaroufim self-requested a review April 25, 2024 17:49
@msaroufim msaroufim merged commit 639432b into main Apr 25, 2024
13 checks passed
@msaroufim msaroufim deleted the jcaip/sparsity-readme branch April 25, 2024 18:14
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* added readme

* update

* add README

* update

* fix images

* update

* cleaned up

* fix

* fix formatting

* update

* update readme

* fix images

* updated README again

* update

---------