
Optimize the cache fetch for forward split, pt. 1 (#2187) #2218

Closed

q10 wants to merge 1 commit into pytorch:main from q10:export-D51865590

Contributor

q10 commented Dec 14, 2023

Summary:

Rewrite the kernel to use the cache_hit_rate enum as a template argument. We first check whether the cache is empty and pass that value as a template argument. Inside the first kernel, we then determine the cache conflict miss rate and use this value as a template parameter when invoking the second kernel, which performs the actual lookup work.

We pass in uvm_cache_stats as a run-time argument here instead of passing the cache miss rate as a compile-time argument, because the uvm_cache_stats data is only available on the GPU. Invoking a templatized kernel with the cache miss rate as a template argument would require the cache miss information to first be passed back to the host, which is an expensive operation.
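
The two-stage dispatch described above can be sketched on the host in plain C++. This is only an illustration of the pattern, not the code in this PR: in the PR the classification happens inside the first GPU kernel, which reads uvm_cache_stats on the device, whereas here everything runs on the CPU. All names (`MissRate`, `CacheStats`, `Tables`, `classify`, `lookup`, `dispatch_lookup`) are hypothetical and chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical enum standing in for the miss-rate category used as a
// template argument. Not FBGEMM's actual identifier.
enum class MissRate { kZero, kMixed, kAll };

// Hypothetical stand-in for the statistics buffer (uvm_cache_stats in the PR).
struct CacheStats {
  int64_t num_lookups;
  int64_t num_conflict_misses;
};

// Toy storage: a small cache in front of a slower backing table.
struct Tables {
  std::vector<float> cache;    // fast storage
  std::vector<float> backing;  // slow storage, indexed by row
  std::vector<int32_t> slot;   // slot[row] = cache slot, or -1 if not cached
};

// Stage 1: turn run-time statistics into a category.
inline MissRate classify(const CacheStats& s) {
  assert(s.num_conflict_misses >= 0 && s.num_conflict_misses <= s.num_lookups);
  if (s.num_conflict_misses == 0) return MissRate::kZero;
  if (s.num_conflict_misses == s.num_lookups) return MissRate::kAll;
  return MissRate::kMixed;
}

// Stage 2: the lookup, specialized on the category. kRate is a compile-time
// constant, so the compiler can drop the branches that do not apply.
template <MissRate kRate>
float lookup(const Tables& t, int64_t row) {
  if (kRate == MissRate::kZero) {
    return t.cache[t.slot[row]];  // every row is cached
  }
  if (kRate == MissRate::kAll) {
    return t.backing[row];        // no row is cached
  }
  const int32_t s = t.slot[row];  // mixed: check per row
  return s >= 0 ? t.cache[s] : t.backing[row];
}

// Bridge from the run-time category to the compile-time specialization.
inline float dispatch_lookup(const CacheStats& stats, const Tables& t,
                             int64_t row) {
  switch (classify(stats)) {
    case MissRate::kZero:  return lookup<MissRate::kZero>(t, row);
    case MissRate::kAll:   return lookup<MissRate::kAll>(t, row);
    case MissRate::kMixed: return lookup<MissRate::kMixed>(t, row);
  }
  return lookup<MissRate::kMixed>(t, row);  // defensive fallback
}
```

The `switch` is the piece that would otherwise have to run on the host if the miss rate were a template argument of the outer kernel. Keeping the statistics as a run-time argument lets that branch run where the data already lives.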

This is based on the earlier work in stacks D48937380 and D49675672, which were based on very outdated branches of fbcode.

Differential Revision: D51865590

netlify bot commented Dec 14, 2023 (edited)

✅ Deploy Preview for pytorch-fbgemm-docs canceled.

Name	Link
🔨 Latest commit	`5e3adb6`
🔍 Latest deploy log	https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6581e9e6d172fc0008bc910e

facebook-github-bot added cla signed fb-exported labels

Contributor

facebook-github-bot commented Dec 14, 2023

This pull request was exported from Phabricator. Differential Revision: D51865590

q10 force-pushed the export-D51865590 branch from 384e08a to 2b0b81b Compare

December 18, 2023 19:58

q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

2b0b81b


Contributor

facebook-github-bot commented Dec 18, 2023

This pull request was exported from Phabricator. Differential Revision: D51865590


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

b2c138f


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

0c2f59f


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

9f4319d


q10 force-pushed the export-D51865590 branch from 2b0b81b to ee553e2 Compare

December 18, 2023 20:10

Contributor

facebook-github-bot commented Dec 18, 2023

This pull request was exported from Phabricator. Differential Revision: D51865590


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

5e3adb6


q10 force-pushed the export-D51865590 branch from ee553e2 to 5e3adb6 Compare

December 19, 2023 19:07

Contributor

facebook-github-bot commented Dec 19, 2023

This pull request was exported from Phabricator. Differential Revision: D51865590


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

936cd7c


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

70af3b7


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

be77499


Reviewed By: spcyppt

Differential Revision: D51865590

q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

b73c8a6


q10 added a commit to q10/FBGEMM that referenced this pull request


          Optimize the cache fetch for forward split, pt. 1 (pytorch#2218)

82bbc93


facebook-github-bot closed this in

7dd0c7f

facebook-github-bot added the Merged label

Contributor

facebook-github-bot commented Dec 27, 2023

This pull request has been merged in 7dd0c7f.


Labels

cla signed fb-exported Merged