Make the scratch pad tensor UVA #2844

Closed · wants to merge 4 commits

Conversation

@sryap (Contributor) commented Jul 14, 2024

Summary:
Before this diff, the scratch pad in SSD TBE (see D55998215 for more
detail) was a CPU tensor that was later transferred to the GPU so that
the TBE kernels could access it. The scratch pad transfer was highly
inefficient because TBE over-provisioned the scratch pad buffer
allocation (it did not know the exact number of cache-missed rows),
causing extra data transfer. This extra transfer could be large, since
the number of cache-missed rows was normally much smaller than the
over-provisioned size.

There are two ways to avoid the extra data transfer:

(1) Let TBE know the exact number of cache-missed rows on the host.
This requires a device-to-host data transfer, which introduces a sync
point between host and device (not desirable in most training jobs).
However, it allows TBE to use cudaMemcpy, which utilizes the DMA
engine and lets the memory copy overlap efficiently with other compute
kernels (a minimal sketch follows this list).

(2) Make the scratch pad accessible by both CPU and GPU; in other
words, make the scratch pad a UVA tensor. This does not require
device-host synchronization. However, the memory copy has to be done
through CUDA loads/stores, which requires a kernel running on SMs.
Thus, overlapping the memory copy with compute kernels requires
careful SM management.
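
For illustration only (not code from this diff), a minimal sketch of option (1) with hypothetical names (`scratch_pad_host`, `num_misses_dev`, etc.): the exact miss count is read back to the host, which is the sync point, but the copy can then be sized exactly and issued with cudaMemcpyAsync on the copy (DMA) engine.

```cpp
// Hypothetical sketch of option (1); identifiers are illustrative,
// not actual FBGEMM names. Assumes scratch_pad_host is pinned memory.
#include <cuda_runtime.h>
#include <cstdint>

void copy_scratch_pad_exact(
    float* scratch_pad_dev,         // destination buffer on the GPU
    const float* scratch_pad_host,  // over-provisioned host buffer
    const int64_t* num_misses_dev,  // exact cache-miss count, lives on the GPU
    int64_t row_bytes,
    cudaStream_t stream) {
  // Read the exact count back to the host: this is the host-device sync point.
  int64_t num_misses = 0;
  cudaMemcpyAsync(&num_misses, num_misses_dev, sizeof(int64_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // With the exact count known, only the rows that actually missed are
  // transferred; the copy runs on the DMA engine and can overlap with
  // compute kernels on other streams.
  cudaMemcpyAsync(scratch_pad_dev, scratch_pad_host,
                  num_misses * row_bytes, cudaMemcpyHostToDevice, stream);
}
```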

Based on the tradeoffs explained above, we chose to implement (2) to
avoid the host-device sync point (sketched below).
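
As a rough sketch of the chosen direction (hypothetical, not the actual FBGEMM implementation): the scratch pad is backed by host-mapped (UVA) memory, so the same allocation is visible to both CPU and GPU; the GPU then reads it with ordinary loads from a kernel running on SMs, and no device-to-host sync is needed to learn an exact size.

```cpp
// Hypothetical sketch of option (2): a UVA scratch pad backed by
// host-mapped memory. All identifiers are illustrative.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void gather_from_uva(
    const float* __restrict__ scratch_pad_uva,  // visible to CPU and GPU
    float* __restrict__ cache_dev,
    int64_t num_elems) {
  const int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (idx < num_elems) {
    // The copy is done with CUDA loads/stores, so it occupies SMs
    // rather than the DMA engine.
    cache_dev[idx] = scratch_pad_uva[idx];
  }
}

int main() {
  const int64_t num_elems = 1 << 20;

  // Host allocation mapped into the device address space (UVA): the CPU
  // writes it in place and the GPU reads the same pointer directly.
  float* scratch_pad = nullptr;
  cudaHostAlloc(&scratch_pad, num_elems * sizeof(float), cudaHostAllocMapped);

  float* cache_dev = nullptr;
  cudaMalloc(&cache_dev, num_elems * sizeof(float));

  // The CPU fills the scratch pad; no explicit host-to-device copy and
  // no device-to-host sync are issued.
  for (int64_t i = 0; i < num_elems; ++i) {
    scratch_pad[i] = static_cast<float>(i);
  }

  // The GPU reads the scratch pad through load instructions over the
  // PCIe/NVLink interconnect.
  const int threads = 256;
  const int blocks = static_cast<int>((num_elems + threads - 1) / threads);
  gather_from_uva<<<blocks, threads>>>(scratch_pad, cache_dev, num_elems);
  cudaDeviceSynchronize();

  cudaFree(cache_dev);
  cudaFreeHost(scratch_pad);
  return 0;
}
```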

Differential Revision: D58631974


netlify bot commented Jul 14, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: 36af07f
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6698391213797500085a9d0a
Deploy Preview: https://deploy-preview-2844--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58631974

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 14, 2024
Summary:
Pull Request resolved: pytorch#2844
Differential Revision: D58631974
@sryap force-pushed the export-D58631974 branch from 5980e4d to cb0be42 on July 14, 2024 21:04

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 14, 2024
Summary:
Pull Request resolved: pytorch#2844
Differential Revision: D58631974
@sryap force-pushed the export-D58631974 branch from cb0be42 to 82bffd1 on July 14, 2024 21:10
sarunya and others added 4 commits July 16, 2024 14:20
Differential Revision: D59795139
Differential Revision: D59716516
Differential Revision: D59866892
Summary:
Pull Request resolved: pytorch#2844

Reviewed By: q10

Differential Revision: D58631974

@facebook-github-bot
Contributor

This pull request has been merged in c44c2d4.
