Make the scratch pad tensor UVA #2844
Conversation
This pull request was exported from Phabricator. Differential Revision: D58631974
This pull request has been merged in c44c2d4.
Summary:
Before this diff, the scratch pad in SSD TBE (see D55998215 for more
detail) was a CPU tensor that was later transferred to the GPU so that
the TBE kernels could access it. This scratch pad transfer was highly
inefficient because TBE over-provisioned the scratch pad buffer
allocation (it did not know the exact number of cache-missed rows),
causing extra data transfer. The extra transfer could be large since
the number of cache-missed rows was normally much smaller than the
over-provisioned size.
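To make the gap concrete, here is a tiny calculation with purely
hypothetical sizes (the actual TBE buffer dimensions are not given
here): copying the whole over-provisioned buffer can move orders of
magnitude more data than the missed rows actually need.

```cuda
#include <cstdio>

int main() {
  // Hypothetical numbers, only to illustrate the over-provisioning cost.
  // Because TBE does not know the miss count on the host, it copies the
  // whole provisioned buffer instead of only the rows that missed.
  const double row_bytes        = 512;      // e.g. 128 float values per row
  const double provisioned_rows = 1 << 20;  // worst-case allocation
  const double missed_rows      = 1 << 14;  // typical actual misses
  printf("bytes transferred: %.0f MiB, bytes needed: %.0f MiB\n",
         provisioned_rows * row_bytes / (1 << 20),
         missed_rows * row_bytes / (1 << 20));
  return 0;
}
```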
There are two ways to avoid the extra data transfer:
(1) Let TBE know the exact number of cache-missed rows on the host.
This requires a device-to-host data transfer, which creates a sync
point between host and device (not desirable in most training jobs).
However, it allows TBE to use `cudaMemcpy`, which utilizes the DMA
engine and lets the memory copy overlap efficiently with other compute
kernels (see the first sketch below).
(2) Make the scratch pad accessible by both CPU and GPU; in other
words, make the scratch pad a UVA tensor. This does not require
device and host synchronization. However, the memory copy has to be
done through CUDA loads/stores, which requires a kernel running on
SMs. Thus, overlapping the memory copy with compute kernels requires
careful SM management (see the second sketch below).
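For illustration, a minimal CUDA sketch of approach (1). This is not
the TBE code; all buffer names and sizes are hypothetical. The blocking
`cudaMemcpy` of the miss count is the sync point described above, after
which only the needed bytes are copied via the DMA engine:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical sizes; the real TBE buffers and names differ.
constexpr size_t kRowBytes        = 512;      // bytes per cache-missed row
constexpr size_t kProvisionedRows = 1 << 20;  // worst-case allocation

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Scratch pad on the host (pinned, so the DMA engine can be used) and
  // its destination buffer on the device.
  void* h_scratch = nullptr;
  void* d_scratch = nullptr;
  cudaHostAlloc(&h_scratch, kProvisionedRows * kRowBytes, cudaHostAllocDefault);
  cudaMalloc(&d_scratch, kProvisionedRows * kRowBytes);

  // The exact miss count lives on the device; producing it (and filling
  // the scratch pad rows) is elided here.
  int32_t* d_num_miss = nullptr;
  cudaMalloc(&d_num_miss, sizeof(int32_t));
  cudaMemset(d_num_miss, 0, sizeof(int32_t));  // stand-in for the real lookup

  // Step 1: bring the exact count back to the host. This blocks until the
  // value is available, i.e. the host/device sync point described above.
  int32_t n_miss = 0;
  cudaMemcpy(&n_miss, d_num_miss, sizeof(int32_t), cudaMemcpyDeviceToHost);

  // Step 2: copy only the rows that actually missed. cudaMemcpyAsync on
  // pinned memory uses the copy (DMA) engine, so it can overlap with
  // compute kernels running in other streams.
  cudaMemcpyAsync(d_scratch, h_scratch, size_t(n_miss) * kRowBytes,
                  cudaMemcpyHostToDevice, stream);

  cudaStreamSynchronize(stream);
  cudaFreeHost(h_scratch);
  cudaFree(d_scratch);
  cudaFree(d_num_miss);
  cudaStreamDestroy(stream);
  return 0;
}
```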
Based on the tradeoffs explained above, we chose to implement (2)
to avoid the host and device sync point.
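And a minimal sketch of the chosen approach (2), again hypothetical and
not the FBGEMM implementation: the scratch pad is a single UVA/managed
allocation visible to both host and device, and the copy is performed
by a regular kernel that reads the miss count on the device, so no host
sync is needed, but the copy occupies SMs:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical sizes; the real TBE buffers and names differ.
constexpr size_t kRowBytes        = 512;      // bytes per cache-missed row
constexpr size_t kProvisionedRows = 1 << 20;  // worst-case allocation

// Copies n_miss rows from the UVA scratch pad into a device buffer.
// The count is read on the device, so the host never needs to know it
// (no sync point), but the copy itself occupies SMs.
__global__ void copy_scratch_pad(const uint8_t* __restrict__ uva_scratch,
                                 uint8_t* __restrict__ d_scratch,
                                 const int32_t* __restrict__ d_num_miss) {
  const size_t total_bytes = static_cast<size_t>(*d_num_miss) * kRowBytes;
  const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < total_bytes;
       i += stride) {
    d_scratch[i] = uva_scratch[i];  // plain CUDA loads/stores over UVA
  }
}

int main() {
  // UVA scratch pad: one allocation addressable from both CPU and GPU.
  // cudaMallocManaged is one way to get such memory; cudaHostAlloc with
  // cudaHostAllocMapped is another.
  uint8_t* uva_scratch = nullptr;
  cudaMallocManaged(&uva_scratch, kProvisionedRows * kRowBytes);

  uint8_t* d_scratch = nullptr;
  int32_t* d_num_miss = nullptr;
  cudaMalloc(&d_scratch, kProvisionedRows * kRowBytes);
  cudaMalloc(&d_num_miss, sizeof(int32_t));
  cudaMemset(d_num_miss, 0, sizeof(int32_t));  // stand-in for the real lookup

  // CPU-side code would fill uva_scratch here (elided).

  // Launch the copy as a regular kernel: no DMA engine and no host sync,
  // but it competes with compute kernels for SMs, so the grid size has to
  // be chosen with the other kernels in mind.
  copy_scratch_pad<<<64, 256>>>(uva_scratch, d_scratch, d_num_miss);

  cudaDeviceSynchronize();
  cudaFree(uva_scratch);
  cudaFree(d_scratch);
  cudaFree(d_num_miss);
  return 0;
}
```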
Reviewed By: q10
Differential Revision: D58631974