Make the scratch pad tensor UVA #2844

Closed · wants to merge 4 commits

Conversation

@sryap (Contributor) commented Jul 14, 2024

Summary:
Before this diff, the scratch pad in SSD TBE (see D55998215 for more
detail) was a CPU tensor that was later transferred to the GPU so that
the TBE kernels could access it. The scratch pad transfer was highly
inefficient because TBE over-provisioned the scratch pad buffer
allocation (it did not know the exact number of cache-missed rows),
causing extra data transfer. This extra transfer could be large, since
the number of cache-missed rows was normally much smaller than the
over-provisioned size.

There are two ways to avoid the extra data transfer:

(1) Let TBE know the exact number of cache-missed rows on the host.
This requires a device-to-host data transfer, which introduces a sync
point between host and device (not desirable in most training jobs).
However, it allows TBE to use cudaMemcpy, which utilizes the DMA
engine and lets the memory copy overlap efficiently with other compute
kernels (a minimal sketch follows this list).

(2) Make the scratch pad accessible by both CPU and GPU; in other
words, make the scratch pad a UVA tensor. This does not require
device-host synchronization. However, the memory copy has to be done
through CUDA loads/stores, which requires a kernel running on SMs.
Thus, overlapping the memory copy with compute kernels requires
careful SM management.
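
For illustration only (not code from this diff), a minimal sketch of option (1) with hypothetical names (`scratch_pad_host`, `num_misses_dev`, etc.): the exact miss count is read back to the host, which is the sync point, but the copy can then be sized exactly and issued with cudaMemcpyAsync on the copy (DMA) engine.

```cpp
// Hypothetical sketch of option (1); identifiers are illustrative,
// not actual FBGEMM names. Assumes scratch_pad_host is pinned memory.
#include <cuda_runtime.h>
#include <cstdint>

void copy_scratch_pad_exact(
    float* scratch_pad_dev,         // destination buffer on the GPU
    const float* scratch_pad_host,  // over-provisioned host buffer
    const int64_t* num_misses_dev,  // exact cache-miss count, lives on the GPU
    int64_t row_bytes,
    cudaStream_t stream) {
  // Read the exact count back to the host: this is the host-device sync point.
  int64_t num_misses = 0;
  cudaMemcpyAsync(&num_misses, num_misses_dev, sizeof(int64_t),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);

  // With the exact count known, only the rows that actually missed are
  // transferred; the copy runs on the DMA engine and can overlap with
  // compute kernels on other streams.
  cudaMemcpyAsync(scratch_pad_dev, scratch_pad_host,
                  num_misses * row_bytes, cudaMemcpyHostToDevice, stream);
}
```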

Based on the tradeoffs explained above, we chose to implement (2) to
avoid the host-device sync point (sketched below).
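
As a rough sketch of the chosen direction (hypothetical, not the actual FBGEMM implementation): the scratch pad is backed by host-mapped (UVA) memory, so the same allocation is visible to both CPU and GPU; the GPU then reads it with ordinary loads from a kernel running on SMs, and no device-to-host sync is needed to learn an exact size.

```cpp
// Hypothetical sketch of option (2): a UVA scratch pad backed by
// host-mapped memory. All identifiers are illustrative.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void gather_from_uva(
    const float* __restrict__ scratch_pad_uva,  // visible to CPU and GPU
    float* __restrict__ cache_dev,
    int64_t num_elems) {
  const int64_t idx = blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
  if (idx < num_elems) {
    // The copy is done with CUDA loads/stores, so it occupies SMs
    // rather than the DMA engine.
    cache_dev[idx] = scratch_pad_uva[idx];
  }
}

int main() {
  const int64_t num_elems = 1 << 20;

  // Host allocation mapped into the device address space (UVA): the CPU
  // writes it in place and the GPU reads the same pointer directly.
  float* scratch_pad = nullptr;
  cudaHostAlloc(&scratch_pad, num_elems * sizeof(float), cudaHostAllocMapped);

  float* cache_dev = nullptr;
  cudaMalloc(&cache_dev, num_elems * sizeof(float));

  // The CPU fills the scratch pad; no explicit host-to-device copy and
  // no device-to-host sync are issued.
  for (int64_t i = 0; i < num_elems; ++i) {
    scratch_pad[i] = static_cast<float>(i);
  }

  // The GPU reads the scratch pad through load instructions over the
  // PCIe/NVLink interconnect.
  const int threads = 256;
  const int blocks = static_cast<int>((num_elems + threads - 1) / threads);
  gather_from_uva<<<blocks, threads>>>(scratch_pad, cache_dev, num_elems);
  cudaDeviceSynchronize();

  cudaFree(cache_dev);
  cudaFreeHost(scratch_pad);
  return 0;
}
```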

Differential Revision: D58631974


netlify bot commented Jul 14, 2024

Deploy Preview for pytorch-fbgemm-docs ready!

Latest commit: 36af07f
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/6698391213797500085a9d0a
Deploy Preview: https://deploy-preview-2844--pytorch-fbgemm-docs.netlify.app

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D58631974

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 14, 2024
Summary:
Pull Request resolved: pytorch#2844
Differential Revision: D58631974
@sryap force-pushed the export-D58631974 branch from 5980e4d to cb0be42 on July 14, 2024 21:04

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 14, 2024
Summary:
Pull Request resolved: pytorch#2844
Differential Revision: D58631974
@sryap force-pushed the export-D58631974 branch from cb0be42 to 82bffd1 on July 14, 2024 21:10
sarunya and others added 4 commits July 16, 2024 14:20
Differential Revision: D59795139
Differential Revision: D59716516
Differential Revision: D59866892
Summary:
Pull Request resolved: pytorch#2844

Reviewed By: q10

Differential Revision: D58631974

@facebook-github-bot
Contributor

This pull request has been merged in c44c2d4.
