
Optimize the index_select operation for dim=0 #1113

Closed
Conversation

sryap
Contributor

@sryap sryap commented May 10, 2022

Summary:
The index_select operation is not well optimized in PyTorch, especially for
the dim=0 case. It has been shown to be one of the main bottlenecks in one
of the models.

This patch optimizes the index_select operation as well as its backward
counterpart (i.e., index_add_select) for the dim=0 case.

Optimizations in index_select:

  • Using sorted indices to promote data access locality
  • Using __ldg to leverage texture cache (read-only data cache)
  • Using a for-loop over UNROLL_FACTOR instead of a manual unroll (gives the
    same performance but makes adjusting UNROLL_FACTOR easier)
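As a rough illustration of the sorted-indices idea, here is a minimal pure-Python sketch (not the actual CUDA kernel; the function name is made up for illustration). Rows are gathered in ascending index order, so reads from the input touch nearby rows, and the sort permutation scatters results back to their original output positions:

```python
# Pure-Python sketch of index_select along dim=0 with sorted indices.
# On GPU, visiting indices in sorted order promotes data access locality;
# the inverse permutation restores the caller's requested row order.
def index_select_dim0_sorted(inp, indices):
    # positions of `indices` in ascending order of the index values
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    out = [None] * len(indices)
    for pos in order:                   # visit source rows in sorted order
        out[pos] = inp[indices[pos]]    # read row, scatter to original slot
    return out
```

The sort changes only the traversal order, not the result, so the output matches a direct gather.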

Optimizations in index_add_select:

  • Writing intermediate results to a local buffer instead of the global
    memory buffer
  • Using unique indices to eliminate empty thread blocks (blocks that are
    launched but return right away because another thread block has already
    processed the index they get from the sorted indices list)
  • Using a 2D grid to compute large columns in different blocks (gives the
    same performance but could be useful for other large-D cases)
  • Using __ldg to leverage texture cache
  • Using UNROLL_FACTOR=4 for FP32 and UNROLL_FACTOR=2 for FP16
  • Adding the consecutive_range_start and consecutive_range_length
    flags, which tell the operation to infer the unique indices and the
    number of unique indices from the consecutive indices range.
    • In some models, rows are selected from a consecutive range. With
      this property, we are able to infer the unique indices and the
      number of unique indices from the consecutive indices range. In the
      backward op, since the unique indices and their count are already
      known, we can skip the unique operation. The performance improvement
      is twofold: (1) no host-device synchronization caused by the resize
      op in unique, and (2) the additional operation for computing the
      frequency of each index is lighter weight than the unique operation.
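The backward strategy can also be sketched in pure Python (again, a hedged illustration, not the CUDA implementation; both function names are invented here). Gradient rows are grouped by unique destination index, so each unique index is handled exactly once, accumulating into a local buffer before a single write to the output row; and when the indices are known to come from a consecutive range, only the per-index frequencies need computing:

```python
# Sketch of the backward (index_add) pass using unique indices: bucketing
# by destination index plays the role of sort + unique in the kernel, so
# no "empty thread block" work is done for repeated indices.
def index_add_dim0_unique(grad_out, indices, num_rows, num_cols):
    buckets = {}                        # unique index -> gradient-row positions
    for pos, idx in enumerate(indices):
        buckets.setdefault(idx, []).append(pos)
    grad_in = [[0.0] * num_cols for _ in range(num_rows)]
    for idx, positions in buckets.items():   # one "block" per unique index
        local = [0.0] * num_cols             # local accumulation buffer
        for pos in positions:
            for c in range(num_cols):
                local[c] += grad_out[pos][c]
        grad_in[idx] = local                 # single write to the output row
    return grad_in

# When rows come from a known consecutive range
# [range_start, range_start + range_length), the unique indices and their
# count are known up front; only per-index frequencies remain to compute,
# so the unique op (and its host-device sync from the resize) is skipped.
def counts_from_consecutive_range(indices, range_start, range_length):
    counts = [0] * range_length
    for idx in indices:
        counts[idx - range_start] += 1
    return counts
```

The local buffer mirrors the kernel's use of registers/local memory to avoid repeated global-memory atomics on the same output row.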

Reviewed By: jianyuh, mjanderson09

Differential Revision: D35920450

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35920450

Summary: Pull Request resolved: pytorch#1113

fbshipit-source-id: 1de9383ce71bfb341671ff403607659b5360898d
