
Optimize the index_select operation for dim=0 #1113

Closed
Conversation

sryap
Contributor

@sryap sryap commented May 10, 2022

Summary:
The index_select operation is not well optimized in PyTorch, especially for
the dim=0 case. It has been shown to be one of the main bottlenecks in one
of the models.

This patch optimizes the index_select operation as well as its backward
counterpart (i.e., index_add_select) for the dim=0 case.

Optimizations in index_select:

  • Using sorted indices to promote data access locality
  • Using __ldg to leverage texture cache (read-only data cache)
  • Using a for-loop over UNROLL_FACTOR instead of a manual unroll (gives the
    same performance but makes adjusting UNROLL_FACTOR easier)
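As a rough illustration of the sorted-indices idea, here is a minimal pure-Python sketch (not the actual CUDA kernel; the function name is made up for illustration). Rows are gathered in ascending index order, so reads from the input touch nearby rows, and the sort permutation scatters results back to their original output positions:

```python
# Pure-Python sketch of index_select along dim=0 with sorted indices.
# On GPU, visiting indices in sorted order promotes data access locality;
# the inverse permutation restores the caller's requested row order.
def index_select_dim0_sorted(inp, indices):
    # positions of `indices` in ascending order of the index values
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    out = [None] * len(indices)
    for pos in order:                   # visit source rows in sorted order
        out[pos] = inp[indices[pos]]    # read row, scatter to original slot
    return out
```

The sort changes only the traversal order, not the result, so the output matches a direct gather.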

Optimizations in index_add_select:

  • Writing intermediate results to a local buffer instead of the global
    memory buffer
  • Using unique indices to eliminate empty thread blocks (blocks that are
    launched but return right away because another thread block has already
    processed the index they get from the sorted indices list)
  • Using a 2D grid to compute large columns in different blocks (gives the
    same performance but could be useful for other large-D cases)
  • Using __ldg to leverage texture cache
  • Using UNROLL_FACTOR=4 for FP32 and UNROLL_FACTOR=2 for FP16
  • Adding the consecutive_range_start and consecutive_range_length
    flags, which tell the operation to infer the unique indices and the
    number of unique indices from the consecutive indices range.
    • In some models, rows are selected from a consecutive range. With
      this property, we are able to infer the unique indices and the
      number of unique indices from the consecutive indices range. In the
      backward op, since the unique indices and their count are already
      known, we can skip the unique operation. The performance improvement
      is twofold: (1) no host-device synchronization caused by the resize
      op in unique, and (2) the additional operation for computing the
      frequency of each index is lighter weight than the unique operation.
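The backward strategy can also be sketched in pure Python (again, a hedged illustration, not the CUDA implementation; both function names are invented here). Gradient rows are grouped by unique destination index, so each unique index is handled exactly once, accumulating into a local buffer before a single write to the output row; and when the indices are known to come from a consecutive range, only the per-index frequencies need computing:

```python
# Sketch of the backward (index_add) pass using unique indices: bucketing
# by destination index plays the role of sort + unique in the kernel, so
# no "empty thread block" work is done for repeated indices.
def index_add_dim0_unique(grad_out, indices, num_rows, num_cols):
    buckets = {}                        # unique index -> gradient-row positions
    for pos, idx in enumerate(indices):
        buckets.setdefault(idx, []).append(pos)
    grad_in = [[0.0] * num_cols for _ in range(num_rows)]
    for idx, positions in buckets.items():   # one "block" per unique index
        local = [0.0] * num_cols             # local accumulation buffer
        for pos in positions:
            for c in range(num_cols):
                local[c] += grad_out[pos][c]
        grad_in[idx] = local                 # single write to the output row
    return grad_in

# When rows come from a known consecutive range
# [range_start, range_start + range_length), the unique indices and their
# count are known up front; only per-index frequencies remain to compute,
# so the unique op (and its host-device sync from the resize) is skipped.
def counts_from_consecutive_range(indices, range_start, range_length):
    counts = [0] * range_length
    for idx in indices:
        counts[idx - range_start] += 1
    return counts
```

The local buffer mirrors the kernel's use of registers/local memory to avoid repeated global-memory atomics on the same output row.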

Reviewed By: jianyuh, mjanderson09

Differential Revision: D35920450

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D35920450

Summary: Pull Request resolved: pytorch#1113

fbshipit-source-id: 1de9383ce71bfb341671ff403607659b5360898d
