Add batch_index_select_dim0 (w/ TBE backend) #1897
Conversation
This pull request was exported from Phabricator. Differential Revision: D46084590
Summary: Pull Request resolved: pytorch#1897

This diff introduces `batch_index_select_dim0` using the `SplitTBE` implementation (it shares the same code generator as TBE). The new operator is designed to address limitations of `group_index_select_dim0`. Both operators operate on multiple inputs; however, `batch_index_select_dim0` requires all inputs to be contiguous in memory, while `group_index_select_dim0` can operate on inputs with a discrete memory layout. Implementation-wise, they are different. We plan to merge their backends in the future.

Since `batch_index_select_dim0` is backed by TBE, it inherits TBE limitations, including:

- The column sizes must be multiples of 4 and must not exceed 1024. Moreover, the underlying buffer of the inputs tensor must be 16-byte aligned, because the TBE kernel uses vector loads/stores that require 16-byte alignment. The kernel will raise an error if this assumption is violated.
- Due to the 16-byte alignment requirement, if the output gradient is not 16-byte aligned during the backward pass, the operator will copy it into a new 16-byte aligned buffer. This can be expensive if the output gradient is large.

Usage (a runnable sketch follows this summary):

```
# This target might change in the future
torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu/codegen:index_select_ops")
...
output = torch.ops.fbgemm.batch_index_select_dim0(
    inputs,             # Tensor - 1D tensor (concatenated flattened inputs)
    indices,            # Tensor - 1D tensor (concatenated indices)
    input_num_indices,  # List[int]
    input_rows,         # List[int]
    input_columns,      # List[int]
)
```

Reviewed By: jianyuh

Differential Revision: D46084590

fbshipit-source-id: 160ea0810abf3be3fc5f087f8a7a56437e481874
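To make the calling convention concrete, here is a minimal end-to-end sketch of how the concatenated arguments might be built for two inputs. It is illustrative only: the shapes, values, and variable names are assumptions, not part of this PR; the column sizes are simply chosen to satisfy the multiple-of-4 constraint described above.

```
# Minimal sketch of batch_index_select_dim0 usage, assuming the
# index_select_ops library has already been loaded as shown above.
import torch

# Two hypothetical 2D inputs; column sizes (8 and 4) are multiples of 4,
# per the TBE limitation, and do not exceed 1024.
x0 = torch.randn(10, 8, device="cuda")  # 10 rows, 8 columns
x1 = torch.randn(6, 4, device="cuda")   # 6 rows, 4 columns

# Per-input row indices to select along dim 0.
idx0 = torch.tensor([0, 3, 9], device="cuda")
idx1 = torch.tensor([1, 5], device="cuda")

# Flatten and concatenate the inputs and indices into 1D tensors.
inputs = torch.cat([x0.flatten(), x1.flatten()])
indices = torch.cat([idx0, idx1])

# The TBE kernel requires the inputs buffer to be 16-byte aligned;
# a freshly allocated tensor normally is, but it can be checked:
assert inputs.data_ptr() % 16 == 0

output = torch.ops.fbgemm.batch_index_select_dim0(
    inputs,
    indices,
    [idx0.numel(), idx1.numel()],  # input_num_indices
    [x0.size(0), x1.size(0)],      # input_rows
    [x0.size(1), x1.size(1)],      # input_columns
)

# The result should correspond to the concatenated, flattened selections,
# i.e. roughly torch.cat([x0[idx0].flatten(), x1[idx1].flatten()]).
```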
This pull request has been merged in 410d264.