
Add batch_index_select_dim0 (w/ TBE backend) #1897

Closed

sryap wants to merge 1 commit

Conversation

@sryap (Contributor) commented Jul 27, 2023

Summary:

Usage:

```
# This target might change in the future
torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu/codegen:index_select_ops")

...

output = torch.ops.fbgemm.batch_index_select_dim0(
    inputs,            # Tensor - 1D tensor (concatenated flattened inputs)
    indices,           # Tensor - 1D tensor (concatenated indices)
    input_num_indices, # List[int]
    input_rows,        # List[int]
    input_columns,     # List[int]
)
```

Differential Revision: D46084590
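For illustration, the flattened, concatenated arguments could be prepared as follows. This is a hedged sketch using plain Python lists; the real op expects a 1D tensor for `inputs`, and the matrices here are invented for the example:

```python
# Hypothetical sketch: build the concatenated arguments for
# batch_index_select_dim0 from per-input (rows x cols) matrices.
mats = [
    [[0, 1], [2, 3], [4, 5]],  # input 0: 3 rows x 2 cols
    [[6, 7], [8, 9]],          # input 1: 2 rows x 2 cols
]
# Flatten each matrix row-major and concatenate them into one 1D list.
inputs = [x for m in mats for row in m for x in row]
input_rows = [len(m) for m in mats]        # [3, 2]
input_columns = [len(m[0]) for m in mats]  # [2, 2]
```

In the real call, `inputs` would be a single contiguous 1D tensor built the same way.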

@netlify netlify bot commented Jul 27, 2023

Deploy Preview for pytorch-fbgemm-docs canceled.

Latest commit: e5ee9d2
Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/64c9454e8a817f00075d41cd

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D46084590

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 27, 2023
@sryap sryap force-pushed the export-D46084590 branch from faeddd5 to 43819cc Compare July 27, 2023 23:40

@sryap sryap force-pushed the export-D46084590 branch from 43819cc to c8d9b41 Compare July 28, 2023 00:22

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
Summary:
Pull Request resolved: pytorch#1897

This diff introduces `batch_index_select_dim0` using the `SplitTBE`
implementation (it shares the same code generator as TBE).  The new
operator is designed to address limitations of
`group_index_select_dim0`.  Both operators operate on multiple inputs.
However, `batch_index_select_dim0` requires all inputs to be
contiguous in memory, while `group_index_select_dim0` can operate on
inputs with a discrete memory layout.  Implementation-wise, they are
different.  We plan to merge their backends in the future.

Since `batch_index_select_dim0` is backed by TBE, it inherits TBE
limitations, including:
- Column sizes must be multiples of 4 and must not exceed 1024.
  Moreover, the underlying buffer of the `inputs` tensor must be
  16-byte aligned, because the TBE kernel uses vector loads/stores
  that require 16-byte alignment.  The kernel raises an error if this
  assumption is violated.
- Due to the 16-byte alignment requirement, if the output gradient is
  not 16-byte aligned during the backward pass, the operator copies it
  into a new 16-byte-aligned buffer.  This can be expensive when the
  output gradient is large.
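Given the multiple-of-4 column constraint, a caller could round each input's column count up before building the concatenated buffer. A minimal sketch, where `pad_cols` is a hypothetical helper (not part of fbgemm) and the extra columns are assumed to be zero-filled:

```python
def pad_cols(num_cols: int, multiple: int = 4) -> int:
    # Round a column count up to the next multiple of `multiple`,
    # illustrating the TBE multiple-of-4 column-size constraint.
    # (Hypothetical helper; padded columns would be zero-filled
    # before flattening and concatenation.)
    return -(-num_cols // multiple) * multiple

print(pad_cols(6))  # -> 8
print(pad_cols(8))  # -> 8 (already a multiple of 4)
```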

Usage:

```
# This target might change in the future
torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu/codegen:index_select_ops")

...

output = torch.ops.fbgemm.batch_index_select_dim0(
    inputs,            # Tensor - 1D tensor (concatenated flattened inputs)
    indices,           # Tensor - 1D tensor (concatenated indices)
    input_num_indices, # List[int]
    input_rows,        # List[int]
    input_columns,     # List[int]
)
```
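To make these arguments concrete, the op's forward semantics can be modeled in plain Python. This is a hedged reference sketch for illustration only, not the fbgemm/TBE implementation; it treats `inputs` as a flat list, and the function name and example values are invented here:

```python
def batch_index_select_dim0_ref(inputs, indices, input_num_indices,
                                input_rows, input_columns):
    # Reference model (illustrative only): input i is a flattened
    # (input_rows[i] x input_columns[i]) matrix stored inside the
    # concatenated `inputs`.  Select the requested rows from each
    # input and concatenate the results into one flat list.
    output = []
    in_off = 0   # offset of the current input within `inputs`
    idx_off = 0  # offset of this input's indices within `indices`
    for n, rows, cols in zip(input_num_indices, input_rows, input_columns):
        for idx in indices[idx_off:idx_off + n]:
            row_start = in_off + idx * cols
            output.extend(inputs[row_start:row_start + cols])
        in_off += rows * cols
        idx_off += n
    return output

# Inputs: a 3x2 matrix [[0,1],[2,3],[4,5]] and a 2x2 matrix [[6,7],[8,9]].
flat = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idxs = [2, 0, 1]  # rows 2 and 0 of input 0, row 1 of input 1
out = batch_index_select_dim0_ref(flat, idxs, [2, 1], [3, 2], [2, 2])
# out == [4, 5, 0, 1, 8, 9]
```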

Differential Revision: D46084590

fbshipit-source-id: 96fb152d0270e2d09127fbaab349b5ac02068bcb
@sryap sryap force-pushed the export-D46084590 branch from c8d9b41 to 45f055b Compare July 28, 2023 00:28

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 45f055b to 77f1ffc Compare July 28, 2023 01:00

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 77f1ffc to 9e2b759 Compare July 28, 2023 01:04

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 9e2b759 to 4e51320 Compare July 28, 2023 01:11

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 4e51320 to 8967bf7 Compare July 28, 2023 01:18

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 8967bf7 to 370efc0 Compare July 28, 2023 01:34

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 370efc0 to 1140128 Compare July 28, 2023 07:27

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 1140128 to 9ee003b Compare July 28, 2023 23:13

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 9ee003b to 886b0b0 Compare July 28, 2023 23:28

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from 886b0b0 to d8bbfa6 Compare July 28, 2023 23:36

sryap added a commit to sryap/FBGEMM that referenced this pull request Jul 28, 2023
@sryap sryap force-pushed the export-D46084590 branch from d8bbfa6 to f7df255 Compare July 28, 2023 23:44

sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
@sryap sryap force-pushed the export-D46084590 branch from f7df255 to acb9aff Compare August 1, 2023 01:13

sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
@sryap sryap force-pushed the export-D46084590 branch from acb9aff to 49239ea Compare August 1, 2023 01:22

sryap added a commit to sryap/FBGEMM that referenced this pull request Aug 1, 2023
@sryap sryap force-pushed the export-D46084590 branch from 49239ea to 47949dc Compare August 1, 2023 17:42

Reviewed By: jianyuh

Differential Revision: D46084590
@sryap sryap force-pushed the export-D46084590 branch from 47949dc to e5ee9d2 Compare August 1, 2023 17:47

@facebook-github-bot (Contributor)

This pull request has been merged in 410d264.
