SwiGLU optimized fw/bw #490

Merged
merged 36 commits into gh/danthe3rd/52/base on Nov 10, 2022
Conversation

@danthe3rd (Contributor) commented on Oct 24, 2022

Stack from ghstack (oldest at bottom):

**NOTE**
We can improve performance a bit more once NVIDIA/cutlass#674 is fixed.

**USAGE**

```python
import xformers.ops as xops

# NOTE: Important to use `unbind` from xformers for the bw pass!
w1, w2 = xops.unbind(
    w1w2.view([2, w1w2.shape[0] // 2, w1w2.shape[1]]),
    dim=0,
)
b1, b2 = xops.unbind(b1b2.view([2, b1b2.shape[0] // 2]), dim=0)
y = xops.functional_swiglu(x, w1, b1, w2, b2, w3, b3)
```
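For readers unfamiliar with the op, the eager computation that the fused kernel replaces is a standard SwiGLU MLP. The sketch below illustrates that reference math with the same parameter names as the snippet above; it is an illustration, not the exact reference implementation in xformers:

```python
import torch.nn.functional as F

def swiglu_reference(x, w1, b1, w2, b2, w3, b3):
    # Two parallel projections of the input: a gate branch and a value branch.
    x1 = F.linear(x, w1, b1)          # [B, H] gate branch
    x2 = F.linear(x, w2, b2)          # [B, H] value branch
    hidden = F.silu(x1) * x2          # SwiGLU: silu(x @ w1.T + b1) * (x @ w2.T + b2)
    return F.linear(hidden, w3, b3)   # project back to the model dimension
```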

**PERFORMANCE (A100 only)**

*FW*
```
[-------------------------------------------------------- swiglu_fw ---------------------------------------------------------]
                                     |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
1 threads: -------------------------------------------------------------------------------------------------------------------
      f16    B=9456, I=1536, H=4096  |               1377.7               |  1581.4  |         1339.1
      f16.ac B=9456, I=1536, H=4096  |               1449.3               |  1735.3  |         1462.9
      f16    B=4440, I=1536, H=4096  |                600.4               |   735.6  |          593.9
      f16.ac B=4440, I=1536, H=4096  |                709.0               |   843.7  |          717.6
      f16    B=4728, I=1536, H=4096  |                638.9               |   776.2  |          635.3
      f16.ac B=4728, I=1536, H=4096  |                748.9               |   892.2  |          756.7
      f16    B=4728, I=1536, H=1024  |                162.3               |   201.5  |          163.1
      f16.ac B=4728, I=1536, H=1024  |                235.2               |   277.4  |          245.5

Times are in microseconds (us).
```

*BW*
```
[-------------------------------------------------------- swiglu_bw ---------------------------------------------------------]
                                     |  SwiGLUPackedFusedOp[fused.p.cpp]  |  eager   |  SwiGLUFusedOp[fused]
1 threads: -------------------------------------------------------------------------------------------------------------------
      f16    B=9456, I=1536, H=4096  |               2333.1               |  2696.7  |         2336.1
      f16.ac B=9456, I=1536, H=4096  |               2620.8               |  2990.9  |         2840.0
      f16    B=4440, I=1536, H=4096  |               1243.2               |  1413.8  |         1240.3
      f16.ac B=4440, I=1536, H=4096  |               1448.6               |  1629.0  |         1637.3
      f16    B=4728, I=1536, H=4096  |               1298.4               |  1481.5  |         1301.1
      f16.ac B=4728, I=1536, H=4096  |               1511.8               |  1705.3  |         1705.4
      f16    B=4728, I=1536, H=1024  |                463.3               |   493.9  |          463.0
      f16.ac B=4728, I=1536, H=1024  |                582.4               |   614.9  |          672.7

Times are in microseconds (us).
```
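The tables above are in the output format of `torch.utils.benchmark`. As a rough, hedged sketch of how a single fw measurement of this kind could be reproduced (the actual benchmark script in the repo, and the exact shape conventions for I and H, may differ):

```python
import torch
import xformers.ops as xops
from torch.utils import benchmark

# Assumed shape convention: I = model dim, H = hidden dim of the MLP.
B, I, H = 4728, 1536, 1024
dtype, device = torch.half, "cuda"

x = torch.randn(B, I, device=device, dtype=dtype, requires_grad=True)
w1 = torch.randn(H, I, device=device, dtype=dtype, requires_grad=True)
b1 = torch.randn(H, device=device, dtype=dtype, requires_grad=True)
w2 = torch.randn(H, I, device=device, dtype=dtype, requires_grad=True)
b2 = torch.randn(H, device=device, dtype=dtype, requires_grad=True)
w3 = torch.randn(I, H, device=device, dtype=dtype, requires_grad=True)
b3 = torch.randn(I, device=device, dtype=dtype, requires_grad=True)

timer = benchmark.Timer(
    stmt="xops.functional_swiglu(x, w1, b1, w2, b2, w3, b3)",
    globals=dict(xops=xops, x=x, w1=w1, b1=b1, w2=w2, b2=b2, w3=w3, b3=b3),
    num_threads=1,
)
print(timer.timeit(100))  # prints a summary of the measured times
```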

@facebook-github-bot added the CLA Signed label on Oct 24, 2022
@danthe3rd mentioned this pull request on Oct 24, 2022
danthe3rd pushed a commit that referenced this pull request Oct 24, 2022
ghstack-source-id: eb1801be830e5b7af5f4913eaf3a0e76c1465a69
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 24, 2022
ghstack-source-id: b30671b2fc7903973bf6e5ab83542532b91d3d74
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: c7a4eda1bead77d8a8ba18deaf4067cf23402205
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: 055d35dff615ebcc5c8380d07a0b580a67260c52
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: a87a46d345dcb98dc0c53c56575fcda38cd5bccd
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: e8f89ae5e89d7fc7907a6cd32d2fd85e04b08eda
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: b864e2340fdea7f0c6819f9349dbb3b41766c9f1
Pull Request resolved: #490
danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: 1ff447ae98cc07c4e3de9653884175cc4c59b5ec
Pull Request resolved: #490

danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: 7b874c69561bf1756e95ccfad9407e4ea9d18e85
Pull Request resolved: #490

danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: abc12d1ec3cabbec2ebd5c2fffb72167609f3d85
Pull Request resolved: #490

danthe3rd pushed a commit that referenced this pull request Oct 25, 2022
ghstack-source-id: 520ade162a45516f01a84e551958e4c54a0fe4e3
Pull Request resolved: #490
@danthe3rd mentioned this pull request on Oct 26, 2022
danthe3rd added 2 commits October 26, 2022 14:54

danthe3rd added 2 commits October 28, 2022 13:45

danthe3rd added 4 commits October 28, 2022 14:43
@codecov-commenter commented on Oct 28, 2022

Codecov Report

Base: 90.60% // Head: 88.38% // Decreases project coverage by 2.22% ⚠️

Coverage data is based on head (3490242) compared to base (d027db4).
Patch coverage: 41.29% of modified lines in pull request are covered.

Additional details and impacted files
@@                   Coverage Diff                    @@
##           gh/danthe3rd/52/base     #490      +/-   ##
========================================================
- Coverage                 90.60%   88.38%   -2.23%     
========================================================
  Files                        79       80       +1     
  Lines                      4652     4785     +133     
========================================================
+ Hits                       4215     4229      +14     
- Misses                      437      556     +119     
Flag Coverage Δ
Python 88.38% <41.29%> (-2.23%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
xformers/info.py 0.00% <0.00%> (ø)
xformers/ops/swiglu.py 35.93% <39.00%> (-54.54%) ⬇️
xformers/ops/common.py 62.50% <62.50%> (ø)
xformers/ops/__init__.py 81.25% <100.00%> (ø)
xformers/ops/memory_efficient_attention.py 85.79% <100.00%> (+0.58%) ⬆️


danthe3rd added 2 commits October 31, 2022 09:43
@danthe3rd mentioned this pull request on Nov 4, 2022
danthe3rd added 2 commits November 4, 2022 10:04
@fmassa (Contributor) left a comment:

Thanks!

Resolved review threads: .circleci/config.yml, setup.py, tests/test_swiglu.py, xformers/components/swiglu/swiglu_packedw.cpp, xformers/ops/swiglu.py, xformers/info.py
@danthe3rd merged commit a90fe49 into gh/danthe3rd/52/base on Nov 10, 2022
danthe3rd pushed a commit that referenced this pull request Nov 10, 2022
ghstack-source-id: 7998ff3210011362be7c379666655e9bc5078dde
Pull Request resolved: #490
@danthe3rd deleted the gh/danthe3rd/52/head branch on November 10, 2022 18:11