
dev(hansbug): add stream support for parallelizing the calculations in the tree #10

Merged
merged 5 commits from dev/stream into main on Aug 14, 2022

Conversation

@HansBug (Member) commented Aug 12, 2022

Here is an example:

import time

import numpy as np
import torch

import treetensor.torch as ttorch

N, M, T = 200, 2, 50  # N tree keys, M CUDA streams, T timing iterations
S1, S2, S3 = 512, 1024, 2048  # matrix sizes: each matmul is (S1, S2) @ (S2, S3)


def test_min():
    # Benchmark ttorch.matmul on a tree with only N // M keys (the per-stream share of work).
    a = ttorch.randn({f'a{i}': (S1, S2) for i in range(N // M)}, device='cuda')
    b = ttorch.randn({f'a{i}': (S2, S3) for i in range(N // M)}, device='cuda')

    result = []
    for i in range(T):
        _start_time = time.time()

        _ = ttorch.matmul(a, b)
        torch.cuda.synchronize()

        _end_time = time.time()
        result.append(_end_time - _start_time)

    print('time cost: mean({}) std({})'.format(np.mean(result), np.std(result)))


def test_native():
    # Benchmark plain dicts of tensors: torch.matmul applied key by key on the default stream.
    a = {f'a{i}': torch.randn(S1, S2, device='cuda') for i in range(N)}
    b = {f'a{i}': torch.randn(S2, S3, device='cuda') for i in range(N)}

    result = []
    for i in range(T):
        _start_time = time.time()

        for key in a.keys():
            _ = torch.matmul(a[key], b[key])
        torch.cuda.synchronize()

        _end_time = time.time()
        result.append(_end_time - _start_time)

    print('time cost: mean({}) std({})'.format(np.mean(result), np.std(result)))


def test_linear():
    # Benchmark ttorch.matmul on the full tree of N keys without extra streams.
    a = ttorch.randn({f'a{i}': (S1, S2) for i in range(N)}, device='cuda')
    b = ttorch.randn({f'a{i}': (S2, S3) for i in range(N)}, device='cuda')

    result = []
    for i in range(T):
        _start_time = time.time()

        _ = ttorch.matmul(a, b)
        torch.cuda.synchronize()

        _end_time = time.time()
        result.append(_end_time - _start_time)

    print('time cost: mean({}) std({})'.format(np.mean(result), np.std(result)))


def test_stream():
    # Benchmark ttorch.matmul on the full tree of N keys with M CUDA streams enabled.
    a = ttorch.randn({f'a{i}': (S1, S2) for i in range(N)}, device='cuda')
    b = ttorch.randn({f'a{i}': (S2, S3) for i in range(N)}, device='cuda')

    ttorch.stream(M)  # enable M CUDA streams for subsequent tree operations (the API added in this PR)
    result = []
    for i in range(T):
        _start_time = time.time()

        _ = ttorch.matmul(a, b)
        torch.cuda.synchronize()

        _end_time = time.time()
        result.append(_end_time - _start_time)

    print('time cost: mean({}) std({})'.format(np.mean(result), np.std(result)))


def warmup():
    # Warm up the GPU: run a few matmuls so CUDA initialization does not skew the timings.
    a = torch.randn(1024, 1024).cuda()
    b = torch.randn(1024, 1024).cuda()
    for _ in range(20):
        c = torch.matmul(a, b)


if __name__ == '__main__':
    warmup()
    test_min()
    test_native()
    test_linear()
    test_stream()

To be honest, though, the practical effect of this stream support is quite fragile: it depends heavily on tensor size (both too large and too small perform poorly), and it also needs a GPU with enough headroom; used carelessly it can easily turn into a pessimization. In short, it is hard to get right. If we want to make this practical, it needs further study.
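For context, the kind of multi-stream dispatch this feature relies on can be sketched in plain PyTorch. The sketch below is an illustration only, not the implementation in this PR: it round-robins the per-key matmuls over a fixed number of CUDA streams and then makes the default stream wait on them, which is the pattern whose payoff depends on tensor size and GPU capacity as noted above. The function name matmul_multi_stream is hypothetical.

import torch


def matmul_multi_stream(a, b, num_streams):
    # Illustrative sketch only (plain PyTorch, not the ttorch.stream implementation):
    # queue each per-key matmul on one of `num_streams` CUDA streams so the kernels
    # can overlap on the GPU, then make the default stream wait for all of them.
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    # Side streams must first wait for work already queued on the current (default)
    # stream, e.g. the kernels that produced the input tensors.
    for s in streams:
        s.wait_stream(torch.cuda.current_stream())
    out = {}
    for idx, key in enumerate(a):
        with torch.cuda.stream(streams[idx % num_streams]):
            out[key] = torch.matmul(a[key], b[key])
    for s in streams:
        torch.cuda.current_stream().wait_stream(s)
    return out

With the dicts from test_native, matmul_multi_stream(a, b, M) would be roughly the hand-rolled analogue of calling ttorch.stream(M) before ttorch.matmul(a, b).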

@HansBug HansBug added the enhancement label Aug 12, 2022
@HansBug HansBug requested a review from PaParaZz1 August 12, 2022 15:14
@HansBug HansBug self-assigned this Aug 12, 2022
codecov bot commented Aug 14, 2022

Codecov Report

Merging #10 (6a3a7dd) into main (5d34647) will decrease coverage by 0.64%.
The diff coverage is 92.89%.

@@            Coverage Diff             @@
##             main      #10      +/-   ##
==========================================
- Coverage   97.84%   97.19%   -0.65%     
==========================================
  Files          32       33       +1     
  Lines        1576     1607      +31     
==========================================
+ Hits         1542     1562      +20     
- Misses         34       45      +11     
Flag Coverage Δ
unittests 97.19% <92.89%> (-0.65%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
treetensor/torch/stream.py 52.17% <52.17%> (ø)
treetensor/torch/tensor.py 99.36% <98.50%> (+<0.01%) ⬆️
treetensor/torch/__init__.py 97.50% <100.00%> (+0.13%) ⬆️
treetensor/torch/funcs/comparison.py 100.00% <100.00%> (ø)
treetensor/torch/funcs/construct.py 100.00% <100.00%> (ø)
treetensor/torch/funcs/math.py 100.00% <100.00%> (ø)
treetensor/torch/funcs/matrix.py 100.00% <100.00%> (ø)
treetensor/torch/funcs/operation.py 100.00% <100.00%> (ø)


@HansBug HansBug merged commit 554b8ca into main Aug 14, 2022
@HansBug HansBug deleted the dev/stream branch August 14, 2022 04:02