2D downsampling of uint8 data inefficient #737

Closed
nkemnitz opened this issue Jul 3, 2024 · 5 comments · Fixed by #894
Labels
enhancement New feature or request

Comments

@nkemnitz
Collaborator

nkemnitz commented Jul 3, 2024

All our tensors are passed to torch as NCXYZ and converted to float32. That's not just an extra copy; float32 also takes 4x the memory (a 4096 x 4096 uint8 patch grows from 16 MiB to 64 MiB).

  • pytorch's interpolate supports 'bilinear' downsampling for uint8 data, but requires an NCXY tensor as input
  • tinybrain has fast AvgPool implementations for (2,2), (2,2,1), and (2,2,1,1), as well as for (2,2,2) and (2,2,2,1) ndarrays

Another thing to consider: CloudVolume data is already in Fortran order, which is what tinybrain expects. Some benchmarks:

data = np.asfortranarray(np.random.randint(0,255, size=(1,1,4096,4096,1), dtype=np.uint8))
# Torch CPU, uint8->float32->uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).float(), scale_factor=[0.5,0.5,1.0], mode='trilinear').byte()
84 ms ± 774 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Torch GPU, uint8->float32->uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).cuda().float(), scale_factor=[0.5,0.5,1.0], mode='trilinear').byte().cpu()
6.34 ms ± 32.5 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Torch CPU, uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).squeeze(-1), scale_factor=[0.5,0.5], mode='bilinear')
162 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Torch CUDA, uint8
%timeit torch.nn.functional.interpolate(torch.from_numpy(data).cuda().squeeze(-1), scale_factor=[0.5,0.5], mode='bilinear').cpu()
RuntimeError: "upsample_bilinear2d_out_frame" not implemented for 'Byte'
# Tinybrain, uint8->float32->uint8
%timeit tinybrain.downsample_with_averaging(data.astype(np.float32).squeeze((0,1)), factor=[2,2])[0].astype(np.uint8)
32.1 ms ± 254 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Tinybrain, uint8
%timeit tinybrain.downsample_with_averaging(data.squeeze((0,1)), factor=[2,2])[0]
1.45 ms ± 12.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
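
For reference, here is a minimal sketch of how the uint8 path could skip the float32 round trip by handing the Fortran-ordered XYZ block straight to tinybrain. The function name and the N == C == 1 restriction are my assumptions for illustration, not existing zetta_utils code:

import numpy as np
import tinybrain

def downsample_uint8_avg(data: np.ndarray, factor=(2, 2, 1)) -> np.ndarray:
    """Average-downsample an NCXYZ uint8 block without a float32 round trip.

    Sketch only: assumes N == C == 1 and one of the pooling factors tinybrain
    supports natively, e.g. (2, 2, 1).
    """
    assert data.dtype == np.uint8 and data.shape[:2] == (1, 1)
    xyz = np.asfortranarray(data[0, 0])  # no copy if the block is already F-ordered
    out = tinybrain.downsample_with_averaging(xyz, factor=list(factor))[0]
    return out[np.newaxis, np.newaxis]   # back to NCXYZ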
@nkemnitz added the enhancement label Jul 3, 2024
@supersergiy
Member

It's surprising to me that performance matters here; I thought interpolation would be mostly bandwidth-bound.

@nkemnitz
Collaborator Author

nkemnitz commented Jul 3, 2024

Just checked: downloading a 4k x 4k uint8 JPG patch takes 100-150 ms, similar to the current downsampling behavior.

@supersergiy
Member

Wow, that's a crazy fast download! But doesn't that mean there's basically no inefficiency if we use pipelining? Then again, it may not matter: we should just use tinybrain instead of the default torch behavior. It's not a hard fix.

@supersergiy
Member

@nkemnitz we've been using tinybrain for segmentation for a while now. Should this be closed?

if mode == "segmentation" and (
scale_factor_tuple is not None
and (
tuple(scale_factor_tuple)
in (
[(0.5 ** i, 0.5 ** i) for i in range(1, 5)] # 2D factors of 2
+ [(0.5 ** i, 0.5 ** i, 1) for i in range(1, 5)]
+ [(0.5 ** i, 0.5 ** i, 0.5 ** i) for i in range(1, 5)] # #D factors of 2
)
)
and data.shape[0] == 1
): # use tinybrain
result_raw = _interpolate_segmentation_with_tinybrain(
data=data, scale_factor_tuple=scale_factor_tuple
)
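
A hypothetical analogue for uint8 images, mirroring the segmentation branch above; the mode check, the supported-factor list, and _interpolate_img_with_tinybrain are assumptions for illustration, not existing code:

if (
    mode == "img"
    and data.dtype == np.uint8
    and data.shape[0] == 1
    and scale_factor_tuple is not None
    and tuple(scale_factor_tuple)
    in (
        [(0.5 ** i, 0.5 ** i) for i in range(1, 5)]
        + [(0.5 ** i, 0.5 ** i, 1) for i in range(1, 5)]
    )
):  # use tinybrain average pooling instead of the float32 torch path
    result_raw = _interpolate_img_with_tinybrain(  # hypothetical helper
        data=data, scale_factor_tuple=scale_factor_tuple
    )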

@nkemnitz
Collaborator Author

Still relevant for average downsampling of uint8 images, especially the 4x memory savings.
