cuDF-style operations & NVTX annotations for local CuPy benchmark #548
Conversation
I'm getting several compute failures from the workers when running the benchmark locally.
From a cursory glance, it looks like something is happening upstream in Dask.
@charlesbluca dask/dask#7364 should fix the issue above.
Thanks @pentschev!
rerun tests
It looks like there is a style issue. Charles, could you please run the style checks?
Done! Thanks for the catch @jakirkham 🙂
@pentschev do you have any more thoughts here? 🙂
```python
func_args = (x, idx)
func = lambda x, idx: x[idx]
```
Should we be using `.copy()` here like we do with the slicing operation?
Good point. I remember that when I wrote the CuPy benchmarks, I had to add `.copy()` to ensure it actually sliced the array instead of returning a view. I'm not totally sure whether there's a case where Dask would return only a view, do you know @jakirkham?
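As an aside, a minimal CuPy-only sketch (not from this thread) of the view-versus-copy behavior being described; the array size is arbitrary:

```python
import cupy as cp

x = cp.random.random(1000)

view = x[::3]          # basic slicing returns a view that shares x's memory
copy = x[::3].copy()   # .copy() forces a materialized, independent array

view[0] = -1.0         # writing through the view modifies x ...
print(x[0])            # ... so this prints -1.0
print(copy[0])         # ... while the copy still holds the original value
```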
That depends on what computational backend Dask is using. If we are using the threaded scheduler, it is probably a view. If we are using the distributed scheduler, it may be a view if the data was already on that worker; otherwise it wouldn't be.
I guess the safest thing here is to actually profile both cases. Ideally, if neither case is returning just a view, what we would see is:

- With `.copy()`: some kernels with a copy (or copies) at the end;
- Without `.copy()`: the same kernels as above but no copies at the end.
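One way to make the two cases easy to tell apart in an Nsight Systems timeline is to wrap them in NVTX ranges. This is a hedged, CuPy-only sketch assuming the `nvtx` Python package; the real benchmark runs the operation through Dask:

```python
import cupy as cp
import nvtx

x = cp.random.random(1_000_000)
idx = cp.random.randint(0, 1_000_000, 1_000_000)

with nvtx.annotate("fancy index, no copy", color="blue"):
    _ = x[idx]
    cp.cuda.Stream.null.synchronize()  # keep the kernels inside the range

with nvtx.annotate("fancy index + copy", color="red"):
    _ = x[idx].copy()
    cp.cuda.Stream.null.synchronize()
```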
It looks like the compute failures are still happening even with @pentschev's fix - I'll try to dig more into this.
Where are you seeing failures? Is this locally? AFAICT CI passed (though I haven't dug into the logs).
Do you have more details? This is the simplified version of your code I used as a sample to test that Dask PR:

```python
import numpy as np, cupy as cp, dask.array as da

rs = da.random.RandomState(RandomState=cp.random.RandomState)
x = rs.normal(10, 1, (1000,))
idx = rs.randint(0, len(x), (1000,))
print(x[idx].compute())
print(type(x[idx].compute()))
```
This is local, using the benchmark from this PR.
It seems like this is an issue with `persist`.
Could you please reduce this to an MRE and file it on the Distributed repo?
Sure! Would this be better suited for Distributed or Dask? It seems like this is a problem regardless of whether a cluster is in use:

```python
import cupy
import dask.array as da

rs = da.random.RandomState(RandomState=cupy.random.RandomState)
x = rs.normal(10, 1, (1000,))
idx = rs.randint(0, len(x), (1000,))
x[idx].persist()
```

Outputs:
Can you please double-check you indeed have dask/dask#7364 in your install? The example you posted works for me after that PR.
Just checked and I do have those commits in my install. I also see now that your local tests are failing too, so this is probably something with my local env. What version of CuPy are you using? I realized I had been using a version with cupy#4322 but I'm still getting the same errors with 8.5.0.
Would make sure that any existing dask install is fully removed before doing the dev install.
Made sure dask was entirely uninstalled + removed from the env before doing the dev install, still getting the same error - I think this might be an issue with my version/installation of CuPy.
You also need NumPy>=1.20; can you confirm you have that too?
Yup, that was it - thanks for the help!
LGTM, and since the issues @charlesbluca was having were resolved, I'm gonna go ahead and merge this. Thanks for working on this!
```python
rng = start_range(message="make array(s)", color="green")
x = rs.random((args.size, args.size), chunks=args.chunk_size).persist()
await wait(x)
end_range(rng)
```
Given it's only two lines, I'm not sure it's worth the trouble, but it feels like the `start_range`/`end_range` pair would be a perfect candidate for a decorator function. Just saying for future consideration, no action required. 🙂
Or using a `contextmanager`?

cc @shwina (in case you have thoughts on how to do this 🙂)
I was considering the same - I'll look into that if we end up extending the benchmark again 😁
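For future reference, a hedged sketch of the context-manager idea; it assumes the `start_range`/`end_range` helpers come from the `nvtx` package, which is not confirmed in this thread:

```python
from contextlib import contextmanager

from nvtx import end_range, start_range


@contextmanager
def nvtx_range(message, color="blue"):
    """Start an NVTX range on entry and end it on exit, even if the body raises."""
    rng = start_range(message=message, color=color)
    try:
        yield
    finally:
        end_range(rng)


# Usage, mirroring the annotated block above:
# with nvtx_range("make array(s)", color="green"):
#     x = rs.random((args.size, args.size), chunks=args.chunk_size).persist()
#     await wait(x)
```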
@gpucibot merge
Thanks for the PR, Charles, and for the review, Peter! 😄
Thanks for the reviews and environment help!
Just realized we missed the column masking case; submitted PR #553 to include that.
This PR adds the following to the local CuPy benchmark:

- `args.rmm_pool_size` is passed to the worker memory pool setup so user-defined pool sizes work (see the sketch at the end of this thread)

Some questions:
cc @shwina
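As a pointer for the pool-size item above, a hedged sketch of what passing `args.rmm_pool_size` through to each worker's memory pool setup might look like; the `setup_memory_pool` helper name and the `client.run` dispatch are assumptions, not the PR's actual diff:

```python
import rmm
from distributed import Client


def setup_memory_pool(pool_size=None, disable_pool=False):
    # Runs on each worker: re-create RMM's allocator, optionally with a
    # fixed-size memory pool (pool_size is in bytes, or None for the default).
    rmm.reinitialize(
        pool_allocator=not disable_pool,
        initial_pool_size=pool_size,
    )


# On the client side, apply the setup to every worker in the cluster:
# client = Client(cluster)
# client.run(setup_memory_pool, pool_size=args.rmm_pool_size)
```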