
feat(python): Enable collection with gpu engine #17550

Merged: 4 commits into pola-rs:main on Jul 25, 2024

Conversation

@wence- wence- (Collaborator) commented Jul 10, 2024

Introduce option to collect a query using the cudf_polars gpu engine. By default this falls back transparently to the default cpu engine if the query cannot be executed. This can be controlled by setting POLARS_GPU_DISABLE_FALLBACK (mostly useful for testing/debugging).

The import of cudf_polars is currently quite expensive (will improve in the coming months), so since we lazy-load the gpu engine, the first query executed on gpu will pay this (one-time) cost.
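A minimal usage sketch under the interface proposed here (a gpu=True flag on collect, later in this thread replaced by an engine= parameter); the assumption that POLARS_GPU_DISABLE_FALLBACK is read at collect time is mine:

import os
import polars as pl

# Optional: make unsupported queries raise instead of transparently
# falling back to the cpu engine (useful for testing/debugging).
os.environ["POLARS_GPU_DISABLE_FALLBACK"] = "1"

q = (
    pl.LazyFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
    .group_by("a")
    .agg(pl.col("b").sum())
)
df = q.collect(gpu=True)  # first gpu collect pays the one-time cudf_polars import cost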

@wence- wence- (Collaborator, Author) commented Jul 10, 2024

For discussion: is this the interface we want? (cc @r-brink, @ritchie46)

Lazy-loading the gpu engine keeps `import polars` fast even when the relevant packages are available, but slows down the first query anyone runs on gpu.

I'm not sure there's a good solution here: eager-loading is fiddly because then it's easy to get into circular import issues between cudf_polars and polars itself (since cudf_polars depends on polars).
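A hypothetical sketch of the lazy-import pattern being weighed here (names are illustrative, not the actual implementation):

_ENGINE = None

def _get_gpu_engine():
    # Import on first use: `import polars` stays fast, and the circular
    # import is avoided because cudf_polars (which depends on polars) is
    # only loaded once polars itself is fully initialised.
    global _ENGINE
    if _ENGINE is None:
        import cudf_polars  # one-time cost, paid by the first gpu query
        _ENGINE = cudf_polars
    return _ENGINE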

@wence- wence- changed the title feat(python): enable collection with gpu engine feat(python): Enable collection with gpu engine Jul 10, 2024
@github-actions github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels and removed the title needs formatting label Jul 10, 2024
@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch from d5eb4fd to ae70be9 on July 10, 2024 11:56
Two review threads on py-polars/polars/lazyframe/frame.py (outdated, resolved)
codecov bot commented Jul 10, 2024

Codecov Report

Attention: Patch coverage is 91.17647% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.53%. Comparing base (26a16e3) to head (06bb0bd).
Report is 16 commits behind head on main.

Files                                 Patch %   Lines
py-polars/polars/lazyframe/frame.py   83.33%    3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #17550      +/-   ##
==========================================
+ Coverage   80.49%   80.53%   +0.03%     
==========================================
  Files        1503     1505       +2     
  Lines      197054   197187     +133     
  Branches     2805     2810       +5     
==========================================
+ Hits       158615   158795     +180     
+ Misses      37918    37873      -45     
+ Partials      521      519       -2     


@jfpuget jfpuget commented Jul 18, 2024

This looks great.

What about using it on a multi-GPU workstation or multi-GPU cloud image? Can we select a specific GPU when running on a multi-GPU workstation? PyTorch has a "cuda" device, but also "cuda:n" where n is the GPU index. The latter is very useful when one wants to run different pipelines on each available GPU.

@ritchie46 ritchie46 (Member) commented
Yes, this seems right to me, @wence-. 👍

I hope we will be able to do mixed subplans soon. If we can, we must sell that in the docstrings as well.

@wence- wence- (Collaborator, Author) commented Jul 19, 2024

I think we want to allow (as @jfpuget suggests) configuration of which device the query will execute on. Let's paint this shed:

Some suggestions:

We could pick something like the PyTorch approach and, rather than a boolean flag gpu=True/False, introduce an engine=... parameter. That way, we avoid some of the complexity around streaming/background mode not being compatible (for now) with the gpu engine.

The downside of this approach is that it is (slightly) more complex, and if we use a string for the engine parameter we can't provide a fully-typed signature. We could get around that by accepting strings or small classes.

For example (sketch):

from dataclasses import dataclass
from typing import Literal

class EngineConfig:
    ...  # common base class (sketch)

@dataclass
class GpuEngine(EngineConfig):
    device: int
    # example other options; advanced users may wish to control
    # the memory allocator used (rather than using our default)
    ...  # other options here

@dataclass
class StreamingEngine(EngineConfig):
    ...  # streaming options

...

EngineOptions = Literal["default", "cpu", "streaming", "gpu"] | EngineConfig

def collect(..., engine: EngineOptions = "default"): ...
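Hypothetical usage under that sketch, covering both the string shorthand and the typed configuration:

q.collect(engine="gpu")                # engine defaults
q.collect(engine=GpuEngine(device=1))  # typed, per-query device selection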

Or, we can stick with the current approach of configuration through environment variables/pl.Config, so we would have:

export POLARS_GPU_DEVICE=1
python run.py  # runs query on device 1

or, equivalently, in Python:

with pl.Config(gpu_device=1):
    q.collect(gpu=True)  # runs on device 1

Advantages: fits with the rest of the polars configuration setup, and keeps the default UX simple

Disadvantages: passing "complicated" additional configuration options is more challenging, since we can only shove (effectively) strings in as values.

@jfpuget jfpuget commented Jul 22, 2024

PyTorch uses either a string or a device object. The latter enables type checking. Seems similar to what you propose.

@ritchie46 ritchie46 (Member) commented
I think it is best to have this local (e.g. with arguments). That way you can collect multiple plans on different devices if you'd like.

Let's go for the engine options API.

@jfpuget jfpuget commented Jul 22, 2024

To be more complete, PyTorch and others like vLLM also allow environment variables.

With the environment variable CUDA_VISIBLE_DEVICES (if I remember the name correctly) you can select which GPUs a process sees, before importing pytorch, e.g.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"
import torch

Then in the code you can create a device for one of the four GPUs that are accessible, e.g.

x = torch.tensor(data, device='cuda:2')

or

device = torch.device("cuda:2")
x = torch.tensor(data, device=device)

@jfpuget jfpuget commented Jul 22, 2024

TBH I don't know if the CUDA_VISIBLE_DEVICES variable is read when torch is imported or when a device is created.

@wence- wence- (Collaborator, Author) commented Jul 22, 2024

CUDA_VISIBLE_DEVICES affects which physical devices are accessible to the cuda driver and runtime. The mapping is initialised during the initialisation of the cuda driver. So in response to @jfpuget:

TBH I don't know if the CUDA_VISIBLE_DEVICES variable is read when torch is imported or when a device is created.

It depends on when torch initialises the cuda runtime/driver stack.

However, in any case, in a single process once the driver is initialised, any subsequent modification of CUDA_VISIBLE_DEVICES does nothing.

Moreover, the mapping is managed by the driver, so we don't need to do anything special here to support it: the ordinal device ID that one passes maps directly onto the ID in CUDA_VISIBLE_DEVICES order.
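To illustrate that ordinal mapping, a hypothetical sketch reusing the GpuEngine class from the earlier proposal:

import os

# Must be set before the cuda driver is initialised in this process;
# modifying it afterwards has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "4,5,6,7"

# Ordinal device IDs follow CUDA_VISIBLE_DEVICES order, so device=2
# selects physical GPU 6 (the third entry in the list above).
q.collect(engine=GpuEngine(device=2))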

@jfpuget jfpuget commented Jul 22, 2024

It depends on when torch initialises the cuda runtime/driver stack.

Exactly, and I don't know when this happens. That is why I set the environment variable before importing torch in my code.

@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch 2 times, most recently from 90a5de7 to 7088992 on July 22, 2024 17:14
@wence- wence- marked this pull request as ready for review July 22, 2024 17:16
@wence- wence- (Collaborator, Author) commented Jul 22, 2024

This is getting close, I think; I've implemented the engine=SomeObject approach.

@bdice, some suggestions on the pyproject deps side of things would be good.

We are planning that:

pip install polars[gpu]

gets the CUDA-12 enabled package. We will provide fallback errors on our (cudf-polars) side at import if the host CTK is too old (and users should therefore install cudf-polars-cu11 instead).

There is not (yet) a package for cudf-polars on PyPI (that is coming as part of the RAPIDS 24.08 release).
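A sketch of the merged interface as described in the commit message below; the file name is illustrative and the exact GPUEngine options shown are assumptions:

import polars as pl

q = (
    pl.scan_parquet("data.parquet")  # illustrative input
    .group_by("key")
    .agg(pl.col("value").mean())
)

# Requires: pip install polars[gpu]  (CUDA 12; cudf-polars-cu11 for CUDA 11)
df = q.collect(engine=pl.GPUEngine(device=0))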

@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch 2 times, most recently from df7e427 to 67065b8 on July 24, 2024 13:16
Introduce option to collect a query using the cudf_polars gpu engine.
By default this falls back transparently to the default cpu engine if
the query cannot be executed. Configuration of specifics of the GPU
engine can be controlled by passing a GPUEngine object to the new
`engine` argument to `collect`.

The import of cudf_polars is currently quite expensive (will improve
in the coming months), so since we lazy-load the gpu engine, the first
query executed on gpu will pay this (one-time) cost.
@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch from 67065b8 to 98cc120 on July 24, 2024 13:18
@lithomas1 lithomas1 left a comment

Thanks for putting this up!

Looking forward to collecting on GPU.

Some packaging thoughts.

Review threads on py-polars/pyproject.toml (three) and py-polars/polars/lazyframe/frame.py (one), resolved
@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch from 52da91f to 7fafb7c on July 25, 2024 10:49
@wence- wence- force-pushed the wence/fea/polars-gpu-collect branch from 7fafb7c to 06bb0bd on July 25, 2024 11:01
@ritchie46 ritchie46 (Member) commented Jul 25, 2024

Alright. Thanks everyone! Going in. :shipit:

@ritchie46 ritchie46 merged commit 4739460 into pola-rs:main Jul 25, 2024
14 checks passed
@wence- wence- deleted the wence/fea/polars-gpu-collect branch July 26, 2024 09:00