
[RFC] Unify device configuration. #7308

Closed
trivialfis opened this issue Oct 11, 2021 · 6 comments

@trivialfis
Member

trivialfis commented Oct 11, 2021

This is a continuation of #4600

Overview

Use global configuration

From my perspective, this method is cleaner and covers both DMatrix and Booster, so it's listed first. An easier-to-implement solution is described in the next section.

Define a new device parameter for XGBoost as a global configuration option and remove the existing device-related parameters and values, including gpu_hist, gpu_id, and predictor. For the native Python interface, it would look like this:

import xgboost as xgb

with xgb.config_context(device="CUDA:0"):
    Xy = xgb.DMatrix(X, y)
    booster = xgb.train({"tree_method": "hist"}, Xy)
    booster.predict(Xy)

The above code snippet should run on the first CUDA device, using the GPU implementation of the hist tree method. Also, prediction should run on the same device regardless of the location of the input data. The scikit-learn interface will look like this:

clf = xgb.XGBClassifier(device="CUDA:0", tree_method="hist")

with the config context created internally in each method of XGBClassifier. For R users, we also have the xgb.set_config function for changing global parameters.
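
For illustration, here is a minimal sketch of how each estimator method might apply the setting internally. The device attribute, the device keyword on config_context, and this wiring are assumptions of the proposal, not current behavior:

import xgboost as xgb

class XGBClassifier:
    def __init__(self, device="CPU", tree_method="hist"):
        self.device = device
        self.tree_method = tree_method

    def fit(self, X, y):
        # Enter the proposed config context so every native call made
        # inside this method picks up the device setting.
        with xgb.config_context(device=self.device):
            Xy = xgb.DMatrix(X, y)
            self._booster = xgb.train({"tree_method": self.tree_method}, Xy)
        return self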

JVM packages are lagging behind, but in theory, we can have something similar. For the Java binding, we can define functions similar to R's or Python's xgb.set_config to set the global parameter. For the Scala binding, we have high-level estimators like Python's XGBClassifier, so we can handle the configuration internally.

Last but not least, the C interface is the basis of all other interfaces, so its implementation should be trivial.

For handling existing code, my suggestion is simply to throw an informative error. For example, if the user has specified gpu_hist, then we require device to be set as well.
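
A hypothetical validation sketch of that error path (the helper name and the exact message are illustrative, not part of the proposal):

def validate_device_params(params: dict) -> None:
    # Hypothetical check: legacy GPU values now require an explicit
    # `device`, so there is a single source of truth.
    legacy = {"tree_method": "gpu_hist", "predictor": "gpu_predictor"}
    for key, value in legacy.items():
        if params.get(key) == value and "device" not in params:
            raise ValueError(
                f"`{key}={value}` requires `device` to be set as well, "
                "e.g. device='CUDA:0'."
            )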

Alternative solution

This might be more practical in the short term. The device parameter doesn't have to be a global parameter; it can be a booster parameter, like the currently available gpu_id. Hence we can keep it that way and reuse the gpu_id parameter. This is still a breaking change due to the other removed parameters, but it requires fewer changes. For the native Python interface, it would look like this:

import xgboost as xgb

Xy = xgb.DMatrix(X, y)
booster = xgb.train({"tree_method": "hist", "gpu_id": "CUDA:0"}, Xy)
# or, for compatibility reasons:
booster = xgb.train({"tree_method": "hist", "gpu_id": "0"}, Xy)

# Use the CPU for prediction.
booster.set_param({"gpu_id": "CPU"})
booster.predict(Xy)

Motivation

Device management has been a headache in the past. We removed the n_gpus parameter in the 1.0 release, which helped clean up the code a little, but there are still many other issues in the current device management configuration. The most important one is that we need a single source of truth for the device ordinal. Currently, the device is chosen based on the following inputs:

  • gpu_id parameter: the supposed single authority, which is rarely honored.
  • tree_method parameter: gpu_hist or not.
  • predictor parameter: gpu_predictor or not.
  • data: whether the data is already on the device (like cupy, cuDF).
  • first iteration: XGBoost tries to avoid copying data onto the device by using CPU prediction for the initial prediction.
  • environment: does the environment have a GPU at all? This matters after users load a pickled model on a CPU-only machine.
  • model: the user might continue training on an existing model, in which case we don't want to pull the data onto the GPU for the initial prediction.
  • custom objective: the returned gradient is on the CPU while XGBoost might be running on the GPU.

As one might see, there are too many correlated factors influencing the choice of device ordinal, and sometimes they conflict with each other. For instance, setting "gpu_hist" leads to gpu_id >= 0:

"gpu_hist" -> gpu_id = 0

then if a user wants to run prediction on the CPU, the predictor might be set:

booster.set_param({"predictor": "cpu_predictor"})

Then what's the current gpu_id? I don't know. The problem gets worse with inplace prediction and GPU data inputs. Also, with the OneAPI proposal from Intel, we have a growing number of configurations, and the existing approach simply cannot handle the complexity.
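
Putting the fragments above together, the confusing sequence in today's API looks like this (illustrative; the comments state the implied behavior):

booster = xgb.train({"tree_method": "gpu_hist"}, Xy)  # implies gpu_id = 0
booster.set_param({"predictor": "cpu_predictor"})     # run prediction on the CPU
booster.predict(Xy)
# What is the effective gpu_id now? The booster still carries gpu_id = 0,
# while the predictor setting points at the CPU.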

Implementation

Depending on which solution is chosen, global parameter or booster parameter, we might opt for a different implementation. But the general state transition should be the same.

  • For compatibility, if gpu_predictor or gpu_hist is chosen, a consistent device must also be specified; otherwise, there will be an error. By consistent, we mean the device should be set to CUDA:x. This is a breaking change, but it can be handled with a crafted error message.
  • If the device is selected to be CUDA, then the tree method must be one of {hist, gpu_hist, auto}. All of them become gpu_hist internally; for any other tree method, XGBoost will throw a not-implemented error (see the sketch after this list). We can have approx running on the GPU if needed, but that's beyond the scope of this RFC.
  • For inplace prediction, the device will continue to be chosen automatically; no change is needed.
  • For the scikit-learn interface, which uses inplace prediction automatically, the change would be matching the input data type to the device. Or we simply revert the configuration and let the user decide whether inplace prediction is desired. This one is a bit trickier, as inplace prediction reduces memory usage and latency dramatically, especially for dask. This needs more thought.
  • As for the heuristic of avoiding copying data to the GPU for the first prediction, my plan is to remove it and make the copy anyway; the memory usage is unlikely to exceed that of quantile sketching. Alternatively, we can run prediction in batches, like the initialization of ellpack.
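
A sketch of these resolution rules under the proposed device parameter (the function name and shape are illustrative only, not the actual implementation):

def resolve_tree_method(device: str, tree_method: str) -> str:
    """Reconcile the proposed `device` parameter with `tree_method`."""
    on_cuda = device.upper().startswith("CUDA")
    if tree_method == "gpu_hist" and not on_cuda:
        # Legacy value without a consistent device: informative error.
        raise ValueError("gpu_hist requires device='CUDA:<ordinal>'.")
    if on_cuda:
        if tree_method in ("hist", "gpu_hist", "auto"):
            return "gpu_hist"  # all map to the GPU hist implementation
        raise NotImplementedError(
            f"tree_method='{tree_method}' is not implemented on CUDA devices."
        )
    return tree_method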

Based on these rules, we have removed predictor, tree_method, the memory conservation heuristic, and the data input type from the decision-making process. Lastly, there are the environment and the custom objective; these two can continue to be handled as they are.

@trivialfis
Member Author

cc @RAMitchell @hcho3 @wbo4958 @JohnZed @dantegd @vepifanov @ShvetsKS @pseudotensor

This might be the most significant breaking change in a long time. Please help with comments and suggestions.

@trivialfis
Member Author

A previous attempt is at https://github.com/dmlc/xgboost/pull/6971/files. I have written some thoughts in the code comments there, but they are largely summarized here.

@wbo4958
Contributor

wbo4958 commented Oct 11, 2021

Looks like JVM can follow your suggestion easily.

@RAMitchell
Member

First example looks good to me:

with xgb.config_context(device="CUDA:0"):
    Xy = xgb.DMatrix(X, y)
    booster = xgb.train({"tree_method": "hist"}, Xy)
    booster.predict(Xy)

It is going to be slightly tedious to implement, as we have to change every language binding, but it seems like a very positive change.

@trivialfis
Member Author

trivialfis commented Oct 13, 2021

For implementing the change, I would like to create an independent branch in dmlc during development so that we can run CI with incremental changes.

trivialfis added a commit that referenced this issue Jan 28, 2022
This is the last PR for removing the omp global variable.

* Add a context object to the `DMatrix`. This bridges `DMatrix` with #7308.
* Require context to be available at the construction time of booster.
* Add `n_threads` support for R csc DMatrix constructor.
* Remove `omp_get_max_threads` in R glue code.
* Remove threading utilities that rely on omp global variable.
@trivialfis
Member Author

trivialfis commented Feb 17, 2023

I'm working on this now:

We won't use the global variable and will keep each context local to the booster instead. This saves us from handling multi-threaded applications.
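
As a usage-level illustration of that direction (treating device as a booster parameter is an assumption here, not a finalized API):

import numpy as np
import xgboost as xgb

X = np.random.rand(128, 4)
y = np.random.randint(2, size=128)
Xy = xgb.DMatrix(X, label=y)

# Each booster carries its own context, so boosters in one process can
# target different devices without sharing any global state.
booster_gpu = xgb.train({"tree_method": "hist", "device": "CUDA:0"}, Xy)
booster_cpu = xgb.train({"tree_method": "hist", "device": "CPU"}, Xy)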

cc @razdoburdin
