Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pyspark] Support stage-level scheduling for training #9519

Merged
merged 6 commits into from
Oct 17, 2023

Conversation

wbo4958
Copy link
Contributor

@wbo4958 wbo4958 commented Aug 24, 2023

XGBoost training task requires taking the whole GPU exclusively, so we have limited there's only 1 task running at the same time even for the ETL stages, which typically decreases the parallelism of spark tasks, and finally has some side-effect for the whole performance of xgboost end-2-end training.

The stage-level scheduling we are using in this PR does not change executor resources and hence does not require dynamic allocation to be enabled. The original executors are retained. The features used there, however, are new in Spark 3.4. We also changed CPU task resources therein to ensure only 1 task per gpu just for that barrier stage.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 24, 2023

@WeichenXu123 @trivialfis @eordentlich please help to review, Thx

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the excellent feature! You might need to update the document as well.

python-package/xgboost/spark/core.py Outdated Show resolved Hide resolved
Comment on lines 974 to 994
# Each training task requires cpu cores > total executor cores//2 + 1 which can
# make the tasks be sent to different executors.
#
# Please note that we can't set task_cpu_cores to the value which is smaller than
# total executor cores/2 because only task_gpu_amount can't make sure the tasks be
# sent to different executor even task_gpus=1.0

# Spark-rapids is a project to leverage GPUs to accelerate spark SQL and it can
# really help the performance. If spark-rapids is enabled. we don't allow other
# ETL gpu tasks running alongside training tasks to avoid OOM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by this comment. Let me try to rephrase it and see if I got it right.

To send the tasks to different executors, the task must require at least `total executor cores // 2 + 1` CPU cores.

Without configuring the `task_cpu_cores` to a larger value, we can't make sure tasks are sent to different executors even if `task_gpu_amount` is set to 1.

When the `spark-rapids` plugin is enabled for SQL acceleration, we need to prevent other GPU-based ETL tasks from running during training to avoid OOM errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the comments with a cleaner explanation?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


# task_gpus means how many gpu slots the task requires in a single GPU,
# it doesn't mean how many gpu shares it would like to require, so we
# can set it to any value of (0, 0.5] or 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but not [0.5, 1)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, spark will throw exception if it's in [0.5, 1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bug in spark, I deleted the comment.

Comment on lines 974 to 986
# Each training task requires cpu cores > total executor cores//2 + 1 which can
# make the tasks be sent to different executors.
Copy link
Contributor

@WeichenXu123 WeichenXu123 Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hacky, and it causes CPU cores wasted, assuming one xgboost train worker can only use one cpu core, but it makes spark allocate multiple CPU cores to the task, and these CPU resources cannot be used by other spark tasks until the xgboost train worker completes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. It will waste some CPU cores. Let's assume executor_cores=12, then the training task will require 7, but there are still 5 CPU cores left for the other spark task.

The ideal way to limit task number should rely on task_gpu_amount, but the current spark implementation has not supported it. We can report an issue for Spark and once Spark fixes that issue, we can rework this.

Copy link
Contributor

@WeichenXu123 WeichenXu123 Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q:

make the tasks be sent to different executors.

Why we need this rule ? one executor can hold multiple spark tasks, any issues it causes ?, e.g., we can config one exeuctor owns 4 cpu cores, 4 GPU cores,
then in this exeuctor we can start 4 xgboost train workers, each xgboost worker is allocated with 1 cpu and 1 GPU exclusively

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR has bypassed this scenario that the executor.gpu.amount > 1, since it requires some special configuration to make it work.

https://github.com/dmlc/xgboost/pull/9519/files#diff-8900fd245992de81e0a7b8cb7fcd18c37bae4473e8672e9b11fd7010cac358cfR929-R937

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wbo4958

Then I think we can directly set executor.gpu.amount=1 and executor.cores=1 when we start spark application ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WeichenXu123, we can't use gpu resources to limit the task number. I had a design doc (supporting stage-level scheduling) for spark-rapids-ml with a similar design pattern with xgboost. the doc link can be found at https://docs.google.com/document/d/1C6s1nqNkOl0roNtRA5x1JPUuM2jykuR00JyMzgt0HvU/edit?usp=sharing which explained why we can't use GPU resources to do the task limitation.

Copy link
Contributor

@WeichenXu123 WeichenXu123 Oct 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is bug:

Repro code:

import org.apache.spark.resource.{ResourceProfileBuilder, TaskResourceRequests}

val rdd = sc.range(0, 100, 1, 4)
var rdd1 = rdd.repartition(3)
val treqs = new TaskResourceRequests().cpus(1).resource("gpu", 1.0)
val rp = new ResourceProfileBuilder().require(treqs).build
rdd1 = rdd1.mapPartitions(iter => {Thread.sleep(30000); iter}).withResources(rp)
rdd1.collect()  # 3 spark tasks run in parallel on one executor
spark-shell --master spark://192.168.31.236:7077 --conf spark.executor.cores=12 --conf spark.executor.resource.gpu.amount=1

Let's see whether spark team can address the bug.
If it can't be fixed for now, let's use current workaround.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @WeichenXu123, Yeah, looks like it is a spark bug. But Using the task.cpus to limit the task number in this PR can still achieve the same goal as using the task.gpu.amount, except wasting some cpus cores. Can we merge this PR first, and change it after spark fixes this bug?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Aug 25, 2023

I did the performance test

Env:

Spark standalone cluster with 2 Nodes each having 12 CPU cores and 64G mem, and 1 TITAN V GPU with 12 G mem.

Dataset:

Mortgage, total 255, 962, 232 rows, 27 features + 1 label

Result,

  • W/O stage-level scheduling,
    .config("spark.executor.memory", "60g")
    .config("spark.executor.cores", 12)
    .config("spark.task.cpus", 12)
    .config("spark.executor.resource.gpu.amount", 1)
    .config("spark.task.resource.gpu.amount", 1)
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", 2000000) 

The training times are: 532.36, 534.46, average: 533.41

  • W/ stage-level scheduling
    .config("spark.executor.memory", "60g")
    .config("spark.executor.cores", 12)
    .config("spark.task.cpus", 1)
    .config("spark.executor.resource.gpu.amount", 1)
    .config("spark.task.resource.gpu.amount", 0.08)
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", 2000000) 

The training times are: 383.34, 383.11, average: 383.225

The ratio is 533.41/383.225 = 1.39, which means the PR can have 39% speedup.

The ETL phase is just reading the parquet file and doing the repartition, if we involve more real ETL operations, I guess, the stage-level scheduling PR can bring more speedup.

@eordentlich
Copy link

Any update on the status of this PR?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Oct 11, 2023

Hi @WeichenXu123, Could you help to review it again? Thx very much.

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM,

once #9519 (comment) is fixed, please file a followup PR to remove workaround code.

@wbo4958
Copy link
Contributor Author

wbo4958 commented Oct 16, 2023

Hi @trivialfis, Could you help trigger CI ?

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please provide an example usage that gets everything right (with no warning), and add comments on the example for each configuration with an explanation for why the configuration is set in such a way? If I were a user I would struggle to understand and address all potential warnings.

python-package/xgboost/spark/core.py Outdated Show resolved Hide resolved
python-package/xgboost/spark/core.py Outdated Show resolved Hide resolved
Comment on lines 974 to 994
# Each training task requires cpu cores > total executor cores//2 + 1 which can
# make the tasks be sent to different executors.
#
# Please note that we can't set task_cpu_cores to the value which is smaller than
# total executor cores/2 because only task_gpu_amount can't make sure the tasks be
# sent to different executor even task_gpus=1.0

# Spark-rapids is a project to leverage GPUs to accelerate spark SQL and it can
# really help the performance. If spark-rapids is enabled. we don't allow other
# ETL gpu tasks running alongside training tasks to avoid OOM
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please update the comments with a cleaner explanation?

@wbo4958
Copy link
Contributor Author

wbo4958 commented Oct 16, 2023

Could you please provide an example usage that gets everything right (with no warning), and add comments on the example for each configuration with an explanation for why the configuration is set in such a way? If I were a user I would struggle to understand and address all potential warnings.

I will file a follow-up PR to add related doc


if int(executor_cores) == 1:
# there will be only 1 task running at any time.
self.logger.info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this info instead of warning? Can we pick one consistently? (I like info a bit more, but feel free to pick one as long as it's consistent)

Copy link
Member

@trivialfis trivialfis Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wbo4958 Let's draw the line here:

  • warning for things that users should fix/address, similar to a bug but slightly less critical.
  • info for things that "good to know", but totally fine if I don't know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to info

python-package/xgboost/spark/core.py Show resolved Hide resolved

if float(task_gpu_amount) == float(executor_gpus):
self.logger.warning(
"The configuration of cores (exec=%s, task=%s, runnable tasks=1) will "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this CPU cores or GPU cores (I don't know what a GPU core is), but the condition is testing the number of GPUs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it's testing how many tasks will run at the same time on the executor.

task_gpu_amount is a fraction value. we can use executor_gpus / task_gpu_amount to get the concurrent tasks.

Copy link
Member

@trivialfis trivialfis Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you are raising a warning, could you please revise the message so that users understand what you mean by "cores"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this duplicated message since spark itself will throw a warning on the wasted resources.

python-package/xgboost/spark/core.py Show resolved Hide resolved

if not _is_standalone_or_localcluster(sc):
self.logger.warning(
"Stage-level scheduling in xgboost requires spark standalone or "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be used on other types of clusters (maybe in the future)? Is there any cloud service like EMR or Dataproc can be used with this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. from spark 3.5.1, we can enable this feature on spark standalone/ k8s / yarn cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a condition to check for 3.5.1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.5.1 is not released yet. I will file another PR once it's available.

@trivialfis trivialfis merged commit 4d1607e into dmlc:master Oct 17, 2023
22 checks passed
@wbo4958 wbo4958 deleted the stage-level branch October 17, 2023 23:39
trivialfis pushed a commit to trivialfis/xgboost that referenced this pull request Oct 18, 2023
trivialfis added a commit that referenced this pull request Oct 18, 2023
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
razdoburdin added a commit to IntelPython/xgboost_oneapi that referenced this pull request Oct 26, 2023
* [backport] Support pandas 2.1.0. (dmlc#9557) (dmlc#9655)

* [backport] Add support for cgroupv2. (dmlc#9651) (dmlc#9656)

* Bump version to 2.0.1. (dmlc#9660)

* [backport] [CI] Pull CentOS 7 images from NGC (dmlc#9666) (dmlc#9668)

* Fix build for GCC 8.x (dmlc#9670)

* [backport][pyspark] Support stage-level scheduling (dmlc#9519) (dmlc#9686)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>

* Fix build for AppleClang 11 (dmlc#9684)

* Fix libpath logic for Windows (dmlc#9687)

* [CI] Build libxgboost4j.dylib for Intel Mac (dmlc#9704)

* [jvm-packages] Remove hard dependency on libjvm (dmlc#9698) (dmlc#9705)

* Use sys.base_prefix instead of sys.prefix (dmlc#9711)

* Use sys.base_prefix instead of sys.prefix

* Update libpath.py too

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
razdoburdin added a commit to IntelPython/xgboost_oneapi that referenced this pull request Oct 26, 2023
* [backport] Support pandas 2.1.0. (dmlc#9557) (dmlc#9655)

* [backport] Add support for cgroupv2. (dmlc#9651) (dmlc#9656)

* Bump version to 2.0.1. (dmlc#9660)

* [backport] [CI] Pull CentOS 7 images from NGC (dmlc#9666) (dmlc#9668)

* Fix build for GCC 8.x (dmlc#9670)

* [backport][pyspark] Support stage-level scheduling (dmlc#9519) (dmlc#9686)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>

* Fix build for AppleClang 11 (dmlc#9684)

* Fix libpath logic for Windows (dmlc#9687)

* [CI] Build libxgboost4j.dylib for Intel Mac (dmlc#9704)

* [jvm-packages] Remove hard dependency on libjvm (dmlc#9698) (dmlc#9705)

* Use sys.base_prefix instead of sys.prefix (dmlc#9711)

* Use sys.base_prefix instead of sys.prefix

* Update libpath.py too

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
razdoburdin pushed a commit to IntelPython/xgboost_oneapi that referenced this pull request Oct 26, 2023
razdoburdin added a commit to razdoburdin/xgboost that referenced this pull request Oct 26, 2023
* [backport] Support pandas 2.1.0. (dmlc#9557) (dmlc#9655)

* [backport] Add support for cgroupv2. (dmlc#9651) (dmlc#9656)

* Bump version to 2.0.1. (dmlc#9660)

* [backport] [CI] Pull CentOS 7 images from NGC (dmlc#9666) (dmlc#9668)

* Fix build for GCC 8.x (dmlc#9670)

* [backport][pyspark] Support stage-level scheduling (dmlc#9519) (dmlc#9686)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>

* Fix build for AppleClang 11 (dmlc#9684)

* Fix libpath logic for Windows (dmlc#9687)

* [CI] Build libxgboost4j.dylib for Intel Mac (dmlc#9704)

* [jvm-packages] Remove hard dependency on libjvm (dmlc#9698) (dmlc#9705)

* Use sys.base_prefix instead of sys.prefix (dmlc#9711)

* Use sys.base_prefix instead of sys.prefix

* Update libpath.py too

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
razdoburdin pushed a commit to razdoburdin/xgboost that referenced this pull request Oct 26, 2023
razdoburdin added a commit to IntelPython/xgboost_oneapi that referenced this pull request Nov 23, 2023
* [backport] Support pandas 2.1.0. (dmlc#9557) (dmlc#9655)

* [backport] Add support for cgroupv2. (dmlc#9651) (dmlc#9656)

* Bump version to 2.0.1. (dmlc#9660)

* [backport] [CI] Pull CentOS 7 images from NGC (dmlc#9666) (dmlc#9668)

* Fix build for GCC 8.x (dmlc#9670)

* [backport][pyspark] Support stage-level scheduling (dmlc#9519) (dmlc#9686)

Co-authored-by: Bobby Wang <wbo4958@gmail.com>

* Fix build for AppleClang 11 (dmlc#9684)

* Fix libpath logic for Windows (dmlc#9687)

* [CI] Build libxgboost4j.dylib for Intel Mac (dmlc#9704)

* [jvm-packages] Remove hard dependency on libjvm (dmlc#9698) (dmlc#9705)

* Use sys.base_prefix instead of sys.prefix (dmlc#9711)

* Use sys.base_prefix instead of sys.prefix

* Update libpath.py too

* [backport] Fix using categorical data with the ranker. (dmlc#9753) (dmlc#9778)

* [jvm-packages] Add Scala version suffix to xgboost-jvm package (dmlc#9776)

* Update JVM script (dmlc#9714)

* Bump version to 2.0.2; revamp pom.xml

* Update instructions in prepare_jvm_release.py

* Fix formatting

* fix merging for R-package/configure

---------

Co-authored-by: Jiaming Yuan <jm.yuan@outlook.com>
Co-authored-by: Philip Hyunsu Cho <chohyu01@cs.washington.edu>
Co-authored-by: Bobby Wang <wbo4958@gmail.com>
Co-authored-by: Dmitry Razdoburdin <>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants