From 526a4691704f674f7845bdd476fa845ed356630d Mon Sep 17 00:00:00 2001
From: fis
Date: Wed, 26 Oct 2022 02:17:19 +0800
Subject: [PATCH 1/5] [pyspark] Improve tutorial on enabling GPU support.

[skip ci]
---
 doc/tutorials/spark_estimator.rst | 93 +++++++++++++++++++------------
 1 file changed, 58 insertions(+), 35 deletions(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index acacada0b3eb..2bd242ca5ad7 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -83,17 +83,50 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************
 
-XGBoost PySpark supports GPU training and prediction. To enable GPU support, first you
-need to install the XGBoost and the `cuDF <https://docs.rapids.ai/api/cudf/stable/>`_
-package. Then you can set `use_gpu` parameter to `True`.
+XGBoost PySpark fully supports GPU acceleration, users are not only able to enable
+efficient training but also utilize their GPUs for the whole PySpark pipeline including
+ETL and inference. In the following sections, we will walk through an example of training
+on a PySpark standalone GPU cluster. To get started, first we need to install some
+additional packages, and then we can set the `use_gpu` parameter to `True`.
 
-Below tutorial demonstrates how to train a model with XGBoost PySpark GPU on Spark
-standalone cluster.
+Prepare the necessary packages
+==============================
+
+Aside from the PySpark and XGBoost modules, we also need the `cuDF
+<https://docs.rapids.ai/api/cudf/stable/>`_ package for handling Spark dataframes. We
+recommend using either Conda or Virtualenv to manage Python dependencies for PySpark
+jobs. Please refer to `How to Manage Python Dependencies in PySpark
+<https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_
+for more details on PySpark dependency management.
+
+In short, to create a Python environment that can be sent to a remote cluster using
+virtualenv and pip:
+
+.. coda-black:: bash
+
+  python -m venv xgboost_env
+  source xgboost_env/bin/activate
+  pip install pyarrow pandas venv-pack cudf xgboost
+  venv-pack -o xgboost_env.tar.gz
+
+with conda:
+
+.. code-block:: bash
+
+  conda create -y -n xgboost_env -c conda-forge conda-pack python=3.9
+  conda activate xgboost_env
+  # use conda when the supported version of xgboost (1.7) is released on conda-forge
+  pip install xgboost
+  conda install cudf pyarrow pandas -c rapids -c nvidia -c conda-forge
+  conda pack -f -o xgboost_env.tar.gz
 
 Write your PySpark application
 ==============================
 
+Below snippet is a toy example for training an xgboost model with PySpark. Notice that we are
+using a list of feature names and the additional parameter ``use_gpu``:
+
 .. code-block:: python
 
   from pyspark.sql import SparkSession
   from xgboost.spark import SparkXGBRegressor
 
   spark = SparkSession.builder.getOrCreate()
 
   # read data into spark dataframe
   train_data_path = "xxxx/train"
   train_df = spark.read.parquet(train_data_path)
 
   test_data_path = "xxxx/test"
   test_df = spark.read.parquet(test_data_path)
 
   # assume the label column is named "class"
   label_name = "class"
 
   # get a list with feature column names
   feature_names = [x.name for x in train_df.schema if x.name != label_name]
 
   # create a xgboost pyspark regressor estimator and set use_gpu=True
   regressor = SparkXGBRegressor(
     features_col=feature_names,
     label_col=label_name,
     num_workers=2,
     use_gpu=True,
   )
 
   # train and return the model
   model = regressor.fit(train_df)
 
   # predict on test data
   predict_df = model.transform(test_df)
   predict_df.show()
 
-Prepare the necessary packages
-==============================
-
-We recommend using Conda or Virtualenv to manage python dependencies
-in PySpark. Please refer to
-`How to Manage Python Dependencies in PySpark <https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html>`_.
-
-.. code-block:: bash
-
-  conda create -y -n xgboost-env -c conda-forge conda-pack python=3.9
-  conda activate xgboost-env
-  pip install xgboost
-  conda install cudf -c rapids -c nvidia -c conda-forge
-  conda pack -f -o xgboost-env.tar.gz
-
 Submit the PySpark application
 ==============================
 
-Assuming you have configured your Spark cluster with GPU support, if not yet, please
+We assume you have already configured your Spark cluster with GPU support. Otherwise, please
 refer to `spark standalone configuration with GPU support
 <https://spark.apache.org/docs/latest/spark-standalone.html#resource-allocation-and-configuration-overview>`_.
 
 .. code-block:: bash
@@ -158,10 +176,13 @@ refer to `spark standalone configuration with GPU support
 
   spark-submit \
     --master spark://<master-ip>:7077 \
     --conf spark.executor.resource.gpu.amount=1 \
     --conf spark.task.resource.gpu.amount=1 \
-    --archives xgboost-env.tar.gz#environment \
+    --archives xgboost_env.tar.gz#environment \
     xgboost_app.py
 
+The submit command sends the Python environment created by pip or conda along with the
+specification of GPU allocation. We will revisit this command later on.
+
 Model Persistence
 =================
 
 To export the underlying booster model used by XGBoost:
 
   # the same booster object returned by xgboost.train
   booster: xgb.Booster = model.get_booster()
   booster.predict(...)
-  booster.save_model("model.json")
+  booster.save_model("model.json")  # or model.ubj, depending on your choice of format.
 
-This booster is shared by other Python interfaces and can be used by other language
-bindings like the C and R packages. Lastly, one can extract a booster file directly from
-saved spark estimator without going through the getter:
+This booster is not only shared by other Python interfaces but also used by all the
+XGBoost bindings including the C, Java, and R packages. Lastly, one can extract the
+booster file directly from a saved spark estimator without going through the getter:
 
 .. code-block:: python
 
   import xgboost as xgb
 
   bst = xgb.Booster()
+  # Loading the model saved in the previous snippet
   bst.load_model("/tmp/xgboost-pyspark-model/model/part-00000")
 
-Accelerate the whole pipeline of xgboost pyspark
-================================================
+Accelerate the whole pipeline for xgboost pyspark
+=================================================
 
-With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_,
-you can accelerate the whole pipeline (ETL, Train, Transform) for xgboost pyspark
-without any code change by leveraging GPU.
-
-Below is a simple example submit command for enabling GPU acceleration:
+With `RAPIDS Accelerator for Apache Spark <https://nvidia.github.io/spark-rapids/>`_, you
+can leverage GPUs to accelerate the whole pipeline (ETL, Train, Transform) for xgboost
+pyspark without any Python code change. An example submit command is shown below with
+additional spark configurations and dependencies:
 
 .. code-block:: bash
@@ -219,8 +241,9 @@ Below is a simple example submit command for enabling GPU acceleration:
     --packages com.nvidia:rapids-4-spark_2.12:22.08.0 \
     --conf spark.plugins=com.nvidia.spark.SQLPlugin \
     --conf spark.sql.execution.arrow.maxRecordsPerBatch=1000000 \
-    --archives xgboost-env.tar.gz#environment \
+    --archives xgboost_env.tar.gz#environment \
     xgboost_app.py
 
-When rapids plugin is enabled, both of the JVM rapids plugin and the cuDF Python are
-required for the acceleration.
+When the rapids plugin is enabled, both the JVM rapids plugin and the cuDF Python package
+are required. More configuration options can be found in the RAPIDS link above along with
+details on the plugin.
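
A quick aside on the exported booster above: because ``model.json`` (or ``model.ubj``) is
the standard XGBoost format, the file can be consumed entirely outside of Spark. Below is
a minimal single-node sketch; the file name follows the persistence snippet in the patch,
while the feature columns are hypothetical placeholders for the real training schema:

.. code-block:: python

  import pandas as pd
  import xgboost as xgb

  bst = xgb.Booster()
  bst.load_model("model.json")

  # Column names must match the feature names seen during training;
  # "feature_1" and "feature_2" are stand-ins for the actual schema.
  X = pd.DataFrame({"feature_1": [1.0, 2.0], "feature_2": [0.5, 0.1]})
  print(bst.predict(xgb.DMatrix(X)))
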
From 7abbd5803783551b280941b28471dc3040bec4a9 Mon Sep 17 00:00:00 2001
From: fis
Date: Wed, 26 Oct 2022 02:35:37 +0800
Subject: [PATCH 2/5] typo

[skip ci]
---
 doc/tutorials/spark_estimator.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index 2bd242ca5ad7..56809706130d 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -102,7 +102,7 @@ for more details on PySpark dependency management.
 In short, to create a Python environment that can be sent to a remote cluster using
 virtualenv and pip:
 
-.. coda-black:: bash
+.. code-block:: bash
 
   python -m venv xgboost_env
   source xgboost_env/bin/activate
@@ -124,7 +124,7 @@ with conda:
 Write your PySpark application
 ==============================
 
-Below snippet is a toy example for training an xgboost model with PySpark. Notice that we are
+Below snippet is a small example for training an xgboost model with PySpark. Notice that we are
 using a list of feature names and the additional parameter ``use_gpu``:
 
 .. code-block:: python
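
As a side note to the training snippet above, the regressor has a classification
counterpart in ``xgboost.spark`` that accepts the same GPU-related parameters. A minimal
sketch, reusing the hypothetical ``train_df``, ``test_df``, ``feature_names``, and
``label_name`` from the tutorial snippet:

.. code-block:: python

  from xgboost.spark import SparkXGBClassifier

  classifier = SparkXGBClassifier(
    features_col=feature_names,  # the same list of feature column names
    label_col=label_name,
    num_workers=2,
    use_gpu=True,
  )
  model = classifier.fit(train_df)
  model.transform(test_df).show()
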
From e6e35492f5eccdeb16fe9d0ffb504d3d3e32de00 Mon Sep 17 00:00:00 2001
From: fis
Date: Wed, 26 Oct 2022 02:56:52 +0800
Subject: [PATCH 3/5] Fix cuDF install.

---
 doc/tutorials/spark_estimator.rst | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index 56809706130d..fe4e1bc5d03a 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -106,7 +106,9 @@ virtualenv and pip:
 
   python -m venv xgboost_env
   source xgboost_env/bin/activate
-  pip install pyarrow pandas venv-pack cudf xgboost
+  pip install pyarrow pandas venv-pack xgboost
+  # https://rapids.ai/pip.html#install
+  pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
   venv-pack -o xgboost_env.tar.gz
 
 with conda:

From 23b04d8f01cbd0b756da8a0dc39616305482a233 Mon Sep 17 00:00:00 2001
From: fis
Date: Wed, 26 Oct 2022 02:58:10 +0800
Subject: [PATCH 4/5] remove dask

[skip ci]
---
 doc/tutorials/spark_estimator.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index fe4e1bc5d03a..d8d31d3cc024 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -108,7 +108,7 @@ virtualenv and pip:
   source xgboost_env/bin/activate
   pip install pyarrow pandas venv-pack xgboost
   # https://rapids.ai/pip.html#install
-  pip install cudf-cu11 dask-cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
+  pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
   venv-pack -o xgboost_env.tar.gz
 
 with conda:

From 0d18d94b72c18978b8ba9a8187faf1c11a79b7e4 Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Wed, 26 Oct 2022 15:38:55 +0800
Subject: [PATCH 5/5] Apply suggestions from code review

[skip ci]

Co-authored-by: Philip Hyunsu Cho
---
 doc/tutorials/spark_estimator.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/doc/tutorials/spark_estimator.rst b/doc/tutorials/spark_estimator.rst
index d8d31d3cc024..44e7a957513b 100644
--- a/doc/tutorials/spark_estimator.rst
+++ b/doc/tutorials/spark_estimator.rst
@@ -83,7 +83,7 @@ generate result dataset with 3 new columns:
 XGBoost PySpark GPU support
 ***************************
 
-XGBoost PySpark fully supports GPU acceleration, users are not only able to enable
+XGBoost PySpark fully supports GPU acceleration. Users are not only able to enable
 efficient training but also utilize their GPUs for the whole PySpark pipeline including
 ETL and inference. In the following sections, we will walk through an example of training
 on a PySpark standalone GPU cluster. To get started, first we need to install some
@@ -111,7 +111,7 @@ virtualenv and pip:
   pip install cudf-cu11 --extra-index-url=https://pypi.ngc.nvidia.com
   venv-pack -o xgboost_env.tar.gz
 
-with conda:
+With Conda:
 
 .. code-block:: bash
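
Before running the full training job, it can be worth verifying that the packed
environment shipped through ``--archives`` actually provides cuDF on the executors. Below
is a hypothetical smoke test, assuming it is submitted with the same ``spark-submit``
options shown earlier in the tutorial:

.. code-block:: python

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.getOrCreate()

  def probe(_):
      # Raises ImportError on the executor if the archive lacks cuDF.
      import cudf
      yield cudf.__version__

  # A single task is enough for a smoke test.
  print(spark.sparkContext.parallelize([0], numSlices=1).mapPartitions(probe).collect())
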