
Remove arguments related to cost-savings #1230

Merged 5 commits into NVIDIA:dev on Jul 26, 2024

Conversation

@amahussein (Collaborator) commented Jul 25, 2024

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes #1229, Fixes #1099

This PR avoids errors that could be triggered by passing cost-savings arguments.
Further cleanup of dead code can be done as part of the parent issue #1221.

  • remove the legacy spark_rapids_user_tools cmd
  • remove qualification arguments related to cost-savings
  • disable grouping of results by row_name
  • the file qualification_summary_full.csv is omitted

The following arguments are removed from the rapids_tools qualification cmd:

estimation_model: str = None, (because xgboost is the only option)
cpu_cluster_price: float = None,
estimated_gpu_cluster_price: float = None,
cpu_discount: int = None,
gpu_discount: int = None,
global_discount: int = None,
gpu_cluster_recommendation
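To make the removal explicit for users, one option is to fail fast when a removed flag is passed instead of silently ignoring it. Below is a minimal sketch of that idea; the function name `validate_qualification_args` is hypothetical and not part of the actual spark-rapids-tools code:

```python
# Hypothetical sketch: reject the removed cost-savings arguments with a clear
# error message. The argument names match the list above; the validation
# helper itself is illustrative only.
REMOVED_ARGS = {
    "estimation_model",
    "cpu_cluster_price",
    "estimated_gpu_cluster_price",
    "cpu_discount",
    "gpu_discount",
    "global_discount",
    "gpu_cluster_recommendation",
}

def validate_qualification_args(**kwargs):
    """Raise ValueError if any removed cost-savings argument is passed."""
    used = REMOVED_ARGS.intersection(kwargs)
    if used:
        raise ValueError(
            f"Arguments removed in this release: {sorted(used)}. "
            "Cost-savings estimation is no longer supported."
        )
    return kwargs

# Passing a removed flag fails fast with an actionable message:
try:
    validate_qualification_args(cpu_discount=20)
except ValueError as exc:
    print(exc)
```

Failing fast here is preferable to accepting-and-ignoring, since users relying on the old cost-savings output get an immediate signal that the behavior changed.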

Fix Cluster Recommendation when CPU cluster cannot be created

Additionally, this PR fixes the issue where we do not generate a cluster recommendation when a CPU cluster cannot be created (e.g., no matching executor instance found for the required number of cores).

Approach

The Scala tool now generates a recommended GPU cluster on a per-app basis (NVIDIA/spark-rapids-tools#1188). When a CPU cluster is not provided, we should use the values from the Scala tool output for our GPU cluster recommendation instead of Python's CPU<->GPU core matching.
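The fallback described above can be sketched as follows. This is an illustrative model only, with hypothetical names (`ClusterShape`, `recommend_gpu_cluster`) and a toy CPU-to-GPU instance mapping; it is not the actual spark-rapids-tools implementation:

```python
# Sketch of the fallback: prefer the inferred CPU cluster when available,
# otherwise trust the per-app GPU cluster emitted by the Scala tool output.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterShape:
    driver: str
    executor: str
    num_executors: int

# Toy CPU->GPU instance mapping for illustration (values taken from the
# example logs in this PR, not an exhaustive table).
GPU_EQUIVALENT = {"m6gd.4xlarge": "g5.4xlarge", "m6gd.xlarge": "g5.xlarge"}

def map_cpu_to_gpu(cpu: ClusterShape) -> ClusterShape:
    """Map a CPU cluster shape to a GPU equivalent (normal path)."""
    return ClusterShape(cpu.driver,
                        GPU_EQUIVALENT.get(cpu.executor, cpu.executor),
                        cpu.num_executors)

def recommend_gpu_cluster(inferred_cpu: Optional[ClusterShape],
                          scala_gpu: Optional[ClusterShape]) -> Optional[ClusterShape]:
    if inferred_cpu is not None:
        # Normal path: derive the GPU cluster from the inferred CPU cluster.
        return map_cpu_to_gpu(inferred_cpu)
    # Fallback: CPU cluster could not be inferred; use the Scala tool's
    # per-app recommended GPU cluster as-is.
    return scala_gpu
```

This mirrors the two log lines shown below: when inference fails, the recommendation still comes through (from the Scala output); when it succeeds, the CPU cluster drives the mapping.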

Output

Case 1: CPU cluster is not passed and we infer CPU cluster for each app

Logs (for each app):

INFO rapids.tools.cluster_inference: For App ID: app-20200423033538-0000, Unable to infer CPU cluster. Reason - No matching executor instance found for num cores = 80
INFO rapids.tools.cluster_recommender: For App ID: app-20200423033538-0000, CPU cluster: N/A; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 16 X g5.4xlarge>
INFO rapids.tools.cluster_recommender: For App ID: app-20210509200722-0001, Inferred CPU cluster: <Driver: m6gd.xlarge, Executor: 1 X m6gd.4xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 1 X g5.4xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node             | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation             | Config                       | Recommendation              |
|    |                     |                         | Category**      |                            | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | g5.4xlarge                 | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.4xlarge to g5.4xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+

Case 2: CPU cluster is passed as input (--cluster <cluster>)

Logs (for all apps):

INFO rapids.tools.cluster_recommender: CPU cluster: <Driver: m6gd.xlarge, Executor: 2 X m6gd.xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 2 X g5.xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node           | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation           | Config                       | Recommendation              |
|    |                     |                         | Category**      |                          | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | m6gd.xlarge to g5.xlarge | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.xlarge to g5.xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+

PR for this change: amahussein#13

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#1229

- remove the legacy `spark_rapids_user_tools` cmd
- remove qualification arguments related to cost-savings
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#1099

- disable grouping of results by row_name
- the file `qualification_summary_full.csv` is omitted
@amahussein amahussein added user_tools affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) labels Jul 25, 2024
@amahussein amahussein self-assigned this Jul 25, 2024
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
amahussein and others added 2 commits July 26, 2024 00:06
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
* Fix node recommendation when CPU cluster cannot be determined

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Move cluster cols to config file

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

---------

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa (Collaborator) left a comment

Thanks @amahussein for fixing these.

@cindyyuanjiang (Collaborator) left a comment

Thanks @amahussein!

@nartal1 (Collaborator) left a comment

Thanks @amahussein !

@amahussein (Collaborator, Author)

Thanks @parthosa for appending to this PR.
Thanks @cindyyuanjiang and @nartal1 for reviewing.

@amahussein amahussein merged commit ed91cc0 into NVIDIA:dev Jul 26, 2024
14 checks passed
@amahussein amahussein deleted the spark-rapids-tools-1221-cost-args branch July 26, 2024 21:48

Successfully merging this pull request may close these issues.

- [TASK] Remove arguments related to cost-savings
- [FEA] Disable grouping applications by name
4 participants