
Remove arguments related to cost-savings #1230

Merged 5 commits into NVIDIA:dev on Jul 26, 2024

Conversation

@amahussein (Collaborator) commented Jul 25, 2024

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes #1229, Fixes #1099

This PR avoids errors that could be triggered by passing cost-savings arguments.
Further cleanup of dead code can be done as part of the parent issue #1221.

  • remove the legacy spark_rapids_user_tools cmd
  • remove qualification arguments related to cost-savings
  • disable grouping of results by row_name
  • the file qualification_summary_full.csv is omitted

The following arguments are removed from the rapids_tools qualification cmd:

estimation_model: str = None, (because xgboost is the only option)
cpu_cluster_price: float = None,
estimated_gpu_cluster_price: float = None,
cpu_discount: int = None,
gpu_discount: int = None,
global_discount: int = None,
gpu_cluster_recommendation
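To make the removal explicit for users, one option is to fail fast when a removed flag is passed instead of silently ignoring it. Below is a minimal sketch of that idea; the function name `validate_qualification_args` is hypothetical and not part of the actual spark-rapids-tools code:

```python
# Hypothetical sketch: reject the removed cost-savings arguments with a clear
# error message. The argument names match the list above; the validation
# helper itself is illustrative only.
REMOVED_ARGS = {
    "estimation_model",
    "cpu_cluster_price",
    "estimated_gpu_cluster_price",
    "cpu_discount",
    "gpu_discount",
    "global_discount",
    "gpu_cluster_recommendation",
}

def validate_qualification_args(**kwargs):
    """Raise ValueError if any removed cost-savings argument is passed."""
    used = REMOVED_ARGS.intersection(kwargs)
    if used:
        raise ValueError(
            f"Arguments removed in this release: {sorted(used)}. "
            "Cost-savings estimation is no longer supported."
        )
    return kwargs

# Passing a removed flag fails fast with an actionable message:
try:
    validate_qualification_args(cpu_discount=20)
except ValueError as exc:
    print(exc)
```

Failing fast here is preferable to accepting-and-ignoring, since users relying on the old cost-savings output get an immediate signal that the behavior changed.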

Fix Cluster Recommendation when CPU cluster cannot be created

Additionally, this PR fixes the issue where we do not generate a cluster recommendation when a CPU cluster cannot be created (e.g., no matching executor instance found for the required number of cores).

Approach

The Scala tool now generates a recommended GPU cluster on a per-app basis (NVIDIA/spark-rapids-tools#1188). When a CPU cluster is not provided, we should use the values from the Scala tool output for our GPU cluster recommendation instead of Python's CPU<->GPU core matching.
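The fallback described above can be sketched as follows. This is an illustrative model only, with hypothetical names (`ClusterShape`, `recommend_gpu_cluster`) and a toy CPU-to-GPU instance mapping; it is not the actual spark-rapids-tools implementation:

```python
# Sketch of the fallback: prefer the inferred CPU cluster when available,
# otherwise trust the per-app GPU cluster emitted by the Scala tool output.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ClusterShape:
    driver: str
    executor: str
    num_executors: int

# Toy CPU->GPU instance mapping for illustration (values taken from the
# example logs in this PR, not an exhaustive table).
GPU_EQUIVALENT = {"m6gd.4xlarge": "g5.4xlarge", "m6gd.xlarge": "g5.xlarge"}

def map_cpu_to_gpu(cpu: ClusterShape) -> ClusterShape:
    """Map a CPU cluster shape to a GPU equivalent (normal path)."""
    return ClusterShape(cpu.driver,
                        GPU_EQUIVALENT.get(cpu.executor, cpu.executor),
                        cpu.num_executors)

def recommend_gpu_cluster(inferred_cpu: Optional[ClusterShape],
                          scala_gpu: Optional[ClusterShape]) -> Optional[ClusterShape]:
    if inferred_cpu is not None:
        # Normal path: derive the GPU cluster from the inferred CPU cluster.
        return map_cpu_to_gpu(inferred_cpu)
    # Fallback: CPU cluster could not be inferred; use the Scala tool's
    # per-app recommended GPU cluster as-is.
    return scala_gpu
```

This mirrors the two log lines shown below: when inference fails, the recommendation still comes through (from the Scala output); when it succeeds, the CPU cluster drives the mapping.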

Output

Case 1: CPU cluster is not passed and we infer CPU cluster for each app

Logs (for each app):

INFO rapids.tools.cluster_inference: For App ID: app-20200423033538-0000, Unable to infer CPU cluster. Reason - No matching executor instance found for num cores = 80
INFO rapids.tools.cluster_recommender: For App ID: app-20200423033538-0000, CPU cluster: N/A; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 16 X g5.4xlarge>
INFO rapids.tools.cluster_recommender: For App ID: app-20210509200722-0001, Inferred CPU cluster: <Driver: m6gd.xlarge, Executor: 1 X m6gd.4xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 1 X g5.4xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node             | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation             | Config                       | Recommendation              |
|    |                     |                         | Category**      |                            | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | g5.4xlarge                 | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.4xlarge to g5.4xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+----------------------------+------------------------------+-----------------------------+

Case 2: CPU cluster is passed as input (--cluster <cluster>)

Logs (for all apps):

INFO rapids.tools.cluster_recommender: CPU cluster: <Driver: m6gd.xlarge, Executor: 2 X m6gd.xlarge>; Recommended GPU cluster: <Driver: m6gd.xlarge, Executor: 2 X g5.xlarge>

Final Result:

+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+
|    | App Name            | App ID                  | Estimated GPU   | Qualified Node           | Full Cluster                 | GPU Config                  |
|    |                     |                         | Speedup         | Recommendation           | Config                       | Recommendation              |
|    |                     |                         | Category**      |                          | Recommendations*             | Breakdown*                  |
|----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------|
|  1 | spark_test_apps.py  | app-20200423033538-0000 | Large           | m6gd.xlarge to g5.xlarge | app-20200423033538-0000.conf | app-20200423033538-0000.log |
|  2 | Spark shell         | app-20210509200722-0001 | Small           | m6gd.xlarge to g5.xlarge | app-20210509200722-0001.conf | app-20210509200722-0001.log |
+----+---------------------+-------------------------+-----------------+--------------------------+------------------------------+-----------------------------+

PR for this change: amahussein#13

Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#1229

- remove the legacy `spark_rapids_user_tools` cmd
- remove qualification arguments related to cost-savings
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>

Fixes NVIDIA#1099

- disable grouping of results by row_name
- the file `qualification_summary_full.csv` is omitted
@amahussein amahussein added user_tools affect-output A change that modifies the output (add/remove/rename files, add/remove/rename columns) labels Jul 25, 2024
@amahussein amahussein self-assigned this Jul 25, 2024
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
amahussein and others added 2 commits July 26, 2024 00:06
Signed-off-by: Ahmed Hussein (amahussein) <a@ahussein.me>
* Fix node recommendation when CPU cluster cannot be determined

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

* Move cluster cols to config file

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>

---------

Signed-off-by: Partho Sarthi <psarthi@nvidia.com>
@parthosa (Collaborator) left a comment

Thanks @amahussein for fixing these.

@cindyyuanjiang (Collaborator) left a comment

Thanks @amahussein!

@nartal1 (Collaborator) left a comment

Thanks @amahussein !

@amahussein (Collaborator, Author)

Thanks @parthosa for appending to this PR.
Thanks @cindyyuanjiang and @nartal1 for reviewing.

@amahussein amahussein merged commit ed91cc0 into NVIDIA:dev Jul 26, 2024
14 checks passed
@amahussein amahussein deleted the spark-rapids-tools-1221-cost-args branch July 26, 2024 21:48

Successfully merging this pull request may close these issues.

- [TASK] Remove arguments related to cost-savings
- [FEA] Disable grouping applications by name
4 participants