Releases: NVIDIA/spark-rapids-tools

v24.08.2

10 Sep 21:25

Changes

User Tools

  • Add end-to-end behavioural tests for the python CLI (#1313)
  • Add documentation for qualx plugins (#1337)
  • Allow spark dependency to be configured dynamically (#1326)
  • Follow-up 1318: Fix QualX fallback with default speedup and duration columns (#1330)
  • Updated models for EMR NDS-H dataset (#1331)

Core

  • [FEA] Add total core seconds in Qualification core tool output (#1320)
  • Add support to MaxBy and MinBy in Qualification tool (#1335)
  • Add safeguards to prevent older attempts from generating metrics output in Scala Tool (#1324)
  • Sync up DAYTIME and YEARMONTH fields with CSV plugin files (#1328)

Miscellaneous

  • Update signoff usage [skip ci] (#1332)

v24.08.1

04 Sep 01:06

Changes

User Tools

  • [DOC] spark_rapids CLI help cmd still shows cost savings (#1317)
  • Fix Qualification and Profiling tools CLI argument shorthands (#1312)
  • Raise error for enum creation from invalid string values (#1300)
  • Append HADOOP_CONF_DIR to the tools CLASSPATH execution cmd (#1308)
  • Fix key error and cross-join error during qualx evaluate (#1298)
  • Qual tool: Print more useful log messages when failures happen downloading dependencies (#1292)
  • Fix --help text for custom_model_file option (#1285)

Core

  • Remove legacy SpeedupFactor from core output files (#1318)
  • Mark decimalsum as supported in Qualification tool (#1323)
  • Mark SMJ as unsupported operator for corner cases in left join (#1309)
  • Remove arguments and code related to the html-report (#1311)
  • Handle SparkRapidsBuildInfoEvent in GPU event logs (#1203)
  • Enable recursive search for event logs by default and optional --no-recursion flag (#1297)
  • Qualification tool support filtering by a filesystem time range (#1299)
  • Skip generating timeline for stages that do not have completion time (#1290)
  • Save core tools logs to output log file (#1269)
  • Qualification tool - Add option to filter by minimum event log size (#1291)
  • Include exception message for unknown app status in core tool (#1281)

Miscellaneous

  • Remove restricted google sheets link and outdated TCO section (#1289)

v24.08.0

13 Aug 02:52

Changes

User Tools

  • Remove calculation of gpu cluster recommendation from python tool when cluster argument is passed (#1278)
  • Remove unused argument --target_platform in Python Tool (#1279)
  • Qualification tool: Add output stats file for Execs(operators) (#1225)
  • Include GPU information in the cluster recommendation for Dataproc and OnPrem (#1265)
  • Remove speedup based recommendation column from qual_summary csv (#1268)
  • Fix prediction CSV files for multiple qual directories (#1267)
  • Clean up tools after removing CLI dependency (#1256)
  • Rename cluster shape columns to use 'worker' prefix in the output files and rename metadata file (#1258)
  • Remove CLI dependency in Dataproc _pull_gpu_hw_info implementation (#1245)
  • Replace split_nds with split_train_val (#1252)
  • Update xgboost models and metrics (#1244)
  • Add footnotes for config recommendations and speedup category in top candidate view (#1243)
  • [BUG] Update Dataproc instance catalog for n1 series GPU info (#1242)
  • Improvements in Cluster Config Recommender (#1241)
  • Improve console output from python tool for failed/gpu/photon event logs (#1235)
  • [FEA] Generate and use instance description file for Databricks-Azure platform (#1232)
  • Remove arguments related to cost-savings (#1230)
  • Updated models for latest databricks-aws datasets (#1231)
  • Refactor QualX for Linter and Test Compatibility (#1228)
  • Generate summary metadata file and fix node recommendation in python (#1216)
  • [FEA] Remove gcloud CLI dependency for Dataproc platform (#1223)
  • Updated models for latest dataproc eventlogs (#1226)
  • Remove estimation-model column from qualification summary (#1220)
  • Add option to add features.csv files to training set (#1212)
  • Disable cost saving functionality (#1218)
  • [FEA] Remove CLI dependency for EMR and Databricks-AWS platforms in user tool (#1196)
  • Fix some basic pylint errors in qualx code (#1210)
  • Qual tool: base tuning recommendations on CPU event logs, coherently recommend tunings and node setup, and infer cluster from event log (#1188)
  • Add shap command to internal CLI for debugging (#1197)
  • Add internal CLI to generate instance descriptions for CSPs (#1137)
  • [FEA] Support custom XGBoost model file via user tools CLI (#1184)
  • Updated models for new training data (#1186)
  • Add evaluate_summary command to internal CLI (#1185)
  • [DOC] Fix broken link to qualX docs and update python prerequisites (#1180)
  • Bump to certifi-2024.7.4 and urllib3-1.26.19 (#1173)
  • Disable UI-HTML report by default in Qualification tool (#1168)
  • Fix parsing App IDs inside metrics directory in QualX (#1167)
  • Refactor Databricks-AWS Qual tool to cache and process pricing info from DB website (#1141)
  • Add plugin mechanism for dataset-specific preprocessing in qualx (#1148)
  • Unsupported op logic should read action column from qual's output (#1150)
  • Update qualx readme for training (#1140)
  • Disable pylint-unreachable code in tox.ini (#1145)

Core

  • Include GPU information in the cluster recommendation for Dataproc and OnPrem (#1265)
  • [TASK] Optimize the storage of accumulables in core tools (#1263)
  • Sync GetJsonObject support with Rapids-Plugin (#1266)
  • Do not create new StageInfo object (#1261)
  • [FEA] Add support for map_from_arrays in qualification tools (#1248)
  • Rename cluster shape columns to use 'worker' prefix in the output files and rename metadata file (#1258)
  • Fix stage level metrics output csv file (#1251)
  • Handle event logs with wildcards in status report generation (#1237)
  • Fix duplicate records in DataSourceInfo report (#1227)
  • Reduce memory footprint of stageInfo (#1222)
  • Ensure UTF-8 encoding for reading non-english characters (#1211)
  • Sync plugin support for hash-hive and shift operators (#1198)
  • Sync-up the support of parse_url in qualification tool (#1195)
  • Include status information for failed event logs in core tool (#1187)
  • [FEA] Adding Benchmarking classes to evaluate core tools performance (#1169)
  • [BUG] Fix handling of non-english characters in tools output files (#1189)
  • [BUG] Fix java Qual tool handling of --platform argument (#1161)
  • Add all stage metrics to tools output (#1151)
  • Follow-up 1142: remove TODO line (#1146)
  • Mark wholestageCodeGen as shouldRemove when child nodes are removed (#1142)
  • [FEA] Display full failure messages in failed CSV files (#1135)

Miscellaneous

  • Qualification tool: Add option to filter event logs for a maximum file system size (#1275)
  • Qualification tool should print Kryo related recommendations (#1204)
  • Fix header check script to exclude files (#1224)
  • Update header check script for pre-commit hooks (#1219)
  • Follow-up 1189: handle non-english characters in data-output.js (#1208)
  • Update pre-commit hooks to check for headers and white-spaces (#1205)
  • user-tools: Update --help for cluster argument (#1178)
  • Support fine-tuning models (#1174)

v24.06.1

18 Jun 22:44

Changes

User Tools

  • Fix Python runtime error caused by numpy 2.0.0 release (#1130)
  • Disable the spark_rapids bootstrap command (#1114)

Core

  • Handle different exception thrown by incomplete eventlogs (#1124)
  • Include number of executors per node in cluster information (#1119)

v24.06.0

12 Jun 20:07

Changes

User Tools

  • Add support to Python 3.12 (#1111)
  • user-tools: Update log messages (#1110)
  • Enable xgboost prediction model by default (#1108)
  • Add support to Python 3.11 (#1105)
  • Fix nan label issue in training (#1104)
  • Fix qualx app metrics (#1102)
  • Clip appDuration to at least Duration (#1096)
  • Fix missing assignment to savings_recommendations (#1098)
  • Handle QualX behaviour when Qual Tool does not generate any outputs (#1095)
  • Fix internal predict CLI and remove preprocessed argument (#1093)
  • Update QualX to return default speedups and fix App Duration for incomplete apps (#1089)
  • Fix signature error from overlapping merges (#1084)
  • Sync with internal repo; update models (#1083)
  • Reduce the maximum number of Java threads in CLI (#1082)
  • Remove using Profiler metrics for QualX and Heuristics (#1080)
  • Port QualX repo and add CLI for train (#1076)
  • User tools fallback to default zone/region (#1054)
  • Handle missing pricing info for user qual tool on Databricks platforms (#1053)
  • Split job and stage level aggregated metrics into different files (#1050)
  • Skip Cluster Inference when CSP CLIs are missing or not configured (#1035)
  • Store Cluster Shape Recommendation in User Tools Qualification Output (#1005)
  • Fix calculation of unsupported operators stage duration percentage (#1006)
  • Update Databricks Azure qual tool to set env variable for ABFS paths (#1016)
  • Add heuristics using stage spill metrics to skip apps (#1002)
  • Fix failure in github workflow's pylint (#1015)
  • Updating qual validation script to directly use top candidate view recommendation (#1001)

Core

  • Fix typo in Profiler class using qual instead of prof (#1113)
  • Fix missing appEndTime in raw_metrics folder (#1092)
  • Sync tools with plugin newly supported operators (#1066)
  • Fix java Qual tool Autotuner output when GPU device is missing (#1085)
  • Update the Qual tool AutoTuner Heuristics against CPU event logs (#1069)
  • Handling FileNotFound exception in AutoTuner (#1065)
  • Handle metric names from legacy spark (#1052)
  • Split job and stage level aggregated metrics into different files (#1050)
  • Refactor ProfileResult classes to implement new interface design and add CSV output to Qual Tool (#1043)
  • Hook up the auto tuner in the qualification tool (#1039)
  • Profiler should identify the delta log ops and generate views for non-delta logs (#1031)
  • Qualification tool - Handle cancelled jobs and stages better and don't skip the app (#1033)
  • [FEA] Generate Status Report for Profiling Tool (#1012)
  • Fix calculation of unsupported operators stage duration percentage (#1006)
  • Fix potential problems and AQE updates in Qual tool (#1021)
  • Sync supported operators with plugin changes and update default score (#1020)
  • Refactor TaskEnd to be accessible by Q/P tools (#1000)

Miscellaneous

  • Bump requests from 2.31.0 to 2.32.2 in /data_validation (#1077)

v24.04.0

07 May 21:20

Changes

User Tools

  • [FEA] Add CLI to run prediction on estimation_model (#961)
  • Adding SHAP predict values as new output file (#982)
  • Update docs for building to clarify to build in a virtual environment (#976)

Core

  • [BUG] Catch Profiler error when app info is empty (#994)
  • Get stages from sqlId for collecting info for output writer functions (#996)
  • Account for joboverhead time in qualification tool estimation (#992)
  • [Followup] Fix handling of clusterTags and SparkVersion in Q/P Tools (#993)
  • Fix handling of clusterTags and SparkVersion in Q/P Tools (#991)
  • Refactor AppBase to use common AppMetaData between Q/P tools (#983)
  • Refactor Stage info code between Q/P tools (#971)

v24.02.4

30 Apr 17:07

Changes

User Tools

  • Fix Hadoop Azure version to be compatible with Spark-3.5.0 (#975)
  • Add speedup categories in qualification summary output (#958)
  • Improve cluster node initialisation for CSPs (#964)

Core

  • Remove databricks profiling recommendation for dynamicFilePruning (#972)
  • Add AQEShuffleRead and WriteFiles execs to the supportedOps and score files (#963)
  • [FEA] Automate appending new operators to the platform score sheets (#954)
  • Add support for InSubqueryExec Expression (#960)

Miscellaneous

  • Bump dev version to 24.02.4 (#968)
  • Revert versions back to 24.02.3 (#967)

v24.02.3

24 Apr 17:56

Changes

User Tools

  • Cache CLI calls for node instance description (#952)
  • Improve error handling in prediction code (#950)
  • Support dynamic calculation of JVM resources in CLI cmd (#944)
  • Syncup estimation model prediction logic updates (#946)
  • Cluster inference should not run for unsupported platform (#941)
  • Fix invalid values in cluster creation script (#935)
  • Fix core tool doc links and user qualification tool default argument values (#931)
  • Fix gpu cluster recommendation in user tools (#930)
  • Bump idna from 3.4 to 3.7 in /data_validation (#932)
  • Add cluster details in qualification summary output (#921)
  • Refactor find_matches_for_node return values (#920)
  • [FEA] Add and use g5 AWS instances as default for qualification tool output (#898)
  • Add jar argument to spark_rapids CLI (#902)
  • Support driverlog argument in profiler CLI (#897)

Core

  • Followups on handling Photon eventlogs (#953)
  • Sync operators support timestamped 24-04-16 (#951)
  • Add CheckOverflowInTableInsert support: verify absence from physical plan (#942)
  • Fix Notes column in the supported ops CSV files (#933)
  • Improve sync plugin supported CSV python script (#919)
  • Add cluster details in qualification summary output (#921)
  • Add support for unsupported expressions reasons per Exec (#923)
  • Adding more metrics and options for qual validation (#926)
  • Generate cluster details in JSON output (#912)
  • Add Divide and multiple interval expressions as supported (#917)
  • Add support for PythonMapInArrowExec and MapInArrowExec (#913)
  • Re-enable support for GetJsonObject by default (#916)
  • Add support for WindowGroupLimitExec (#906)
  • [FEA] Skip Spark Structured Streaming event logs for Qualification tool (#905)
  • [FEA] Add and use g5 AWS instances as default for qualification tool output (#898)
  • Initial version of qual tool validation script for classification metrics (#903)
  • Fix Delta-core dependency for Spark35+ (#904)
  • Add support for AtomicCreateTableAsSelectExec (#895)
  • Add support for KnownNullable and EphemeralSubstring expressions (#894)
  • Add Support for BloomFilterAggregate and BloomFilterMightContain exprs (#891)
  • [DOC] Update README for sync plugin supported ops script (#893)
  • Add operators to ignore list and update WindowExpr parser (#890)
  • Add support to RoundCeil and RoundFloor expressions (#889)

v24.02.2

27 Mar 20:55

Changes

User Tools

  • Override estimated speedups when estimation model is enabled (#885)
  • [FEA] Make top candidates view as the default view in user-tools (#879)
  • Introduce new csv file containing output for all apps before grouping (#875)
  • Fix calculation of unsupported operators stages duration and update output row (#874)
  • Implement top candidate filter for user tools CLI output (#866)

Core

  • [FEA] Skip Databricks Photon jobs at app level in Qualification tool (#886)
  • [FEA] Add Estimation Model to Qualification CLI (#870)
  • Add rootExecutionID to output csv files (#871)
  • [FEA] Generate updated supported CSV files from plugin repo (#847)
  • Add action column to qual execs output (#859)
  • Extend supportLevels in PluginTypeChecker (#863)
  • Propagate Reason/Notes for operators disabled by default from plugin to Qualification tool unsupported operators csv file (#850)

Miscellaneous

  • Bump default Spark-version to 3.5.0 (#877)
  • Update Github actions version (#876)

v24.02.1

15 Mar 01:10

Changes

User Tools

  • Remove redundant initialization scripts from user tools output (#830)
  • [DOC] Update Databricks Azure user tool setup instructions for output format (#826)
  • Estimate cluster instances and generate cost savings (#803)

Core

  • Fix implementation of processSQLPlanMetrics in Profiler (#853)
  • Deduplicate SQL duration wallclock time for databricks eventlog (#810)
  • Consider additional factors in spark.sql.shuffle.partitions recommendation in Autotuner (#722)
  • Fix case matching error in AutoTuner (#828)
  • Fix ReadSchema in Qualification tool and NPE in Profiling tool (#825)
  • AutoTuner does not process arguments skipList and limitedLogic (#812)