
Databricks 12.2 Support [databricks] #8282

Merged — 75 commits merged into NVIDIA:branch-23.08 on Jun 2, 2023

Conversation

@andygrove (Contributor) commented May 12, 2023

Closes #7876

This PR adds support for Databricks 12.2

SQL Plugin

  • Add a shim for GpuFileFormatDataWriter.createWriteSummary (see the sketch below)
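The createWriteSummary call is shimmed because its signature differs between Apache Spark and the Databricks 12.2 runtime, so each shim source set compiles its own implementation. The following is only a minimal sketch of that pattern under stated assumptions: the object name, argument list, and use of ExecutedWriteSummary are illustrative, not the plugin's actual code.

```scala
package com.nvidia.spark.rapids.shims

import org.apache.spark.sql.execution.datasources.{ExecutedWriteSummary, WriteTaskStats}

// Hypothetical shim object: each Spark/Databricks source set (e.g. spark332db)
// would provide its own copy of this file, matching whatever signature that
// version of FileFormatDataWriter expects. Names and arguments are assumptions.
object CreateWriteSummaryShim {
  def createWriteSummary(stats: Seq[WriteTaskStats]): ExecutedWriteSummary = {
    // Build the write summary the way this particular Spark version expects it.
    ExecutedWriteSummary(updatedPartitions = Set.empty, stats = stats)
  }
}
```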

Delta Lake

Add new Delta Lake shim for 332db, which is essentially a copy of 330db with minor modifications. I have filed a follow-on issue #8320 for using shimplify in this module and refactoring to remove duplicate code.

Changes

  • GpuUpdateCommand: Change the call to DeltaLog.assertRemovable
  • GpuDeleteCommand: Change the call to DeltaLog.assertRemovable
  • GpuMergeIntoCommand: Accommodate changes in the class hierarchy
  • DeltaLogShim: Changes for getting fileFormat and metadata

The following files are duplicated between 330db and 332db without modification:

  • GpuDoAutoCompaction
  • GpuOptimisticTransaction
  • GpuOptimizeExecutor
  • InvariantViolationExceptionShim
  • ShimDeltaUDF

Integration Tests

  • Update some tests that were skipped for 340 to also be skipped for Databricks 12.2

Status / TODO

  • Fix plugin build
  • Fix Delta Lake build
  • Fix integration test failures
  • File follow-on issues

Follow-on issues:

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove self-assigned this May 12, 2023
@revans2 (Collaborator) left a comment:

Just one small comment

jenkins/databricks/build.sh (review thread resolved)
@sameerz added the feature request label May 12, 2023
@andygrove (Contributor, Author) commented:

This was going fine until I ran into this compilation error:

[ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/SparkShims.scala:70: type arguments [org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand] do not conform to method dataWriteCmd's type parameter bounds [INPUT <: org.apache.spark.sql.execution.command.DataWritingCommand]
[ERROR]     Seq(GpuOverrides.dataWriteCmd[CreateDataSourceTableAsSelectCommand](

CreateDataSourceTableAsSelectCommand no longer implements DataWritingCommand.

@andygrove (Contributor, Author) commented:

Never mind — this changed in Spark 3.4.
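For context, the error above is the standard Scala upper-bound violation: dataWriteCmd[INPUT <: DataWritingCommand] can no longer accept CreateDataSourceTableAsSelectCommand as a type argument once that class stops extending DataWritingCommand. A minimal, self-contained illustration with simplified stand-in names (not the plugin's actual classes):

```scala
// Simplified stand-ins for the real Spark classes, for illustration only.
trait DataWritingCommand
class CreateDataSourceTableAsSelectCommand // no longer extends DataWritingCommand

object GpuOverridesSketch {
  // Mirrors the bounded type parameter on the real GpuOverrides.dataWriteCmd.
  def dataWriteCmd[INPUT <: DataWritingCommand](name: String): String = name
}

object Demo extends App {
  // Uncommenting the next line reproduces the reported compile error:
  // "type arguments ... do not conform to method dataWriteCmd's type parameter bounds"
  // GpuOverridesSketch.dataWriteCmd[CreateDataSourceTableAsSelectCommand]("ctas")
  println(GpuOverridesSketch.dataWriteCmd[DataWritingCommand]("ok"))
}
```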

andygrove and others added 9 commits May 31, 2023 14:18
…pids/delta/shims/InvariantViolationExceptionShim.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…transaction/tahoe/rapids/GpuOptimisticTransaction.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…hims/SparkShims.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
This commit fixes a compile failure on Databricks, introduced in NVIDIA#8240.

Partition.files is a `SerializableFileStatus` on Databricks, as opposed
to a `FileStatus` on Apache Spark.

This commit solves the problem, without Shims.

Signed-off-by: MithunR <mythrocks@gmail.com>
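One generic way to tolerate the FileStatus / SerializableFileStatus difference in a single source file, without a per-version shim, is to dispatch on the runtime type and fall back to reflection. This is only a hypothetical sketch under that assumption — not necessarily the approach the commit above takes — and the no-arg length accessor assumed on the Databricks class is itself an assumption.

```scala
import org.apache.hadoop.fs.FileStatus

object FileStatusCompat {
  // Handles Hadoop's FileStatus directly; for any other status-like object
  // (e.g. a SerializableFileStatus, assumed to expose a no-arg `length`
  // accessor) it falls back to reflection, so this one source file can
  // compile against both Apache Spark and the Databricks runtime.
  def fileLength(file: Any): Long = file match {
    case fs: FileStatus => fs.getLen
    case other => other.getClass.getMethod("length").invoke(other).asInstanceOf[Long]
  }
}
```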
@andygrove (Contributor, Author) commented:

build

@NvTimLiu (Collaborator) commented Jun 1, 2023

I got the shuffle_test failure below, only on the Databricks 12.2 runtime, using the latest update of this PR. Is this a known issue? @andygrove

"update_url": "https://clients2.google.com/service/update2/crx",
.444Z] + exec python /home/ubuntu/spark-rapids/integration_tests/runtests.py --/spark-rapids/integration_tests /home/ubuntu/spark-rapids/integration_tests/src/-rfE '' --std_input_path=/home/ubuntu/spark-rapids/integration_tests/src/test/s '' --junitxml=TEST-pytest-1685618364276402574.xml -m shuffle_test --ks --test_type=nightly
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when by the user
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when by the user
njection seed: 1685618364. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to 
 INFO     Executing global initialization tasks before test launches
 INFO     Creating directory /home/ubuntu/spark-rapids/integration_tests/target/24-zkgo/hive with permissions 0o777
 INFO     Skipping findspark init because on xdist master
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when it's pre-configured by the user
 ============================= test session starts ==============================
 platform linux -- Python 3.8.10, pytest-7.3.1, pluggy-1.0.0 -- /usr/bin/python
 cachedir: .pytest_cache
 rootdir: /home/ubuntu/spark-rapids/integration_tests
 configfile: pytest.ini
 plugins: order-1.1.0, xdist-3.3.1
 created: 4/4 workers
 INTERNALERROR> def worker_internal_error(self, node, formatted_error):
 INTERNALERROR>         """
 INTERNALERROR>         pytest_internalerror() was called on the worker.
 INTERNALERROR>     
 INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
 INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
 INTERNALERROR>         here ourselves using the formatted message.
 INTERNALERROR>         """
 INTERNALERROR>         self._active_nodes.remove(node)
 INTERNALERROR>         try:
 INTERNALERROR> >           assert False, formatted_error
 INTERNALERROR> E           AssertionError: Traceback (most recent call last):
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 267, in wrap_session
 INTERNALERROR> E                 config.hook.pytest_sessionstart(session=session)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 265, in __call__
 INTERNALERROR> E                 return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 80, in _hookexec
 INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 60, in _multicall
 INTERNALERROR> E                 return outcome.get_result()
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_result.py", line 60, in get_result
 INTERNALERROR> E                 raise ex[1].with_traceback(ex[2])
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 39, in _multicall
 INTERNALERROR> E                 res = hook_impl.function(*args)
 INTERNALERROR> E               File "/home/ubuntu/spark-rapids/integration_tests/src/main/python/spark_init_internal.py", line 142, in pytest_sessionstart
 INTERNALERROR> E                 _s = _sb.enableHiveSupport() \
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/sql/session.py", line 401, in getOrCreate
 INTERNALERROR> E                 else SparkContext.getOrCreate(sparkConf)
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 555, in getOrCreate
 INTERNALERROR> E                 SparkContext(conf=conf or SparkConf())
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 204, in __init__
 INTERNALERROR> E                 self._do_init(
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 294, in _do_init
 INTERNALERROR> E                 self._jsc = jsc or self._initialize_context(self._conf._jconf)
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 428, in _initialize_context
 INTERNALERROR> E                 return self._jvm.JavaSparkContext(jconf)
 INTERNALERROR> E               File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in __call__
 INTERNALERROR> E                 return_value = get_return_value(
 INTERNALERROR> E               File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
 INTERNALERROR> E                 raise Py4JJavaError(
 INTERNALERROR> E             py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
 INTERNALERROR> E             : java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark332db.RapidsShuffleManager
 INTERNALERROR> E             	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
 INTERNALERROR> E             	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 INTERNALERROR> E             	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 INTERNALERROR> E             	at java.lang.Class.forName0(Native Method)
 INTERNALERROR> E             	at java.lang.Class.forName(Class.java:348)
 INTERNALERROR> E             	at org.apache.spark.util.Utils$.classForName(Utils.scala:263)
 INTERNALERROR> E             	at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:3053)
 INTERNALERROR> E             	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:440)
 INTERNALERROR> E             	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:312)
 INTERNALERROR> E             	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:345)
 INTERNALERROR> E             	at org.apache.spark.SparkContext.<init>(SparkContext.scala:605)
 INTERNALERROR> E             	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:60)
 INTERNALERROR> E             	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 INTERNALERROR> E             	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 INTERNALERROR> E             	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 INTERNALERROR> E             	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 INTERNALERROR> E             	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
 INTERNALERROR> E             	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
 INTERNALERROR> E             	at py4j.Gateway.invoke(Gateway.java:257)
 INTERNALERROR> E             	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
 INTERNALERROR> E             	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
 INTERNALERROR> E             	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
 INTERNALERROR> E             	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
 INTERNALERROR> E             	at java.lang.Thread.run(Thread.java:750)
 INTERNALERROR> E           assert False
 INTERNALERROR> 
 INTERNALERROR> ../../../../.local/lib/python3.8/site-packages/xdist/dsession.py:197: AssertionError
 [gw0] node down: Not properly terminated
 replacing crashed worker gw0
 INTERNALERROR> Traceback (most recent call last):
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 269, in wrap_session
 INTERNALERROR>     session.exitstatus = doit(config, session) or 0
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 323, in _main
 INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 265, in __call__
 INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 80, in _hookexec
 INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 60, in _multicall
 INTERNALERROR>     return outcome.get_result()
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_result.py", line 60, in get_result
 INTERNALERROR>     raise ex[1].with_traceback(ex[2])
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 39, in _multicall
 INTERNALERROR>     res = hook_impl.function(*args)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/xdist/dsession.py", line 122, in pytest_runtestloop
 INTERNALERROR>     self.loop_once()
 [further truncated worker log: findspark initialization and log-level messages]
 ERROR    KeyboardInterrupt while sending command.

@abellina (Collaborator) commented Jun 1, 2023

@andygrove we need to add a shim for RapidsShuffleManager for spark332db. This is why we added the test: last time we merged a shim without realizing this class was missing (@nartal1 for the win!!)
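Each supported Spark version gets its own RapidsShuffleManager class so that spark.shuffle.manager can point at a version-specific name; a missing com.nvidia.spark.rapids.spark332db.RapidsShuffleManager is exactly what surfaces as the ClassNotFoundException in the log above. A minimal sketch of what such a per-shim class typically looks like — the base class name is assumed from the pattern used by other shims and may differ:

```scala
package com.nvidia.spark.rapids.spark332db

import org.apache.spark.SparkConf
import org.apache.spark.sql.rapids.ProxyRapidsShuffleInternalManagerBase

// Version-specific entry point so users can set
//   spark.shuffle.manager=com.nvidia.spark.rapids.spark332db.RapidsShuffleManager
// All behavior lives in the shared base class; this class only pins the name.
sealed class RapidsShuffleManager(conf: SparkConf, isDriver: Boolean)
  extends ProxyRapidsShuffleInternalManagerBase(conf, isDriver)
```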

@andygrove (Contributor, Author) commented:

Thanks @NvTimLiu and @abellina. The shuffle integration tests are now passing for me with the latest commit.

@abellina (Collaborator) commented Jun 1, 2023

@andygrove change looks good to me.

@gerashegalov (Collaborator) commented:

re: RapidsShuffleManager — ordinarily this would have been caught by the Scala test SparkShimSuite, but we haven't prioritized making those suites runnable on DBR (#1320).

@andygrove (Contributor, Author) commented:

build

@andygrove (Contributor, Author) commented:

@revans2 @jlowe This is ready for another round of reviews when you have time

@@ -131,6 +131,7 @@ def generate_dest_data(spark):
@pytest.mark.parametrize("use_cdf", [True, False], ids=idfn)
@pytest.mark.parametrize("partition_columns", [None, ["a"]], ids=idfn)
@pytest.mark.skipif(is_before_spark_320(), reason="Delta Lake writes are not supported before Spark 3.2.x")
@pytest.mark.skipif(is_databricks122_or_later(), reason="https://github.com/NVIDIA/spark-rapids/issues/8423")
Contributor review comment:

Technically we want all these to be xfail'd tests rather than skipped. We xfail tests that are failing due to bugs and should work at some point in the future when those bugs are fixed. We skip tests that we never expect to work. Definitely not must-fix.

@andygrove (Author) replied:

Makes sense.

@revans2 (Collaborator) left a comment:

Looks fine, but I didn't dig too deeply into the Delta code because there is a follow-on issue for that.

@andygrove (Contributor, Author) commented:

build

1 similar comment
@andygrove (Contributor, Author) commented:

build

@andygrove merged commit 2ed3523 into NVIDIA:branch-23.08 on Jun 2, 2023
@andygrove deleted the 332db branch on June 2, 2023 at 20:41
Labels: feature request (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • [FEA] Add initial support for Databricks 12.2 ML LTS
9 participants