
Databricks 12.2 Support [databricks] #8282

Merged — 75 commits merged into NVIDIA:branch-23.08 on Jun 2, 2023

Conversation

@andygrove (Contributor) commented May 12, 2023

Closes #7876

This PR adds support for Databricks 12.2

SQL Plugin

  • Add a shim for GpuFileFormatDataWriter.createWriteSummary (see the sketch below)
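The createWriteSummary call is shimmed because its signature differs between Apache Spark and the Databricks 12.2 runtime, so each shim source set compiles its own implementation. The following is only a minimal sketch of that pattern under stated assumptions: the object name, argument list, and use of ExecutedWriteSummary are illustrative, not the plugin's actual code.

```scala
package com.nvidia.spark.rapids.shims

import org.apache.spark.sql.execution.datasources.{ExecutedWriteSummary, WriteTaskStats}

// Hypothetical shim object: each Spark/Databricks source set (e.g. spark332db)
// would provide its own copy of this file, matching whatever signature that
// version of FileFormatDataWriter expects. Names and arguments are assumptions.
object CreateWriteSummaryShim {
  def createWriteSummary(stats: Seq[WriteTaskStats]): ExecutedWriteSummary = {
    // Build the write summary the way this particular Spark version expects it.
    ExecutedWriteSummary(updatedPartitions = Set.empty, stats = stats)
  }
}
```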

Delta Lake

Add new Delta Lake shim for 332db, which is essentially a copy of 330db with minor modifications. I have filed a follow-on issue #8320 for using shimplify in this module and refactoring to remove duplicate code.

Changes

  • GpuUpdateCommand: Change the call to DeltaLog.assertRemovable
  • GpuDeleteCommand: Change the call to DeltaLog.assertRemovable
  • GpuMergeIntoCommand: Accommodate changes in the class hierarchy
  • DeltaLogShim: Changes for getting fileFormat and metadata

The following files are duplicated between 330db and 332db without modification:

  • GpuDoAutoCompaction
  • GpuOptimisticTransaction
  • GpuOptimizeExecutor
  • InvariantViolationExceptionShim
  • ShimDeltaUDF

Integration Tests

  • Update some tests that were skipped for 340 to also be skipped for Databricks 12.2

Status / TODO

  • Fix plugin build
  • Fix Delta Lake build
  • Fix integration test failures
  • File follow-on issues

Follow-on issues:

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove self-assigned this May 12, 2023
@revans2 (Collaborator) left a comment:

Just one small comment

jenkins/databricks/build.sh (review thread resolved)
@sameerz added the feature request label May 12, 2023
@andygrove (Contributor, Author) commented:

This was going fine until I ran into this compilation error:

[ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330db/scala/com/nvidia/spark/rapids/shims/SparkShims.scala:70: type arguments [org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand] do not conform to method dataWriteCmd's type parameter bounds [INPUT <: org.apache.spark.sql.execution.command.DataWritingCommand]
[ERROR]     Seq(GpuOverrides.dataWriteCmd[CreateDataSourceTableAsSelectCommand](

CreateDataSourceTableAsSelectCommand no longer implements DataWritingCommand.

@andygrove (Contributor, Author) commented:

Never mind — this changed in Spark 3.4.
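For context, the error above is the standard Scala upper-bound violation: dataWriteCmd[INPUT <: DataWritingCommand] can no longer accept CreateDataSourceTableAsSelectCommand as a type argument once that class stops extending DataWritingCommand. A minimal, self-contained illustration with simplified stand-in names (not the plugin's actual classes):

```scala
// Simplified stand-ins for the real Spark classes, for illustration only.
trait DataWritingCommand
class CreateDataSourceTableAsSelectCommand // no longer extends DataWritingCommand

object GpuOverridesSketch {
  // Mirrors the bounded type parameter on the real GpuOverrides.dataWriteCmd.
  def dataWriteCmd[INPUT <: DataWritingCommand](name: String): String = name
}

object Demo extends App {
  // Uncommenting the next line reproduces the reported compile error:
  // "type arguments ... do not conform to method dataWriteCmd's type parameter bounds"
  // GpuOverridesSketch.dataWriteCmd[CreateDataSourceTableAsSelectCommand]("ctas")
  println(GpuOverridesSketch.dataWriteCmd[DataWritingCommand]("ok"))
}
```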

andygrove and others added 9 commits May 31, 2023 14:18
…pids/delta/shims/InvariantViolationExceptionShim.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…transaction/tahoe/rapids/GpuOptimisticTransaction.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
…hims/SparkShims.scala

Co-authored-by: Jason Lowe <jlowe@nvidia.com>
This commit fixes a compile failure on Databricks, introduced in NVIDIA#8240.

Partition.files is a `SerializableFileStatus` on Databricks, as opposed
to a `FileStatus` on Apache Spark.

This commit solves the problem, without Shims.

Signed-off-by: MithunR <mythrocks@gmail.com>
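One generic way to tolerate the FileStatus / SerializableFileStatus difference in a single source file, without a per-version shim, is to dispatch on the runtime type and fall back to reflection. This is only a hypothetical sketch under that assumption — not necessarily the approach the commit above takes — and the no-arg length accessor assumed on the Databricks class is itself an assumption.

```scala
import org.apache.hadoop.fs.FileStatus

object FileStatusCompat {
  // Handles Hadoop's FileStatus directly; for any other status-like object
  // (e.g. a SerializableFileStatus, assumed to expose a no-arg `length`
  // accessor) it falls back to reflection, so this one source file can
  // compile against both Apache Spark and the Databricks runtime.
  def fileLength(file: Any): Long = file match {
    case fs: FileStatus => fs.getLen
    case other => other.getClass.getMethod("length").invoke(other).asInstanceOf[Long]
  }
}
```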
@andygrove (Contributor, Author) commented:

build

@NvTimLiu (Collaborator) commented Jun 1, 2023

I got the shuffle_test failure below, only on the Databricks 12.2 runtime, using the latest update of this PR. Is this a known issue? @andygrove

"update_url": "https://clients2.google.com/service/update2/crx",
.444Z] + exec python /home/ubuntu/spark-rapids/integration_tests/runtests.py --/spark-rapids/integration_tests /home/ubuntu/spark-rapids/integration_tests/src/-rfE '' --std_input_path=/home/ubuntu/spark-rapids/integration_tests/src/test/s '' --junitxml=TEST-pytest-1685618364276402574.xml -m shuffle_test --ks --test_type=nightly
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 Setting default log level to "WARN".
 To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use ).
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when by the user
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when by the user
njection seed: 1685618364. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to 
 INFO     Executing global initialization tasks before test launches
 INFO     Creating directory /home/ubuntu/spark-rapids/integration_tests/target/24-zkgo/hive with permissions 0o777
 INFO     Skipping findspark init because on xdist master
 INFO     Initializing findspark because running with xdist worker
 INFO     Checking if add_jars/packages to findspark required
 INFO     SPARK_EVENTLOG_ENABLED is ignored for non-local Spark master and when it's pre-configured by the user
 ============================= test session starts ==============================
 platform linux -- Python 3.8.10, pytest-7.3.1, pluggy-1.0.0 -- /usr/bin/python
 cachedir: .pytest_cache
 rootdir: /home/ubuntu/spark-rapids/integration_tests
 configfile: pytest.ini
 plugins: order-1.1.0, xdist-3.3.1
 created: 4/4 workers
 INTERNALERROR> def worker_internal_error(self, node, formatted_error):
 INTERNALERROR>         """
 INTERNALERROR>         pytest_internalerror() was called on the worker.
 INTERNALERROR>     
 INTERNALERROR>         pytest_internalerror() arguments are an excinfo and an excrepr, which can't
 INTERNALERROR>         be serialized, so we go with a poor man's solution of raising an exception
 INTERNALERROR>         here ourselves using the formatted message.
 INTERNALERROR>         """
 INTERNALERROR>         self._active_nodes.remove(node)
 INTERNALERROR>         try:
 INTERNALERROR> >           assert False, formatted_error
 INTERNALERROR> E           AssertionError: Traceback (most recent call last):
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 267, in wrap_session
 INTERNALERROR> E                 config.hook.pytest_sessionstart(session=session)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 265, in __call__
 INTERNALERROR> E                 return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 80, in _hookexec
 INTERNALERROR> E                 return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 60, in _multicall
 INTERNALERROR> E                 return outcome.get_result()
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_result.py", line 60, in get_result
 INTERNALERROR> E                 raise ex[1].with_traceback(ex[2])
 INTERNALERROR> E               File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 39, in _multicall
 INTERNALERROR> E                 res = hook_impl.function(*args)
 INTERNALERROR> E               File "/home/ubuntu/spark-rapids/integration_tests/src/main/python/spark_init_internal.py", line 142, in pytest_sessionstart
 INTERNALERROR> E                 _s = _sb.enableHiveSupport() \
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/sql/session.py", line 401, in getOrCreate
 INTERNALERROR> E                 else SparkContext.getOrCreate(sparkConf)
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 555, in getOrCreate
 INTERNALERROR> E                 SparkContext(conf=conf or SparkConf())
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 204, in __init__
 INTERNALERROR> E                 self._do_init(
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 294, in _do_init
 INTERNALERROR> E                 self._jsc = jsc or self._initialize_context(self._conf._jconf)
 INTERNALERROR> E               File "/databricks/spark/python/pyspark/context.py", line 428, in _initialize_context
 INTERNALERROR> E                 return self._jvm.JavaSparkContext(jconf)
 INTERNALERROR> E               File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in __call__
 INTERNALERROR> E                 return_value = get_return_value(
 INTERNALERROR> E               File "/databricks/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
 INTERNALERROR> E                 raise Py4JJavaError(
 INTERNALERROR> E             py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
 INTERNALERROR> E             : java.lang.ClassNotFoundException: com.nvidia.spark.rapids.spark332db.RapidsShuffleManager
 INTERNALERROR> E             	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
 INTERNALERROR> E             	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 INTERNALERROR> E             	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 INTERNALERROR> E             	at java.lang.Class.forName0(Native Method)
 INTERNALERROR> E             	at java.lang.Class.forName(Class.java:348)
 INTERNALERROR> E             	at org.apache.spark.util.Utils$.classForName(Utils.scala:263)
 INTERNALERROR> E             	at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:3053)
 INTERNALERROR> E             	at org.apache.spark.SparkEnv$.create(SparkEnv.scala:440)
 INTERNALERROR> E             	at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:312)
 INTERNALERROR> E             	at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:345)
 INTERNALERROR> E             	at org.apache.spark.SparkContext.<init>(SparkContext.scala:605)
 INTERNALERROR> E             	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:60)
 INTERNALERROR> E             	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 INTERNALERROR> E             	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
 INTERNALERROR> E             	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 INTERNALERROR> E             	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
 INTERNALERROR> E             	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
 INTERNALERROR> E             	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
 INTERNALERROR> E             	at py4j.Gateway.invoke(Gateway.java:257)
 INTERNALERROR> E             	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
 INTERNALERROR> E             	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
 INTERNALERROR> E             	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:195)
 INTERNALERROR> E             	at py4j.ClientServerConnection.run(ClientServerConnection.java:115)
 INTERNALERROR> E             	at java.lang.Thread.run(Thread.java:750)
 INTERNALERROR> E           assert False
 INTERNALERROR> 
 INTERNALERROR> ../../../../.local/lib/python3.8/site-packages/xdist/dsession.py:197: AssertionError
 [gw0] node down: Not properly terminated
 replacing crashed worker gw0
 INTERNALERROR> Traceback (most recent call last):
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 269, in wrap_session
 INTERNALERROR>     session.exitstatus = doit(config, session) or 0
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/_pytest/main.py", line 323, in _main
 INTERNALERROR>     config.hook.pytest_runtestloop(session=session)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_hooks.py", line 265, in __call__
 INTERNALERROR>     return self._hookexec(self.name, self.get_hookimpls(), kwargs, firstresult)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_manager.py", line 80, in _hookexec
 INTERNALERROR>     return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 60, in _multicall
 INTERNALERROR>     return outcome.get_result()
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_result.py", line 60, in get_result
 INTERNALERROR>     raise ex[1].with_traceback(ex[2])
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/pluggy/_callers.py", line 39, in _multicall
 INTERNALERROR>     res = hook_impl.function(*args)
 INTERNALERROR>   File "/home/ubuntu/.local/lib/python3.8/site-packages/xdist/dsession.py", line 122, in pytest_runtestloop
 INTERNALERROR>     self.loop_once()
 [further truncated worker log: findspark initialization and log-level messages]
 ERROR    KeyboardInterrupt while sending command.

@abellina (Collaborator) commented Jun 1, 2023

@andygrove we need to add a shim for RapidsShuffleManager for spark332db. This is why we added the test: last time we merged a shim without realizing this class was missing (@nartal1 for the win!!)
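Each supported Spark version gets its own RapidsShuffleManager class so that spark.shuffle.manager can point at a version-specific name; a missing com.nvidia.spark.rapids.spark332db.RapidsShuffleManager is exactly what surfaces as the ClassNotFoundException in the log above. A minimal sketch of what such a per-shim class typically looks like — the base class name is assumed from the pattern used by other shims and may differ:

```scala
package com.nvidia.spark.rapids.spark332db

import org.apache.spark.SparkConf
import org.apache.spark.sql.rapids.ProxyRapidsShuffleInternalManagerBase

// Version-specific entry point so users can set
//   spark.shuffle.manager=com.nvidia.spark.rapids.spark332db.RapidsShuffleManager
// All behavior lives in the shared base class; this class only pins the name.
sealed class RapidsShuffleManager(conf: SparkConf, isDriver: Boolean)
  extends ProxyRapidsShuffleInternalManagerBase(conf, isDriver)
```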

@andygrove (Contributor, Author) commented:

Thanks @NvTimLiu and @abellina. The shuffle integration tests are now passing for me with the latest commit.

@abellina (Collaborator) commented Jun 1, 2023

@andygrove change looks good to me.

@gerashegalov (Collaborator) commented:

re: RapidsShuffleManager — ordinarily this would have been caught by the Scala test SparkShimSuite, but we haven't prioritized making those suites runnable on DBR (#1320).

@andygrove (Contributor, Author) commented:

build

@andygrove (Contributor, Author) commented:

@revans2 @jlowe This is ready for another round of reviews when you have time

@@ -131,6 +131,7 @@ def generate_dest_data(spark):
@pytest.mark.parametrize("use_cdf", [True, False], ids=idfn)
@pytest.mark.parametrize("partition_columns", [None, ["a"]], ids=idfn)
@pytest.mark.skipif(is_before_spark_320(), reason="Delta Lake writes are not supported before Spark 3.2.x")
@pytest.mark.skipif(is_databricks122_or_later(), reason="https://github.com/NVIDIA/spark-rapids/issues/8423")
Contributor review comment:

Technically we want all these to be xfail'd tests rather than skipped. We xfail tests that are failing due to bugs and should work at some point in the future when those bugs are fixed. We skip tests that we never expect to work. Definitely not must-fix.

@andygrove (Author) replied:

Makes sense.

@revans2 (Collaborator) left a comment:

Looks fine, but I didn't dig too deeply into the Delta code because there is a follow-on issue for that.

@andygrove (Contributor, Author) commented:

build

1 similar comment
@andygrove (Contributor, Author) commented:

build

@andygrove merged commit 2ed3523 into NVIDIA:branch-23.08 on Jun 2, 2023
@andygrove deleted the 332db branch on June 2, 2023 at 20:41
Labels: feature request (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • [FEA] Add initial support for Databricks 12.2 ML LTS
9 participants