
[Feature Request] Support for Spark Connect (aka Delta Connect) #3240

Open
1 of 5 tasks
allisonport-db opened this issue Jun 7, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@allisonport-db
Collaborator

Feature request

Which Delta project/connector is this regarding?

  • [x] Spark
  • [ ] Standalone
  • [ ] Flink
  • [ ] Kernel
  • [ ] Other (fill in here)

Overview

Spark Connect is a new initiative in Apache Spark that adds a decoupled client-server infrastructure, allowing Spark applications to connect remotely to a Spark server and run SQL / DataFrame operations. We want to develop what we're calling "Delta Connect" so that Delta operations can also be run from applications using this client-server mode.

Further details

These are the critical user journeys (CUJs) we would like to support:

Server

The server is packaged into the io.delta:delta-spark-connect-server_2.13 package; installing this package automatically installs the io.delta:delta-spark-connect-common_2.13 package.

sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.13:4.0.0,io.delta:delta-spark-connect-server_2.13:4.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
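
For a quick sanity check that the server is reachable, a plain Spark Connect client session can be used even before installing any Delta client packages. This is just a minimal sketch, assuming pyspark 4.0.0 is installed on the client and the server is listening on the default port 15002:

from pyspark.sql import SparkSession

# Connect to the locally running Spark Connect server (default port 15002).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# A trivial query executed on the server to confirm it is reachable.
spark.range(5).show()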

Scala Client

The client is packaged into the io.delta:delta-spark-connect-client_2.13 package; installing this package automatically installs the io.delta:delta-spark-connect-common_2.13 package.

export SPARK_REMOTE="sc://localhost:15002"
spark-connect-repl --packages io.delta:delta-spark-connect-client_2.13:4.0.0

The delta-spark-connect-client_2.13 package uses the exact same class and package names as the delta-spark_2.13 package. Therefore the exact same code can be used as before.

import io.delta.tables._
import org.apache.spark.sql.functions._

val deltaTable = DeltaTable.forName(spark, "my_table")

deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100")))

Python Client

The Delta Connect Python client is included in the same PyPI package as Delta Spark.

pip install pyspark==4.0.0
pip install delta-spark==4.0.0

There is no difference in usage compared to the classic way. We just need to pass in a remote SparkSession (instead of a local one) to the DeltaTable API.

from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

deltaTable = DeltaTable.forName(spark, "my_table")

deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = { "id": expr("id + 100") })
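
Other DeltaTable operations are expected to work the same way over the remote session. A minimal sketch, assuming the same API surface is supported over Connect (the table name and columns below are just illustrative):

from delta.tables import DeltaTable
from pyspark.sql.functions import col

# Reuse the remote SparkSession created above.
deltaTable = DeltaTable.forName(spark, "my_table")

# Delete rows matching a predicate.
deltaTable.delete(col("id") < 10)

# Inspect the table's commit history as a DataFrame.
deltaTable.history().select("version", "operation", "timestamp").show()
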
@dmatrix
Contributor

dmatrix commented Dec 12, 2024

@allisonport-db There are a couple of issues with the above commands for starting the Connect server.

sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.13:4.0.0,io.delta:delta-spark-connect-server_2.13:4.0.0 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"

The package org.apache.spark:spark-connect_2.13:4.0.0 does not exist in the Spark 4.0.0-preview1 download, so you'll never be able to launch the Connect server locally with the command above. These are the jar files in spark-4.0.0-preview1-bin-hadoop3/jars:

spark-catalyst_2.13-4.0.0-preview1.jar
spark-common-utils_2.13-4.0.0-preview1.jar
spark-core_2.13-4.0.0-preview1.jar
spark-graphx_2.13-4.0.0-preview1.jar
spark-hive-thriftserver_2.13-4.0.0-preview1.jar
spark-hive_2.13-4.0.0-preview1.jar
spark-kubernetes_2.13-4.0.0-preview1.jar
spark-kvstore_2.13-4.0.0-preview1.jar
spark-launcher_2.13-4.0.0-preview1.jar
spark-mllib-local_2.13-4.0.0-preview1.jar
spark-mllib_2.13-4.0.0-preview1.jar
spark-network-common_2.13-4.0.0-preview1.jar
spark-network-shuffle_2.13-4.0.0-preview1.jar
spark-repl_2.13-4.0.0-preview1.jar
spark-sketch_2.13-4.0.0-preview1.jar
spark-sql-api_2.13-4.0.0-preview1.jar
spark-sql_2.13-4.0.0-preview1.jar
spark-streaming_2.13-4.0.0-preview1.jar
spark-tags_2.13-4.0.0-preview1.jar
spark-unsafe_2.13-4.0.0-preview1.jar
spark-variant_2.13-4.0.0-preview1.jar
spark-yarn_2.13-4.0.0-preview1.jar

Second, the pip install commands should pin the specific dev/rc versions for now:

pip install pyspark==4.0.0.dev1
pip install delta-spark==4.0.0rc1

@dmatrix
Contributor

dmatrix commented Dec 12, 2024

The following command

sbin/start-connect-server.sh \
  --packages org.apache.spark:spark-connect_2.13:4.0.0-preview1,io.delta:delta-connect-server_2.13:4.0.0rc1,io.delta:delta-connect-common_2.13:4.0.0rc1,com.google.protobuf:protobuf-java:3.25.1 \
  --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
  --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
  --conf "spark.connect.extensions.relation.classes=org.apache.spark.sql.connect.delta.DeltaRelationPlugin" \
  --conf "spark.connect.extensions.command.classes=org.apache.spark.sql.connect.delta.DeltaCommandPlugin"

will start the Spark Connect server on localhost. Thereafter, you should be able to use the delta-spark API for DataFrame operations, as in the sketch below.
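
For example, a remote session against that server should be able to run both plain DataFrame operations and Delta operations. A minimal sketch (the connect_demo table name and data are just illustrative, and assume write access to the default catalog):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
from pyspark.sql.functions import expr

# Connect to the Spark Connect server started above.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Plain DataFrame operation executed on the server.
spark.range(10).withColumn("doubled", expr("id * 2")).show()

# Create a small Delta table and run a Delta-specific operation on it.
spark.range(10).write.format("delta").saveAsTable("connect_demo")
DeltaTable.forName(spark, "connect_demo").update(
    condition=expr("id % 2 == 0"),
    set={"id": expr("id + 100")},
)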

@RickLeite

Can't wait for this! ❤️
