This is an example of how SBT can be used to build a project that uses the jars from databricks-connect. Sometimes this is required because the Databricks runtime (DBR) has more functionality than open source Spark. In most cases, however, it's enough to build a project against OSS Spark dependencies and declare them as provided (shown as an example in the `build.sbt`, as commented-out code).
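A minimal sketch of what that commented-out, OSS-Spark-only setup in `build.sbt` could look like is shown below; the Spark and Scala versions here are assumptions for illustration, not taken from this repository:

```scala
// Hypothetical sketch: build against OSS Spark and mark the dependencies
// as "provided" so they are not packaged into the resulting jar.
scalaVersion := "2.12.12"  // assumed Scala version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.1" % "provided",  // assumed Spark version
  "org.apache.spark" %% "spark-sql"  % "3.0.1" % "provided"
)
```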
To build the project, the following setup steps are necessary:
- Install Databricks Connect & Databricks CLI (preferably into a separate virtual Python/Conda environment - in this example, with the name `test`. For DBR 6.x you must use Python 3.7!):

  ```sh
  conda create --name test python=3.8
  conda activate test
  pip install -U databricks-connect==7.3 databricks-cli
  export DBCONNECT_JARS=$(databricks-connect get-jar-dir)
  export SPARK_HOME=$(dirname $DBCONNECT_JARS)
  ```
- Configure Databricks Connect as described in the documentation - this is required if you want to run the artifact from your machine
- (optional) Configure Databricks CLI as described in the documentation
To compile & package this project you need to do the following:
- (optional) Set the location of the Databricks Connect jars. It can be configured in several ways, tried in the following order (a sketch of how this lookup could be implemented in `build.sbt` is shown after this list):
  - SBT will try to use the path from the `DBCONNECT_JARS` environment variable set above.
  - Then it will try to get the path from the `DbConnectJars` system property - that's useful when you're running the project from the IDE. It can be set in the `.sbtopts` file, for example, as `-DDbConnectJars=...` (execute the following shell command in the root directory of the project to add this definition to the `.sbtopts` file: `echo "-DDbConnectJars=$(databricks-connect get-jar-dir)" | tee -a .sbtopts`).
  - Then it will try to execute `databricks-connect get-jar-dir` if `databricks-connect` is on the `PATH`.
  - Finally, it will take the path from the `jarsPathManual` variable defined in `build.sbt` - open it in an editor and set this variable to the path to the jars obtained via `databricks-connect get-jar-dir`.
- Execute `sbt clean package` to build the project
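A rough sketch of how the jar-location lookup described above could be wired up in `build.sbt` is shown below; the ordering and the `jarsPathManual` name follow the description, but the actual implementation in this repository may differ:

```scala
// Hypothetical sketch of the Databricks Connect jar-path resolution chain.
import scala.sys.process._
import scala.util.Try

// Manual fallback - edit this if none of the automatic lookups work.
val jarsPathManual = ""

val dbConnectJarsDir: String =
  sys.env.get("DBCONNECT_JARS")                                      // 1. environment variable
    .orElse(sys.props.get("DbConnectJars"))                          // 2. -DDbConnectJars=... (e.g. from .sbtopts)
    .orElse(Try("databricks-connect get-jar-dir".!!.trim).toOption)  // 3. databricks-connect on the PATH
    .getOrElse(jarsPathManual)                                       // 4. manual value set in build.sbt

// Pick up all jars from that directory as unmanaged dependencies.
unmanagedBase := new java.io.File(dbConnectJarsDir)
```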
To run the built project from the command line, build it with `sbt package` and execute the following command:

```sh
$SPARK_HOME/bin/spark-submit --class net.alexott.demos.databricks.SimpleSQL target/scala-2.12/dbconnect-sbt_2.12-0.1.jar
```
In some cases you need to set the environment variable `DEBUG_IGNORE_VERSION_MISMATCH` to the value `1`, because on `spark-submit` databricks-connect checks its version against the cluster's DBR version and fails if they don't match. This could be a problem because databricks-connect isn't published for every DBR version. You just need to make sure that you're using databricks-connect 6.x with DBR 6.x, and similarly for the 7.x versions.
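For context, the class passed to `--class` is an ordinary Spark application. A minimal, hypothetical sketch of what such a class could look like is shown below (it is an illustration, not necessarily the actual `SimpleSQL` code in this repository); with Databricks Connect configured, `getOrCreate()` connects to the remote cluster instead of starting a local Spark:

```scala
package net.alexott.demos.databricks

import org.apache.spark.sql.SparkSession

// Hypothetical minimal Spark application of the kind submitted above.
object SimpleSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SimpleSQL")
      .getOrCreate()

    // Run a trivial SQL query and print the result.
    spark.sql("SELECT 1 AS test").show()

    spark.stop()
  }
}
```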
The Ultimate edition of IntelliJ IDEA has built-in support for submitting Spark jobs via spark-submit, so refer to its documentation on how to configure it.
For the Community Edition of IDEA, we'll need to force the explicit submission of the Spark job by using the `SparkSubmit` class. To configure this, go to "Run" > "Edit Configurations..." and change the settings as follows:
- Main class: `org.apache.spark.deploy.SparkSubmit`
- VM options: `-cp $Classpath$:$SPARK_HOME`
- Program arguments: `--class net.alexott.demos.databricks.SimpleSQL target/scala-2.12/dbconnect-sbt_2.12-0.1.jar` - the `.jar` file should be built before execution, so it makes sense to hook `sbt package` into the "Before launch" configuration
- Environment variables: `SPARK_HOME=...` (put the value of `SPARK_HOME` defined above), and possibly `DEBUG_IGNORE_VERSION_MISMATCH=1` to allow running on "incompatible" clusters.
After that you can execute the Spark job directly from IDEA.