How to share an HDInsight Hive Metastore with Azure Databricks
- Create a SQL Database (PaaS) server in Azure
- Create an empty database under that server
- Create an HDInsight cluster in Azure and point the external Hive metastore to your database
- Delete the HDInsight cluster after it has been created (cluster creation populates the Hive metastore schema in your database, and that schema survives cluster deletion)
- Create a Databricks workspace in Azure
- Launch the workspace and open a notebook
- Run Step 1 in a cell (the dbutils.fs.mkdirs call in the code below)
- Run Step 2 in a cell (the dbutils.fs.put call; replace your server name, database name, username, and password in the placeholders)
- Terminate your Databricks cluster
- Restart your cluster (init scripts run at cluster startup, so the metastore configuration takes effect on restart)
- In a SQL notebook you can run SHOW TABLES (you should see the sample table from HDInsight)
- NOTE: You will not be able to SELECT data from this table yet. The table most likely points to blob or ADLS storage. Create all your tables in HDInsight/Databricks as EXTERNAL and make sure the LOCATION property points to your blob or ADLS path. Your Databricks cluster will also need to be authorized to access ADLS or blob; a minimal sketch follows this list.
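As a worked example of the verification and external-table steps above, here is a minimal Scala notebook sketch. It assumes blob storage; the storage account, key, container, table name, and path are hypothetical placeholders, not values from this guide:

// Authorize the cluster to read the underlying blob storage.
// <<STORAGE-ACCOUNT>> and <<STORAGE-ACCOUNT-KEY>> are hypothetical placeholders.
spark.conf.set(
  "fs.azure.account.key.<<STORAGE-ACCOUNT>>.blob.core.windows.net",
  "<<STORAGE-ACCOUNT-KEY>>")

// Confirm the shared metastore is visible; the sample table created by
// HDInsight should appear in the listing.
spark.sql("SHOW TABLES").show()

// Create new tables as EXTERNAL with an explicit LOCATION so that HDInsight
// and Databricks both resolve the same underlying data. The schema and path
// here are illustrative only.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS sales_raw (id INT, amount DOUBLE)
  STORED AS PARQUET
  LOCATION 'wasbs://<<CONTAINER>>@<<STORAGE-ACCOUNT>>.blob.core.windows.net/sales_raw'
""")

The Step 1 and Step 2 code referenced above follows: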
// Step 1: create the directory that will hold the init script
dbutils.fs.mkdirs("dbfs:/databricks/init/")

// Step 2: write the init script (replace the <<FILL-ME-IN-...>> placeholders below)
dbutils.fs.put(
"/databricks/init/external-metastore.sh",
"""#!/bin/sh
|# Loads environment variables to determine the correct JDBC driver to use.
|source /etc/environment
|# Quoting the label (i.e. EOF) with single quotes to disable variable interpolation.
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark.conf
|[driver] {
| # Hive specific configuration options for metastores in the local mode.
| "spark.hadoop.javax.jdo.option.ConnectionURL" = "jdbc:sqlserver://<<FILLL-ME-IN-SERVERNAME>>.database.windows.net:1433;database=<<FILLL-ME-IN-DATABASE-NAME>>;encrypt=true;trustServerCertificate=true;create=false;loginTimeout=300"
| "spark.hadoop.javax.jdo.option.ConnectionUserName" = "<<FILLL-ME-IN-USER>>"
| "spark.hadoop.javax.jdo.option.ConnectionPassword" = "<<FILLL-ME-IN-PASSWORD>>"
| "hive.metastore.schema.verification.record.version" = "true"
| "spark.sql.hive.metastore.jars" = "maven"
| "hive.metastore.schema.verification" = "true"
| "spark.sql.hive.metastore.version" = "2.1.1"
|EOF
|# Add the JDBC driver separately, since variable expansion must be used to choose
|# the correct driver version.
|cat << EOF >> /databricks/driver/conf/00-custom-spark.conf
| "spark.hadoop.javax.jdo.option.ConnectionDriverName" = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
|}
|EOF
|""".stripMargin,
overwrite = true
)
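After running the cell, you can sanity-check the generated script before restarting the cluster; dbutils.fs.head returns the beginning of a file as a string:

// Print the generated init script so the filled-in placeholders and heredoc
// structure can be reviewed before the cluster restart.
println(dbutils.fs.head("dbfs:/databricks/init/external-metastore.sh"))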