-
Notifications
You must be signed in to change notification settings - Fork 121
Connecting Cosmos DB with PowerBI using spark and databricks premium
A powerfully fun way to visualize your data in Azure Cosmos DB is to use Power BI. While there is an ODBC Driver (refer to Connect to Azure Cosmos DB using BI analytics tools with the ODBC driver), this method requires you to download all of the data from Azure Cosmos DB into Power BI.
To workaround this issue, one technique is to use the azure-cosmosdb-spark
connector which allows you to use Apache Spark as the bridge between Power BI and Azure Cosmos DB. Power BI has direct query capabilities to Apache Spark and with the azure-cosmosdb-spark
connector, you can create direct connectivity from Power BI to Azure Cosmos DB.
Note, these are alpha working instructions and we will over time simplify how to do this so it will be easier for you to configure this.
You will need the following components
- Power BI Desktop,
- A databricks premium cluster Azure Databricks Spark, and
- an Azure Cosmos DB subscription.
-
Create a databricks cluster
-
Attach the Cosmos DB connector to the data bricks cluster
Now that your Databricks cluster can automatically connect to your Azure Cosmos DB cluster, you will need to build a table - the source of which is a Cosmos DB query. This way from the Power BI perspective, it is only connecting to a Spark SQL table (not realizing that it is a Cosmos DB data source underneath).
The easiest way to do this is to create a notebook that creates the sql table
you can run a command to create an external table such as the one below:
spark.sql("create table flights using com.microsoft.azure.cosmosdb.spark options (endpoint 'https://doctorwho.documents.azure.com:443/', database 'DepartureDelays', collection 'flights_pcoll', masterkey '$masterkey')")
Thus if you run the show tables
command, it should look something like this:
spark.sql("show tables").show()
+---------------+-----------+
| tableName|isTemporary|
+---------------+-----------+
| flights| false|
|hivesampletable| false|
+---------------+-----------+
And now you can test it by running Spark SQL queries against it, such as:
-- Test the table works
spark.sql("select * from flights limit 10").show()
spark.sql("select date, delay, distance, origin, destination from flights limit 10").show()
-- Test a group by statement
spark.sql("select destination, count(1) from flights where origin = 'SEA' group by destination").show()
Now, you can connect Power BI to Spark as noted in the instructions Apache Spark BI using data visualization tools with Azure Databricks.
The only difference is now you will connect to a table (such as flights
above) where its underlying source is an Azure Cosmos DB query.
Based on the initial tests, here are some quick tips:
- Because Power BI can ask a lot of queries (especially if you have a lot of widgets), you may want to shrink the scope of the query by creating a view. For example, if I only care about flights departing from Seattle, I can create the view below that limits Power BI to only request for flights departing Seattle. You're reducing the amount of data transferred as well as taking advantage of Cosmos DB indexing at the same time.
spark.sql("CREATE VIEW originSEA AS SELECT date, distance, delay, origin, destination FROM flights WHERE origin = 'SEA'")
Please do not hesitate to submit issues, suggest any PRs, or contact us at AskCosmosDB@microsoft.com.