Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new guidance page for setting up personal cluster (ADA) with R/… #51

Merged
merged 2 commits into from
May 20, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
110 changes: 110 additions & 0 deletions ADA/databricks_rstudio_personal_cluster.qmd
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
---
title: "Setup Databricks personal compute cluster with RStudio"
---

<p class="text-muted">

The following instructions set up an ODBC connection between your laptop and your DataBricks cluster, which can then be used in R/RStudio to query data using an ODBC based package or `sparklyr`. Personal clusters are able to run SQL, R, python and scala. They can be used within the DataBricks environment, or through R studio and can be set up yourself if you don't have access to a SQL warehouse or shared cluster.

</p>

------------------------------------------------------------------------

# Pre-requisites

------------------------------------------------------------------------

You must have:

- Access to Databricks
- Access to a personal cluster on DataBricks
- R and RStudio downloaded

------------------------------------------------------------------------

# Downloading an ODBC driver

------------------------------------------------------------------------

1. Install the 'Simba Spark ODBC' driver from the software centre.

i) Open the Software Centre via the start menu.

ii) In the 'Applications' tab, click 'Simba Spark ODBC Driver 64-bit'. ![](../images/databricks-software-centre.png)

iii) Click install.

2. Get connection details for the cluster from Databricks. To set up the connection you will need a few details from your cluster within DataBricks.

i) Login to [Databricks](https://adb-6882499576863257.17.azuredatabricks.net/?o=6882499576863257)

ii) Click on the 'Compute' tab in the sidebar. ![](../images/databricks-compute.png)

iii) Click on the name of the cluster you want to connect to, and click the 'Advanced options' at the bottom of the cluster page.

iv) Click the 'JDBC/ODBC' tab under 'Advanced options'

v) Make a note of the 'Server hostname', 'Port', and 'HTTP Path'.

3. Get a personal access token from Databricks for authentication.

i) In Databricks, click on your email address in the top right corner, then click 'User settings'.

ii) Go to the 'Developer' tab in the side bar. Next to 'Access tokens', click the 'Manage' button. ![](../images/databricks-access-tokens.png)

iii) Click the 'Generate new token' button.

iv) Name the token, then click 'Generate'. *Note that access tokens will only last as long as the value for the 'Lifetime (days)' field. After this period the token will expire, and you will need to create a new one to re-authenticate.*

v) Make a note of the 'Databricks access token' it has given you. **It is important to copy this somewhere as you will not be able to see it through Databricks again.**

4. Setup ODBC connection from your laptop. We now have all the information we need to setup a connection between our laptop and DataBricks.

i) In the start menu, search 'ODBC' and open 'ODBC Data Sources (64-bit)'.

ii) On the 'User DSN' tab click the 'Add...' button.

iii) In the 'Create New Data Source' window, select 'Simba Spark ODBC Driver' and click 'Finish'.

iv) In the 'Simba Spark ODBC Driver DSN Setup' window,

a. Enter a 'Data Source Name' and 'Description'. Choose a short and sensible data source name and note it down as this is what you will use to connect to Databricks through RStudio. *As you can set up more than one cluster on Databricks, use the description to make clear which cluster this connection is for. The description shown below describes that this connection is using an 8 core cluster on Databricks Runtime Environment 13.*
b. Set the remaning options to the settings below.
- Enter the 'Server Hostname' for your cluster in the 'Host(s):' field (you noted this down in step 2).
- In the Port section, remove the default number and use the Port number you noted in step 2.
- Set the Authentication Mechanism to 'User Name and Password'.
- Enter the word 'token' into the 'User Name:' field, then enter your 'Databricks access token' in the 'Password:' field.
- Change the Thrift Transport option to HTTP.
- Click the 'HTTP Options...' button and enter the 'HTTP Path' of your Databricks cluster, then click 'Okay'.
- Click the 'SSL Options...' button and tick the 'Enable SSL' box, then click the 'OK' button.
- ![](../images/odbc-driver-settings.png)
c. Click the 'Test' button to verify the connection has worked. You should see the following message. *If you get an error here, repeat steps 5.e.i -- 5.e.ix again and ensure all the values are correct.*

![](../images/databricks-test-connection.png)

d. Click the 'OK' button to exit the 'Test Results' window, then the 'OK' button in the 'Simba Spark ODBC Driver DSN Setup' window.

5. Connect through RStudio. Watch the below video and view the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect) for methods on connecting to Databricks and querying data from RStudio.

------------------------------------------------------------------------

# Pulling data into R studio from Databricks

------------------------------------------------------------------------

Once you have set up an ODBC connection as detailed above, you can then use that connection to pull data directly from Databricks into R Studio. Charlotte recorded a video demonstrating two possible methods of how to do this. The recording is embedded below:

::: {align="center"}
<iframe src="https://educationgovuk.sharepoint.com/sites/lvewp00086/_layouts/15/embed.aspx?UniqueId=5fad039e-a763-40c8-8ea0-403aea712f4c&amp;embed=%7B%22ust%22%3Atrue%2C%22hv%22%3A%22CopyEmbedCode%22%7D&amp;referrer=StreamWebApp&amp;referrerScenario=EmbedDialog.Create" width="640" height="360" frameborder="0" scrolling="no" allowfullscreen title="ADA_Rstudio_2.mp4">

</iframe>
:::

A template of all of the code used in the above video can be found in the [ADA_RStudio_connect GitHub repo](https://github.com/dfe-analytical-services/ADA_RStudio_connect).

Key takeaways from the video and example code:

- The main change here compared to connecting to SQL databases is the connection method. The installation and setup of the ODBC driver are all done pre-code, and the only part of the code that will need updating is your connection (usually your con variable).
- If your existing code was pulling in tables from SQL via the `RODBC` package or the `dbplyr` package, then this code should in theory run with minimal edits needed.
- If you were writing tables back into SQL from R, this is where your code may need the most edits.
- If your code is stored in a repo where multiple analysts contribute to and run the code, in order for the code to run for everyone you will all need to individually install the ODBC driver and **give it the same name** so that when the `con` variable is called, the name used in the code matches everyone's individual driver and runs for everyone. If this is the case, please **add a note about this to your repo's readme file** to help your future colleagues.
Original file line number Diff line number Diff line change
@@ -1,8 +1,9 @@
---
title: "Setup Databricks with RStudio"
title: "Setup Databricks SQL Warehouse with RStudio"
---

<p class="text-muted">The following instructions set up an ODBC connection between your laptop and your DataBricks cluster, which can then be used in R/RStudio to query data using an ODBC based package.</p>
<p class="text-muted">The following instructions set up an ODBC connection between your laptop and your DataBricks SQL warehouse, which can then be used in R/RStudio to query data using an ODBC based package.
SQL Warehouses are able to run SQL and can be used within the DataBricks environment, or through RStudio to run SQL code. The `sparklyr` package can also be used from within RStudio (only) as it converts the request for data to Spark SQL behind the scenes.</p>

------------------------------------------------------------------------

Expand Down
3 changes: 2 additions & 1 deletion _quarto.yml
Original file line number Diff line number Diff line change
Expand Up @@ -109,7 +109,8 @@ website:
- section: "ADA and Databricks"
contents:
- ADA/ada.qmd
- ADA/databricks_rstudio.qmd
- ADA/databricks_rstudio_sql_warehouse.qmd
- ADA/databricks_rstudio_personal_cluster.qmd
- ADA/git_databricks.qmd

format:
Expand Down
Binary file added images/databricks-compute.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
8 changes: 6 additions & 2 deletions index.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -97,8 +97,12 @@ We hope it can prove a useful community driven resource for everyone from the mo
[Analytical Data Access (ADA) and Databricks](ADA/ada.html)
- Guidance for analysts on how to interact with and use data stored in ADA using Databricks

[Setup Databricks with RStudio](ADA/databricks_rstudio.html)
- Guidance for analysts on how to connect to Databricks from RStudio.
[Setup Databricks SQL Warehouse with RStudio](ADA/databricks_rstudio_sql_warehouse.html)
- Guidance for analysts on how to connect to a Databricks SQL Warehouse from RStudio.

[Setup Databricks personal cluster Warehouse with RStudio](ADA/databricks_rstudio_personal_cluster.html)
- Guidance for analysts on how to connect to a Databricks personal cluster from RStudio.


## Contact us

Expand Down
2 changes: 1 addition & 1 deletion learning-development/r.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -627,6 +627,6 @@ That command should return the path that you entered for Chrome.

## Using R with ADA / Databricks

See our [guidance on how to connect to Databricks from R Studio](../ADA/databricks_rstudio.html)
See our [guidance on how to connect to Databricks SQL Warehouse from R Studio](../ADA/databricks_rstudio_sql_warehouse.html), and [how to connect to Databricks personal cluster from R studio](../ADA/databricks_rstudio_personal_cluster.html)

You can also view example R code using the [Databricks code template notebooks](../ADA/ada.html#databricks-code-templates)