adding git and databricks documentation #41

Merged
merged 6 commits into from
Apr 3, 2024
234 changes: 234 additions & 0 deletions ADA/git_databricks.qmd
@@ -0,0 +1,234 @@
---
title: "Databricks and version control"
---

<p class="text-muted">

Guidance for analysts on how to connect Databricks to GitHub and Azure DevOps

</p>

------------------------------------------------------------------------

## Why should I use git with Databricks?

For analytical projects developed on Databricks in the DfE, we recommend using git and Azure DevOps for sharing, collaborating and proper version control of work.


### Easier collaboration

------------------------------------------------------------------------

When you work on notebooks in Databricks without a git connection, they are usually saved in your own personal Workspace, where only you can access them unless you share them. This can cause problems when someone else needs to run your code: if you leave your team, have a non-working day or take annual leave, your team members will not be able to access any of the code stored in your notebooks. If the notebooks are stored in a GitHub or Azure DevOps repo, everyone in your team can access them by cloning the repo in Databricks.

\


------------------------------------------------------------------------

### Version control

------------------------------------------------------------------------

Databricks autosaves your notebooks as you work on them, which makes it harder to keep track of distinct versions. If you use git, you'll be able to see the full version history of your work and easily roll back to older versions if you need to. It also makes it simpler to switch between different methodologies, and supports proper code QA and review.

\

------------------------------------------------------------------------

## GitHub or Azure DevOps?


Either GitHub or Azure DevOps repos can be used in connection with Databricks, but we advise you to follow the guidance relating to public and private repos in our [What is git for?](https://dfe-analytical-services.github.io/analysts-guide/learning-development/git.html#what-is-git-for) section.

\

------------------------------------------------------------------------

## Is it safe?


The connection between git and Databricks is authenticated using a personal access token generated from your git account, so Databricks can only act with the permissions you grant to that token. Access tokens expire after a set period of time. When yours expires, the connection between git and Databricks will stop working until you generate a new token and enter it in Databricks.

Additionally, when you commit and push notebooks through the Databricks interface, it will automatically clear any output cells from your notebook, meaning that you cannot accidentally include any unpublished data in there.
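
To make the idea of clearing outputs concrete, here is a minimal sketch of what removing outputs from a notebook looks like. It assumes a notebook in the standard Jupyter (`.ipynb`) JSON layout and uses a made-up example notebook; it illustrates the concept only and is not how Databricks implements it internally.

```python
import copy


def clear_outputs(notebook: dict) -> dict:
    """Return a copy of a Jupyter-format notebook with code cell outputs removed."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            # Drop anything the cell printed or displayed, and its run counter.
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# Hypothetical notebook: one code cell that has displayed some data, one markdown cell.
example_notebook = {
    "cells": [
        {
            "cell_type": "code",
            "source": ["display(df)"],
            "execution_count": 3,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["unpublished rows"]}
            ],
        },
        {"cell_type": "markdown", "source": ["# Notes"]},
    ]
}

cleaned_notebook = clear_outputs(example_notebook)
```

The code itself is kept in the cleaned notebook; only the displayed results are removed, which is why unpublished data shown in an output does not end up in the repo.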

\

------------------------------------------------------------------------

# Setting up a connection to Azure DevOps

------------------------------------------------------------------------

## Prerequisites


- An Azure DevOps account and access to the repo you need to connect to
- A Databricks account and access to your notebooks

\

------------------------------------------------------------------------

## Getting set up


### Access Tokens


Access tokens are long strings of numbers and letters that act like a password between two services. They identify the user and their permissions from one service to another.

In this case, we will generate an access token in DevOps and give it to Databricks. To generate your Azure DevOps access token, go to DevOps and click the user settings icon to the left of your initials in the top right of your screen - it looks like a person with a cog next to them:

![](../images/devops-user-settings.PNG)

In the user settings menu, click Personal Access Tokens. On the screen that appears, click the blue "New Token" button in the top right:

![](../images/devops-new-token.PNG)

The following window should appear:

![](../images/devops-create-token.PNG)

- Give the token a sensible name so that you can identify it in your list of tokens.
- The organisation field should be pre-populated for you and can be left alone.
- You can modify the expiration date. Every time the token expires, you will have to follow this guidance again to set up a new token and reconnect to git, so choose something sensible.
- You should make sure that "Full access" is selected under the Scopes heading.
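
If you want to keep track of when a token will need replacing, a small helper like the one below can work out how many days are left. This is only a sketch: the function names, the example dates and the 14 day warning window are all assumptions made for illustration.

```python
from datetime import date


def days_until_expiry(expiry: date, today: date) -> int:
    """Number of whole days between today and the token's expiry date."""
    return (expiry - today).days


def needs_renewal(expiry: date, today: date, warn_days: int = 14) -> bool:
    """True when the token expires within the warning window (or has already expired)."""
    return days_until_expiry(expiry, today) <= warn_days


# Hypothetical dates: a token created on 3 April 2024 with a 90 day lifetime.
token_expiry = date(2024, 7, 2)
remaining = days_until_expiry(token_expiry, today=date(2024, 4, 3))
due_soon = needs_renewal(token_expiry, today=date(2024, 6, 25))
```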

When you have entered all of the required information, click "Create" at the bottom of the window.

A "Success!" window will open containing your new access token. **You must copy the token immediately, as you will not be able to access it again once this window is closed.** If you do not copy the token, you will have to generate a new one.
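
To illustrate how a token "acts like a password", the sketch below builds the HTTP header that a service would send when calling Azure DevOps with a personal access token. Azure DevOps accepts the token through Basic authentication with an empty username. The function name and the placeholder token are made up for this example; never paste a real token into code or a notebook.

```python
import base64


def azure_devops_auth_header(pat: str) -> dict:
    """Build a Basic auth header using an empty username and the token as the password."""
    encoded = base64.b64encode(f":{pat}".encode("ascii")).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Placeholder value only; a real token should come from a secure store, not from code.
example_header = azure_devops_auth_header("not-a-real-token")
```

Anyone who holds the token can build the same header, which is why it needs to be treated with the same care as a password.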

\

------------------------------------------------------------------------

### Connecting to Databricks


Now that you have your access token, you should go straight to Databricks. In the top right corner of the Databricks window, click your username and then "User Settings":

![](../images/databricks-user-settings.PNG)

You should then select "Linked Accounts" from the User menu on the left. The following page will open:

![](../images/databricks-linked-accounts.PNG)

- We recommend immediately pasting your copied access token into the "Token" field at the bottom of the page to avoid losing it.
- Set the Git provider to "Azure DevOps Services (Personal Access Token)".
- Enter your email address into the "Git provider username or email" field.

You can then click Save at the bottom of the page, and now your connection between Azure DevOps and Databricks is established!
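
For reference, the same three pieces of information can also be sent to Databricks programmatically. The sketch below only builds the request body and does not send anything. The field names and the `azureDevOpsServices` value reflect our understanding of the Databricks Git credentials REST API and should be checked against the current Databricks API reference before use; the function name and example values are made up.

```python
def build_git_credential_payload(email: str, token: str) -> dict:
    """Map the three fields on the Linked Accounts page to a request body."""
    return {
        # Assumed API value corresponding to "Azure DevOps Services (Personal Access Token)".
        "git_provider": "azureDevOpsServices",
        # The "Git provider username or email" field.
        "git_username": email,
        # The "Token" field.
        "personal_access_token": token,
    }


# Placeholder values for illustration only.
payload = build_git_credential_payload("jane.doe@education.gov.uk", "not-a-real-token")
```

For most analysts the Linked Accounts page described above is all that is needed.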

\


------------------------------------------------------------------------

### Connecting to repos and cloning


As with any other way of working with git, the first step is to clone your repo, this time inside Databricks.

In the blue menu on the left, click Workspace, and then in the Workspace menu that appears, click Repos. The menus should look like this:

![](../images/databricks-workspace-menu.PNG)

On the Repos screen, click the grey "Add repo" button in the top right corner. The "Add Repo" window will appear:

![](../images/databricks-add-repo.PNG)

- First, go to Azure DevOps and copy the link to clone the repo as you usually would.
- Paste this link into the "Git repository URL" field.
- Under "Git Provider", select "Azure DevOps Services".
- The "Repository name" field should be automatically populated when you enter the URL.

You can then click "Create Repo". When the repo is created, you will be able to see it under your name in the Repos menu:

![](../images/databricks-view-repo.PNG)
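
The repository name that gets filled in is simply the last part of the clone URL. The sketch below shows one way to derive it. It is an illustration rather than the logic Databricks actually uses, and the organisation, project and repo names in the example are made up.

```python
from urllib.parse import unquote, urlparse


def repo_name_from_url(clone_url: str) -> str:
    """Return the final path segment of a clone URL, without any '.git' suffix."""
    path = urlparse(clone_url).path.rstrip("/")
    name = unquote(path.rsplit("/", 1)[-1])
    if name.endswith(".git"):
        name = name[: -len(".git")]
    return name


# Hypothetical Azure DevOps clone URL.
devops_name = repo_name_from_url(
    "https://my-org@dev.azure.com/my-org/my-project/_git/my-analysis-repo"
)
```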

#### Sparse checkout and Cone Patterns

Databricks cannot clone very large repos, and you will receive an error message if you try. It is best practice not to have a repo of this size, but if you need to work with one you can perform a sparse checkout, which clones only a selection of the items in your repo.

To do this, select the "Advanced" option in the "Add repo" menu and tick the "Sparse checkout mode" option. You can then tell Databricks which parts of the repo you would like it to clone by entering Cone Patterns in the box provided. If you do not specify any Cone Patterns, the Databricks default is to clone only the files in the root folder and none from any subdirectories. You can enter multiple patterns, but if the folders that they refer to contain more than 800MB of data then your clone will still fail.

You might define your Cone Patterns as something like:

- `folder_a`
- `folder_b/subfolder_c`

The second example will only clone subfolder_c and not the rest of the contents of folder_b.

Please note: You cannot currently disable sparse checkout mode once it is enabled, but you can modify Cone Patterns. If you create a new folder, you must add it to your Cone Pattern list before you can commit and push. You can find more information on sparse checkout and Cone Patterns on the [Databricks website](https://docs.databricks.com/en/repos/git-operations-with-repos.html).
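
Before creating a new folder in a sparse checkout, it can help to check whether it already sits inside one of your Cone Patterns. The sketch below does this with a simple path comparison. It is a simplified illustration of the rule described above, not the matching logic used by git or Databricks, and the function name is made up.

```python
from pathlib import PurePosixPath


def is_covered(folder: str, cone_patterns: list[str]) -> bool:
    """True if the folder is one of the cone pattern folders or sits inside one."""
    target = PurePosixPath(folder.strip("/"))
    for pattern in cone_patterns:
        cone = PurePosixPath(pattern.strip("/"))
        if target == cone or cone in target.parents:
            return True
    return False


# The two example patterns from this guidance.
patterns = ["folder_a", "folder_b/subfolder_c"]
new_subfolder_ok = is_covered("folder_a/new_outputs", patterns)
sibling_ok = is_covered("folder_b/other_subfolder", patterns)
```

Here `folder_a/new_outputs` is inside an existing pattern, whereas `folder_b/other_subfolder` is not and would need adding to the Cone Pattern list first.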

\


------------------------------------------------------------------------

# Working in repos in Databricks

------------------------------------------------------------------------

## Folders in Databricks


To be able to add your notebooks to a repo, you need to make sure that you save them in the correct place.
In Databricks, you have your Workspace. Inside your Workspace is your "local" Users folder, and your repos. In the image below, the User folder is highlighted blue:

![](../images/databricks-folders.PNG)

You can think of your User folder as being a bit like "My Documents" on your laptop. When you save things there, only you can see and access them. Git cannot "see" things inside your User folder. However, git **can** see things inside your Repos folder. This means that when you save something in your Repos folder, it can be committed and pushed to git. It is therefore good practice to always save notebooks in your Repos folder instead of in your User folder.
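
As a quick illustration of the difference, the sketch below checks whether a notebook path sits under the Repos folder. It assumes workspace paths that start with `/Repos/` or `/Users/` (optionally prefixed with `/Workspace`), and the example paths are made up, so treat it as a sketch of the idea rather than a definitive check.

```python
def is_in_repos_folder(workspace_path: str) -> bool:
    """True if the path's top-level folder is Repos, so git can see the file."""
    parts = [part for part in workspace_path.split("/") if part]
    # Some paths are shown with a leading /Workspace, which we skip over.
    if parts and parts[0] == "Workspace":
        parts = parts[1:]
    return bool(parts) and parts[0] == "Repos"


# Hypothetical notebook locations.
in_repo = is_in_repos_folder("/Repos/jane.doe@education.gov.uk/my-analysis-repo/clean_data")
in_user_folder = is_in_repos_folder("/Users/jane.doe@education.gov.uk/scratch_notebook")
```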

\

------------------------------------------------------------------------

## Git pull, commit, and push in Databricks


### Git pull


You can access the menu to pull, commit and push from several places within Databricks. This interface is the same whether you're working with a DevOps repo or a GitHub repo.

At the top of any notebook that's saved in a repo, you'll see a small grey branch icon with the name of the current branch next to it. In this example, the notebook is on the main branch:

![](../images/databricks-main-repo-notebook.PNG)

You can also access the git menu by clicking the branch name next to the repo name within the Repos folder:

![](../images/databricks-main-repo-menu.PNG)

If you click the branch name, it'll open the git interface within Databricks:

![](../images/databricks-git-interface.PNG)

From here, you can perform git pull by clicking the pull icon in the top right. There is a dropdown menu in the top left corner that allows you to change the branch or create a new branch if required.

\

------------------------------------------------------------------------

### Git commit and push


When you have made changes to a notebook, it will appear in the Changes section of the git interface. You can also see the actual changes that have been made in the right-hand box to make sure that you're committing the correct file:

![](../images/databricks-git-commit.PNG)

In Databricks, you commit and push as one action, rather than as two separate ones. Enter your commit message into the "Commit message" box (you can ignore the Description box) and click the "Commit & Push" button.
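
If you are used to git on the command line, the "Commit & Push" button roughly corresponds to the sequence of commands sketched below. The sketch only builds the commands as lists and does not run them. The function name, the default `origin` remote and the use of `git add --all` are assumptions for illustration, not a description of what Databricks runs internally.

```python
import shlex


def commit_and_push_commands(
    message: str, branch: str = "main", remote: str = "origin"
) -> list[list[str]]:
    """Return the git commands that a combined commit and push would involve."""
    return [
        ["git", "add", "--all"],           # stage the changed files
        ["git", "commit", "-m", message],  # record them with the commit message
        ["git", "push", remote, branch],   # send the new commit to the remote repo
    ]


commands = commit_and_push_commands("Update QA checks in cleaning notebook")
printable = [shlex.join(command) for command in commands]
```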

------------------------------------------------------------------------

### Additional git commands in the Databricks interface

You can access additional git features such as merge, rebase and reset directly within the Databricks interface by clicking the three dots in the menu, as shown in the image below:

![](../images/reset-local-branch.PNG)

When merging, you are also able to resolve merge conflicts inside Databricks itself.

**Please be warned** that we would usually advise against using git reset as you can permanently lose changes you've made and this cannot be undone.

There is additional guidance on each of these advanced features in the [Databricks manual](https://docs.databricks.com/en/repos/git-operations-with-repos.html).

2 changes: 1 addition & 1 deletion RAP/rap-support.qmd
@@ -25,7 +25,7 @@ Learning resources and materials for [SQL](../learning-development/sql.html), [R

* In person workshops covering specific technical skills in practice
* 3 hours long, with people working in small groups
-* We currently offer two workshops: introduction to git and DevOps, and introduction to R and RAP. We can travel to your site to deliver these.
+* We currently offer two workshops: introduction to git and Azure DevOps, and introduction to R and RAP. We can travel to your site to deliver these.
* Contact [statistics.development@education.gov.uk](mailto:statistics.development@education.gov.uk) to register interest in workshops happening at your site or to request new topics!

We’re also considering a dedicated G6 / G7 programme to build confidence and set expectations. This may include:
1 change: 1 addition & 0 deletions _quarto.yml
@@ -110,6 +110,7 @@ website:
contents:
- ADA/ada.qmd
- ADA/databricks_rstudio.qmd
- ADA/git_databricks.qmd

format:
html:
Binary file added images/databricks-add-repo.png
Binary file added images/databricks-folders.png
Binary file added images/databricks-git-commit.png
Binary file added images/databricks-git-interface.png
Binary file added images/databricks-linked-accounts.png
Binary file added images/databricks-main-repo-menu.png
Binary file added images/databricks-main-repo-notebook.png
Binary file added images/databricks-user-settings.png
Binary file added images/databricks-view-repo.png
Binary file added images/databricks-workspace-menu.png
Binary file added images/devops-create-token.png
Binary file added images/devops-new-token.png
Binary file added images/devops-user-settings.png
Binary file added images/reset-local-branch.png
2 changes: 1 addition & 1 deletion learning-development/learning-support.qmd
@@ -25,7 +25,7 @@ Our current workshops are:

In this workshop, we cover:

-* What are git, DevOps and GitHub and what are the differences?
+* What are git, Azure DevOps and GitHub and what are the differences?
* How can they help your work?
* How to start working with git and DevOps in the DfE ecosystem
* Managing projects and tasks with DevOps Boards
2 changes: 1 addition & 1 deletion statistics-production/embedded-charts.qmd
@@ -19,7 +19,7 @@ Prior to publication, either dummy data or the already published data should be
If you need a live version of the dashboard with the unpublished data for pre-release reviews and access, the following options are available:

* Demo the chart in R-Studio
-* Create a copy of the chart repository on DevOps, deploy this to rsconnect and use the rsconnect link as the embedded URL prior to publication (note this will need updating to the public ShinyApps link before the publication goes live).
+* Create a copy of the chart repository on Azure DevOps, deploy this to rsconnect and use the rsconnect link as the embedded URL prior to publication (note this will need updating to the public ShinyApps link before the publication goes live).

We are currently putting in place a case to provide an internal shiny server platform, which will allow greater control of security around our data and allow draft Shiny applications to use unpublished data for internal use.
