Running a dbt project as a job in Databricks Workflows

Databricks Workflows is a highly reliable, managed orchestrator that lets you author and schedule DAGs of notebooks, Python scripts, and dbt projects as production jobs.

Running dbt in a job is currently in private preview. You must be enrolled in the private preview to follow the steps in this guide. Features, capabilities, and pricing may change at any time.

In this guide, you will learn how to update an existing dbt project to run as a job, retrieve dbt run artifacts using the Jobs API, and debug common issues.

Overview

When you run a dbt project as a Databricks Job, the dbt Python process and the SQL it generates both run on the same Automated Cluster.

If you want to run the SQL on, say, a Databricks SQL endpoint or even another cloud data warehouse, you can customize the checked-in profiles.yml file appropriately (see below).

Prerequisites

  • An existing dbt project version controlled in git
  • Access to a Databricks workspace
  • Ability to launch job clusters (using a cluster policy or cluster create permissions), or access to an existing interactive cluster with the dbt-core and dbt-databricks libraries installed, or CAN_MANAGE permission to install dbt-core and dbt-databricks as cluster libraries. We recommend DBR 10.4 or later for better SQL compatibility.
  • Files in Repos must be enabled; it is supported only on Databricks Runtime (DBR) 8.4+ or DBR 11+, depending on the configuration. Make sure the cluster has an appropriate DBR version.
  • Install and configure the Databricks CLI
  • Install jq, a popular open source tool for parsing JSON from the command line
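For example, on a machine with Python and a package manager available, the CLI and jq could be installed and configured roughly as follows (a sketch; adapt to your platform):

$ pip install databricks-cli
$ databricks configure --token   # prompts for your workspace URL and a personal access token
$ brew install jq                # or e.g. apt-get install jq on Debian/Ubuntu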

Run dbt as a production job

In this step, you will create a job that will run the dbt project on a schedule.

Connect Databricks to Git

The dbt task only supports retrieving dbt projects from Git. Please follow the documentation to connect Databricks to Git.

Create a job

  1. Log in to your Databricks workspace
  2. Click the Data Science & Engineering persona in the left navigation bar
  3. Click Workflows
  4. Click Create Job
  5. Click Type and choose dbt
  6. Click Edit next to "Git provider"
  7. In the dialog, enter your Git repository URL, and choose the Git provider. Also, choose a branch / tag / commit e.g. main.
  8. If your dbt project is in the root of the git repository, leave the Path field empty. Otherwise, provide the relative path e.g. /my/relative/path.
  9. You can customize the dbt commands as needed, including any flag accepted by the dbt CLI (see the example after these steps).

[Screenshot: the dbt task type configuration]

  10. By default, Databricks installs a recent version of dbt-databricks from PyPI, which also installs dbt-spark and dbt-core. You can customize this version if you wish.
  11. You can customize the Automated Cluster if you wish by clicking Edit in the Cluster dropdown.
  12. Click Save
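For example, a typical sequence of commands for a scheduled production run might look like the following (a sketch; adjust to your project's needs):

dbt deps
dbt seed
dbt run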

Run the job and view dbt output

You can now run your newly-saved job and see its output.

  1. Click Run Now on the notification that shows up when you save the job
  2. Click the active run and view the dbt output. Note that dbt output is not real-time; it lags behind dbt's progress by several seconds to a minute.

Retrieve dbt artifacts using the Jobs API

A dbt run generates useful artifacts which you may want to retrieve for analysis and more. Databricks saves the contents of /logs and /target directories as a compressed archive which you can retrieve using the Jobs API.

It is currently not possible to refer to a previous run's artifacts e.g. using the --state flag. You can, however, include a known good state in your repository.
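For example, if you check a known good set of artifacts (such as a manifest.json from a previous production run) into the repository, say under a state/ directory (a hypothetical layout), the task's dbt command could build only modified models with:

dbt run --select state:modified+ --state ./state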

dbt-artifacts is a popular dbt package for ingesting dbt artifacts into tables. This is currently not supported on Databricks. Please contact us if you are interested in Databricks supporting this package.

Follow these steps to retrieve dbt artifacts from a job run:

  1. Go to a job in Databricks and copy the Task Run ID. It appears in the sidebar under Task run details when you click on a run.
  2. Enter the following command in your terminal:
$ databricks jobs configure --version=2.1
$ databricks runs get --run-id TASK_RUN_ID | jq .tasks
  3. The above command will return an array of tasks with their run_ids. Find the dbt task's run_id and run this command:
$ DBT_ARTIFACT_URL="$(databricks runs get-output --run-id DBT_TASK_RUN_ID | jq -r .dbt_output.artifacts_link)"
$ curl $DBT_ARTIFACT_URL --output artifact.tar.gz

On macOS or Linux, you can run the following command to expand and decompress the archive:

$ tar -xvf artifact.tar.gz
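Once expanded, the archive contains the familiar dbt artifacts such as run_results.json and manifest.json. As a quick sanity check, you could summarize each node's status with jq (a sketch; the exact path within the extracted archive may differ):

$ jq '.results[] | {unique_id, status}' target/run_results.json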

Common issues

Unable to connect to Databricks

  • You must currently provide a profiles.yml file in the root of the Git repository. Check that this file is present and properly named, e.g. that it is not profile.yml
  • If you do not use the automatically generated profiles.yml, check that your Personal Access Token (PAT) has not expired.
  • Consider adding dbt debug as the first command. This may give you a clue about the failure.
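For example, the task's dbt commands could be extended so that a connectivity check runs before anything else (a sketch):

dbt debug
dbt run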

dbt cannot find my dbt_project.yml file

If you checked out the Git repository before enabling the Files in Repos feature, the cached checkout may be stale. Push a dummy commit to your repository to force a fresh checkout.
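One way to do this without changing any files is an empty commit, for example (standard git commands):

$ git commit --allow-empty -m "Force fresh checkout"
$ git push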

Connecting to different sources (custom profile)

By default, the dbt task type connects to the Automated Cluster that dbt-core is running on, without any configuration changes and without checking in any secrets. It does this by generating a default profiles.yml and telling dbt to use it. There are no restrictions on connecting to other dbt targets such as Databricks SQL, Amazon Redshift, Google BigQuery, Snowflake, or any other supported adapter. The automatically generated profile can be overridden by specifying an alternative profiles directory in the dbt command using --profiles-dir <dir>, where <dir> should be a relative path like . or ./my-directory.
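For example, if a custom profiles.yml were checked in under a profiles/ subdirectory of the project (a hypothetical layout), the dbt command in the task could be written as:

dbt run --profiles-dir ./profiles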

If you'd like to connect to multiple outputs and include the current Automated Cluster as one of those, the following configuration can be used without exposing any secrets:

databricks_demo:
  target: databricks_cluster
  outputs:
    databricks_cluster:
      type: databricks
      connect_retries: 5
      connect_timeout: 180
      schema: "<your-schema>"
      threads: 8 # This can be increased or decreased to control the parallelism
      host: "{{ env_var('DBT_HOST') }}"
      http_path: "sql/protocolv1/o/{{ env_var('DBT_ORG_ID') }}/{{ env_var('DBT_CLUSTER_ID') }}"
      token: "{{ env_var('DBT_ACCESS_TOKEN') }}"
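With a profile like the one above checked into the project root, a task command could select the cluster target explicitly, for example:

dbt run --profiles-dir . --target databricks_cluster

Additional outputs (for example a Databricks SQL endpoint or another warehouse) can be added as siblings of databricks_cluster and selected the same way with --target.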