
Draft: #756 - implement python workflow submissions #762

Merged: 51 commits into databricks:1.9.latest on Oct 10, 2024

Conversation

@kdazzle (Contributor) commented Aug 8, 2024

WIP - Stubs out implementation for #756

This pretty much implements what a workflow job submission type would look like, though I'm sure I'm missing something. Tests haven't been added yet.

Sample

Aside from the new submission method, models are unchanged. Here is what one could look like:

# my_model.py
import pyspark.sql.types as T


def model(dbt, session):
    dbt.config(
        materialized='incremental',
        submission_method='workflow_job'  # run this model as a Databricks workflow job
    )

    output_schema = T.StructType([
        T.StructField("id", T.StringType(), True),
        T.StructField("odometer_meters", T.DoubleType(), True),
        T.StructField("timestamp", T.TimestampType(), True),
    ])
    # `session` is the SparkSession that dbt passes into the model
    return session.createDataFrame(data=session.sparkContext.emptyRDD(), schema=output_schema)

The config for a model could look like (forgive my jsonification...yaml data structures still freak me out):

models:
  - name: my_model
    config:
      workflow_job_config:
        email_notifications: {
          on_failure: ["reynoldxin@databricks.com"]
        }
        max_retries: 2
        timeout_seconds: 18000
        existing_job_id: 12341234  # not part of Databricks API (+ optional)
        additional_task_settings: {  # not part of Databricks API (+ optional)
          "task_key": "my_dbt_task"
        }
        post_hook_tasks: [{  # not part of Databricks API (+ optional)
          "depends_on": [{ "task_key": "my_dbt_task" }],
          "task_key": "OPTIMIZE_AND_VACUUM",
          "notebook_task": {
            "notebook_path": "/my_notebook_path",
            "source": "WORKSPACE"
          }
        }]
        grants:  # not part of Databricks API (+ optional)
          view: [
            {"group_name": "marketing-team"}
          ]
          run: [
            {"user_name": "alighodsi@databricks.com"}
          ]
          manage: []
      job_cluster_config:
        spark_version: "15.3.x-scala2.12"
        node_type_id: "rd-fleet.2xlarge"
        runtime_engine: "STANDARD"
        data_security_mode: "SINGLE_USER"
        autoscale: {
          "min_workers": 1,
          "max_workers": 4
        }
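
For orientation, here is a rough, hypothetical sketch (a Python dict literal, not the adapter's actual code) of the kind of Jobs API 2.1 payload (POST /api/2.1/jobs/create) that a config like the one above could map to. The dbt model run becomes one task on the job cluster and post_hook_tasks are appended as extra tasks, while existing_job_id, additional_task_settings, and grants are handled on the dbt side rather than passed through to jobs/create:

# Hypothetical illustration only -- roughly what the create-job payload could look like.
job_spec = {
    "name": "dbt__my_database-my_schema-my_model",  # default name; see Explanation below
    "email_notifications": {"on_failure": ["reynoldxin@databricks.com"]},
    "max_retries": 2,
    "timeout_seconds": 18000,
    "tasks": [
        {
            # the dbt model run itself
            "task_key": "my_dbt_task",
            "new_cluster": {
                "spark_version": "15.3.x-scala2.12",
                "node_type_id": "rd-fleet.2xlarge",
                "runtime_engine": "STANDARD",
                "data_security_mode": "SINGLE_USER",
                "autoscale": {"min_workers": 1, "max_workers": 4},
            },
            "notebook_task": {"notebook_path": "<path of the uploaded model notebook>"},
        },
        # post_hook_tasks appended as provided
        {
            "depends_on": [{"task_key": "my_dbt_task"}],
            "task_key": "OPTIMIZE_AND_VACUUM",
            "notebook_task": {"notebook_path": "/my_notebook_path", "source": "WORKSPACE"},
        },
    ],
}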

Explanation

For the dbt-specific configs I added on top of the Databricks API attributes, I tried to strike a rough balance between the dbt convention of requiring minimal configuration and keeping the full flexibility of the Databricks API. Attribute names try to split the difference between the Databricks API and dbt conventions. Happy to change the approach for anything.

  • Added existing_job_id in case users want to reuse an existing workflow. If no name is provided in this config, the workflow is renamed to the default job name (currently f"dbt__{self.database}-{self.schema}-{self.identifier}")
  • Job names must be unique unless existing_job_id is also provided
  • The task key for the model-run task is hardcoded as task_a, but it can be overridden via additional_task_settings
  • Allow for "post_hook tasks"
    • Each task can use a different cluster via Databricks' new_cluster or existing_cluster_id; leaving both blank means serverless
    • post_hook might be a misnomer, because you could technically make the dbt model task depend on one of these tasks, which would turn it into a pre-hook
  • grants - allow permissions to be set on the workflow so that additional users/teams can run the job ad hoc if needed (initial runs, backfills, etc.). The owner is the user/service principal that deployed the workflow, and the format follows the Databricks API, where you specify whether the grantee is a user, group, or service principal
  • additional_task_settings adds to/overrides the default dbt model task (see the sketch after this list)
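
To make the merging and grants behavior concrete, here is a small hypothetical sketch (function and variable names are illustrative, not the code in this PR) of how the default model task could be combined with additional_task_settings and post_hook_tasks, and how grants might translate to a Databricks job permissions payload:

# Hypothetical sketch only; the adapter's real implementation may differ.
def build_job_tasks(model_notebook_path: str, job_cluster: dict, workflow_config: dict) -> list:
    default_task = {
        "task_key": "task_a",  # hardcoded default, overridable via additional_task_settings
        "notebook_task": {"notebook_path": model_notebook_path},
    }
    if job_cluster:
        default_task["new_cluster"] = job_cluster  # omit entirely for a serverless job cluster
    # additional_task_settings adds to / overrides the default dbt model task
    model_task = {**default_task, **workflow_config.get("additional_task_settings", {})}
    # post_hook_tasks are appended verbatim (the model task may also depend on them)
    return [model_task, *workflow_config.get("post_hook_tasks", [])]


def build_access_control_list(grants: dict) -> list:
    # Assumed mapping from the dbt-side grant levels to Databricks job permission levels
    levels = {"view": "CAN_VIEW", "run": "CAN_MANAGE_RUN", "manage": "CAN_MANAGE"}
    acl = []
    for grant_level, principals in grants.items():
        for principal in principals:  # e.g. {"group_name": "marketing-team"}
            acl.append({**principal, "permission_level": levels[grant_level]})
    return acl

For example, build_access_control_list({"view": [{"group_name": "marketing-team"}]}) would produce [{"group_name": "marketing-team", "permission_level": "CAN_VIEW"}], with the deploying user or service principal remaining the owner.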

Todo:

  • Reuse all_purpose_cluster attribute, similar to job_cluster_config?
  • Can I use a serverless job cluster? (by not defining any cluster)
  • Fix the run tracker
  • What happens if the workflow is already running?
    • I'd like the new dbt job run to start tracking the current Databricks workflow run rather than failing (a rough sketch of one approach follows this list)
  • Log when workflow permissions are being changed? (Kind of mimicking Terraform apply logs, which have been helpful in the past when table permissions were unexpectedly broadened)
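
For the "already running" item, one rough, hypothetical approach (not part of this PR) would be to look for an active run of the job via the Jobs API before triggering a new one, and attach to that run instead of failing:

import requests


def get_or_trigger_run(host: str, token: str, job_id: int) -> int:
    """Return the run_id of an active run for the job, or trigger a new run."""
    headers = {"Authorization": f"Bearer {token}"}
    resp = requests.get(
        f"{host}/api/2.1/jobs/runs/list",
        headers=headers,
        params={"job_id": job_id, "active_only": "true"},
    )
    resp.raise_for_status()
    active_runs = resp.json().get("runs", [])
    if active_runs:
        return active_runs[0]["run_id"]  # attach to the in-flight workflow run
    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now", headers=headers, json={"job_id": job_id}
    )
    resp.raise_for_status()
    return resp.json()["run_id"]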

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@kdazzle kdazzle changed the title #756 - stub out implementation for python workflow submissions Draft: #756 - stub out implementation for python workflow submissions Aug 8, 2024
Kyle Valade added 2 commits August 12, 2024 14:21
Signed-off-by: Kyle Valade <kylevalade@rivian.com>
@kdazzle kdazzle changed the title Draft: #756 - stub out implementation for python workflow submissions Draft: #756 - implement for python workflow submissions Aug 14, 2024
@kdazzle kdazzle changed the title Draft: #756 - implement for python workflow submissions Draft: #756 - implement python workflow submissions Aug 14, 2024
@benc-db (Collaborator) commented Sep 27, 2024

@kdazzle can you rebase/target your PR against 1.9.latest? I have a couple of things that I need to wrap up, but I'm planning to take some version of this into the 1.9 release.

@kdazzle kdazzle changed the base branch from main to 1.9.latest September 27, 2024 21:12
@@ -247,97 +247,3 @@ def test_build_job_spec_with_post_hooks(self, mock_api_client):
assert len(result["tasks"]) == 2
assert result["tasks"][1]["task_key"] == "task_b"
assert result["tasks"][1]["new_cluster"]["spark_version"] == "14.3.x-scala2.12"

@patch("dbt.adapters.databricks.python_models.python_submissions.DatabricksApiClient")
def test_build_job_spec_with_post_hooks_serverless_job_cluster(self, mock_api_client):
@kdazzle (Contributor, Author) commented:
Removing these since the logic to muck around with the cluster settings in additional tasks was removed here

@benc-db (Collaborator) commented Oct 10, 2024

Going to merge in 1.9.latest changes (which is basically only 1.8 changes), ensure tests still pass, then merge.

@benc-db benc-db merged commit 0e821b0 into databricks:1.9.latest Oct 10, 2024
21 checks passed