
Feature/configure serverless dataproc batch #381

Conversation

@torkjel commented Nov 9, 2022

resolves #350
and supersedes #372

Description

This adds support for configuring all input values of a GCP serverless Dataproc batch job. It does so by adding a dataproc_batch configuration key which can hold arbitrary configuration. It's the user's responsibility to ensure this corresponds to the structure of the google.cloud.dataproc.v1.Batch object (https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.Batch).

By not validating this configuration in dbt, one automatically gains full coverage of current features and support for new features as they are made available. One example is the ExecutionConfig.idle_ttl property, which is in the Dataproc API but not yet in the Python client libs.

If the user attempts to set illegal configuration, this will typically cause an exception:

  • when updating the Batch object, if the configuration structure does not match the Batch class;
  • when submitting the job, if the value of a configuration setting is illegal.

One risk with this approach is that the code which reconciles the yaml configuration with the google.cloud.dataproc.v1.Batch object might not correctly handle future additions if they differ significantly in structure or the types used; e.g. arrays of objects would not be handled as the code currently stands. (Edit: replaced the custom dict-to-protobuf-message parsing with google.protobuf.json_format.ParseDict.)

Example profile using this feature:

myproject:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: "{{ env_var('GCP_PROJECT') }}"
      dataset: tmp
      threads: 4
      gcs_bucket: "my-bucket"
      dataproc_region: "europe-west1"
      submission_method: serverless
      dataproc_batch:
        environment_config:
          execution_config:
            service_account: "dbt@{{ env_var('GCP_PROJECT') }}.iam.gserviceaccount.com"
            subnetwork_uri: "dataproc"
        labels:
          project: "my-project"
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: 3
            spark.driver.memory: "1g"


Torkjel Hongve added 2 commits November 9, 2022 10:16
This adds a `dataproc_batch` key for specifying the Dataproc Batch
configuration. At runtime this is used to populate the
google.cloud.dataproc_v1.types.Batch object before it is submitted to
the Dataproc service.

To avoid having to add explicit support for every option offered by the
service, and having to chase after a moving target as Google's API evolves,
this key accepts arbitrary yaml, which is mapped to the Batch object on
a best effort basis.

Signed-off-by: Torkjel Hongve <th@kinver.io>
- Make dataproc_batch key optional.
- Unit tests
- Move configuration of the `google.cloud.dataproc_v1.Batch` object
  to a separate function.

Signed-off-by: Torkjel Hongve <th@kinver.io>

torkjel commented Nov 9, 2022

@lostmygithubaccount

@lostmygithubaccount

thanks @torkjel! we may not be able to fully review this PR this week given a company-wide event, but we'll take a look and work to get this merged!

@lostmygithubaccount added the feature:python-models and triage:ready-for-review labels Nov 9, 2022
Aylr commented Nov 10, 2022

@torkjel Nice work! I'm stoked for this to get merged.


torkjel commented Nov 14, 2022

I removed the custom translation from the dataproc_batch dict to the Batch object and use ParseDict from the protobuf library instead. This makes the test_configure_dataproc_batch.py unit test a bit pointless, as it ends up just testing the protobuf library.
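
For anyone following along, here is a minimal sketch of the ParseDict-based mapping, assuming the Batch proto-plus wrapper from google.cloud.dataproc_v1 (the function name and shape are illustrative, not the exact code in this PR):

from google.cloud import dataproc_v1
from google.protobuf.json_format import ParseDict

def configure_batch_from_config(dataproc_batch: dict, batch: dataproc_v1.Batch) -> dataproc_v1.Batch:
    # Batch is a proto-plus wrapper; ParseDict operates on the underlying
    # protobuf message, which proto-plus exposes as ._pb. Fields present in
    # the dict are merged into the message in place; an unknown field name
    # or a value of the wrong type raises a ParseError.
    ParseDict(dataproc_batch, batch._pb)
    return batch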

@ChenyuLInx closed this Nov 14, 2022
@ChenyuLInx reopened this Nov 14, 2022

torkjel commented Nov 22, 2022

Hey, is there anything more I can do to move this forward?

@ChenyuLInx left a comment

@torkjel Thank you so much for adding this, and for adding unit tests too!! Really sorry that it took us a while to get back to you!

This really makes configuring serverless jobs much more flexible! Thanks!!

The changes look great to me, apart from some small issues I ran into. Let me know if you can't reproduce them.

One question regarding this configuration: do you think it might be useful to be able to configure it per model, with the default being what the profile provides?
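
As a hypothetical sketch of that question (not something this PR implements): dbt Python models already accept per-model settings via dbt.config, so a model-level dataproc_batch key could shadow the profile-level default:

def model(dbt, session):
    dbt.config(
        submission_method="serverless",
        # hypothetical model-level key; the profile's dataproc_batch
        # would remain the default when this is absent
        dataproc_batch={
            "runtime_config": {"properties": {"spark.executor.instances": "4"}}
        },
    )
    # any DataFrame-producing body works; this is just a placeholder
    return session.createDataFrame([(1, "one")], ["id", "name"])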

# Apply configuration from dataproc_batch key, possibly overriding defaults.
if self.credential.dataproc_batch:
    try:
        self._configure_batch_from_config(self.credential.dataproc_batch, batch)

This gives me

Failed to parse runtime_config field: Failed to parse properties field: expected string or bytes-like object..

when I try to add

        labels:
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: 3
            spark.driver.memory: "1g"

to my bigquery profile.

@torkjel replied:

It probably changed behaviour somewhat and broke my example when I replaced my custom parsing code with ParseDict from the protobuf library. I suspect it requires all values to be strings. I'll run some tests later tonight.
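
For reference, here is the failing snippet with the numeric value quoted; this assumes the suspicion above is right that values must be strings (runtime_config.properties is a map<string, string> in the Dataproc API, which ParseDict will only populate from string values):

        labels:
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: "3"
            spark.driver.memory: "1g"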

@ChenyuLInx

BTW @torkjel, do you mind adding a changelog entry to this PR? Instructions here.

@colin-rogers-dbt @McKnight-42 Do you know anything about why code checks are failing?


McKnight-42 commented Dec 10, 2022

@ChenyuLInx, seeing this in the logs:

dbt/adapters/bigquery/python_submissions.py:9: error: Library stubs not installed for "google.protobuf.json_format" (or incompatible with Python 3.8)  [import]
dbt/adapters/bigquery/python_submissions.py:9: note: Hint: "python3 -m pip install types-protobuf"
dbt/adapters/bigquery/python_submissions.py:9: note: (or run "mypy --install-types" to install all missing stub packages)
dbt/adapters/bigquery/python_submissions.py:9: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

Didn't we just do some protobuf work? We might need to update their branch with main.

Or it's possibly tied to using from google.protobuf.json_format import ParseDict in python_submissions.py; would we need to add protobuf or types-protobuf to dbt-bigquery's dev-requirements.txt?

@ChenyuLInx

@McKnight-42 I think the dev requirement is being added here. Should this be added in some other file?

@McKnight-42 reopened this Dec 12, 2022
@McKnight-42

@ChenyuLInx going to try to rerun the tests; it looks like it might just be an install problem. Wondering if GitHub messed up, and I want to rule that out.

@McKnight-42

Installing this branch locally shows these error messages:

ERROR: tox 4.0.8 has requirement packaging>=22, but you'll have packaging 21.3 which is incompatible.
ERROR: dbt-bigquery 1.4.0a1 has requirement protobuf<4,>=3.13.0, but you'll have protobuf 4.21.11 which is incompatible.

so we may need to pin the protobuf version.
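
For illustration, a pin in dev-requirements.txt consistent with the constraint in that error might look like this (hypothetical lines, not taken from this PR's diff):

protobuf>=3.13.0,<4
types-protobuf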

@candalfigomoro

Is there anything that can be done to move this forward?
The ability to specify a custom image for Dataproc Serverless (see #384) would be really useful.

@colin-rogers-dbt left a comment

We need to hold off on this until we have working functional tests.

hasyimibhar pushed a commit to ridebeam/dbt-bigquery-old that referenced this pull request Mar 2, 2023
@colin-rogers-dbt

Pulling changes into new PR: dbt-labs/dbt-core#7115

hasyimibhar pushed a commit to ridebeam/dbt-bigquery that referenced this pull request Mar 3, 2023
@gthomas-strike

I'd like to test out the feature of setting a network/subnetwork for the dataproc connection.

Is there an example of how to set this in a job, or in dbt Cloud config?

Labels
cla:yes, feature:python-models, triage:ready-for-review

Successfully merging this pull request may close these issues:

[CT-1336] [Feature] allow setting networkUri and subnetworkUri for Dataproc Serverless batches