
Feature/configure serverless dataproc batch #381

Conversation

@torkjel commented Nov 9, 2022

resolves #350
and supersedes #372

Description

This adds support for configuring all input values of a GCP serverless Dataproc batch job. It does so by adding a dataproc_batch configuration key which can hold arbitrary configuration. It's the user's responsibility to ensure this corresponds to the structure of the google.cloud.dataproc.v1.Batch object (https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.Batch).

By not validating this configuration in dbt, one automatically gains full coverage of current features and support for new features as they are made available. One example is the ExecutionConfig.idle_ttl property, which is in the Dataproc API but not yet in the Python client libs.

If the user attempts to set illegal configuration, this will typically cause an exception:

  • when updating the Batch object, if the configuration structure does not match the Batch class;
  • when submitting the job, if the value of a configuration setting is illegal.

One risk with this approach is that the code which reconciles the yaml configuration with the google.cloud.dataproc.v1.Batch object might not correctly handle future additions if they differ significantly in structure or the types used; e.g. arrays of objects would not be handled as the code currently stands. (Edit: replaced the custom dict-to-protobuf-message parsing with google.protobuf.json_format.ParseDict.)

Example profile using this feature:

myproject:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: "{{ env_var('GCP_PROJECT') }}"
      dataset: tmp
      threads: 4
      gcs_bucket: "my-bucket"
      dataproc_region: "europe-west1"
      submission_method: serverless
      dataproc_batch:
        environment_config:
          execution_config:
            service_account: "dbt@{{ env_var('GCP_PROJECT') }}.iam.gserviceaccount.com"
            subnetwork_uri: "dataproc"
        labels:
          project: "my-project"
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: 3
            spark.driver.memory: "1g"


Torkjel Hongve added 2 commits November 9, 2022 10:16
This adds a `dataproc_batch` key for specifying the Dataproc Batch
configuration. At runtime this is used to populate the
google.cloud.dataproc_v1.types.Batch object before it is submitted to
the Dataproc service.

To avoid having to add explicit support for every option offered by the
service, and having to chase after a moving target as Google's API evolves,
this key accepts arbitrary yaml, which is mapped to the Batch object on
a best effort basis.

Signed-off-by: Torkjel Hongve <th@kinver.io>
- Make dataproc_batch key optional.
- Unit tests
- Move configuration of the `google.cloud.dataproc_v1.Batch` object
  to a separate function.

Signed-off-by: Torkjel Hongve <th@kinver.io>

torkjel commented Nov 9, 2022

@lostmygithubaccount

@lostmygithubaccount

thanks @torkjel! we may not be able to fully review this PR this week given a company-wide event, but we'll take a look and work to get this merged!

@lostmygithubaccount added the feature:python-models and triage:ready-for-review labels Nov 9, 2022
Aylr commented Nov 10, 2022

@torkjel Nice work! I'm stoked for this to get merged.


torkjel commented Nov 14, 2022

I removed the custom translation from the dataproc_batch dict to the Batch object and use ParseDict from the protobuf library instead. This makes the test_configure_dataproc_batch.py unit test a bit pointless, as it ends up just testing the protobuf library.
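
For anyone following along, here is a minimal sketch of the ParseDict-based mapping, assuming the Batch proto-plus wrapper from google.cloud.dataproc_v1 (the function name and shape are illustrative, not the exact code in this PR):

from google.cloud import dataproc_v1
from google.protobuf.json_format import ParseDict

def configure_batch_from_config(dataproc_batch: dict, batch: dataproc_v1.Batch) -> dataproc_v1.Batch:
    # Batch is a proto-plus wrapper; ParseDict operates on the underlying
    # protobuf message, which proto-plus exposes as ._pb. Fields present in
    # the dict are merged into the message in place; an unknown field name
    # or a value of the wrong type raises a ParseError.
    ParseDict(dataproc_batch, batch._pb)
    return batch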

@ChenyuLInx closed this Nov 14, 2022
@ChenyuLInx reopened this Nov 14, 2022

torkjel commented Nov 22, 2022

Hey, is there anything more I can do to move this forward?

@ChenyuLInx left a comment

@torkjel Thank you so much for adding this, and for adding unit tests too!! Really sorry that it took us a while to get back to you!

This really makes configuring serverless jobs much more flexible! Thanks!!

The changes look great to me, apart from some small issues I ran into. Let me know if you can't reproduce them.

One question regarding this configuration: do you think it might be useful to be able to configure it per model, with the default being what the profile provides?
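
As a hypothetical sketch of that question (not something this PR implements): dbt Python models already accept per-model settings via dbt.config, so a model-level dataproc_batch key could shadow the profile-level default:

def model(dbt, session):
    dbt.config(
        submission_method="serverless",
        # hypothetical model-level key; the profile's dataproc_batch
        # would remain the default when this is absent
        dataproc_batch={
            "runtime_config": {"properties": {"spark.executor.instances": "4"}}
        },
    )
    # any DataFrame-producing body works; this is just a placeholder
    return session.createDataFrame([(1, "one")], ["id", "name"])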

# Apply configuration from dataproc_batch key, possibly overriding defaults.
if self.credential.dataproc_batch:
    try:
        self._configure_batch_from_config(self.credential.dataproc_batch, batch)

This gives me

Failed to parse runtime_config field: Failed to parse properties field: expected string or bytes-like object..

when I try to add

        labels:
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: 3
            spark.driver.memory: "1g"

to my bigquery profile.

@torkjel replied:

It probably changed behaviour somewhat and broke my example when I replaced my custom parsing code with ParseDict from the protobuf library. I suspect it requires all values to be strings. I'll run some tests later tonight.
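
For reference, here is the failing snippet with the numeric value quoted; this assumes the suspicion above is right that values must be strings (runtime_config.properties is a map<string, string> in the Dataproc API, which ParseDict will only populate from string values):

        labels:
          role: "dev"
        runtime_config:
          properties:
            spark.executor.instances: "3"
            spark.driver.memory: "1g"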

@ChenyuLInx

BTW @torkjel, do you mind adding a changelog entry to this PR? Instructions here.

@colin-rogers-dbt @McKnight-42 Do you know anything about why code checks are failing?


McKnight-42 commented Dec 10, 2022

@ChenyuLInx, seeing this in the logs:

dbt/adapters/bigquery/python_submissions.py:9: error: Library stubs not installed for "google.protobuf.json_format" (or incompatible with Python 3.8)  [import]
dbt/adapters/bigquery/python_submissions.py:9: note: Hint: "python3 -m pip install types-protobuf"
dbt/adapters/bigquery/python_submissions.py:9: note: (or run "mypy --install-types" to install all missing stub packages)
dbt/adapters/bigquery/python_submissions.py:9: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#missing-imports

Didn't we just do some protobuf work? We might need to update their branch with main.

Or it's possibly tied to using from google.protobuf.json_format import ParseDict in python_submissions.py; would we need to add protobuf or types-protobuf to dbt-bigquery's dev-requirements.txt?

@ChenyuLInx

@McKnight-42 I think the dev requirement is being added here. Should this be added in some other file?

@McKnight-42 reopened this Dec 12, 2022
@McKnight-42

@ChenyuLInx going to try to rerun the tests; it looks like it might just be an install problem. Wondering if GitHub messed up, and I want to rule that out.

@McKnight-42

Installing this branch locally shows these error messages:

ERROR: tox 4.0.8 has requirement packaging>=22, but you'll have packaging 21.3 which is incompatible.
ERROR: dbt-bigquery 1.4.0a1 has requirement protobuf<4,>=3.13.0, but you'll have protobuf 4.21.11 which is incompatible.

so we may need to pin the protobuf version.
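
For illustration, a pin in dev-requirements.txt consistent with the constraint in that error might look like this (hypothetical lines, not taken from this PR's diff):

protobuf>=3.13.0,<4
types-protobuf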

@candalfigomoro

Is there anything that can be done to move this forward?
The ability to specify a custom image for Dataproc Serverless (see #384) would be really useful.

@colin-rogers-dbt left a comment

We need to hold off on this until we have working functional tests.

hasyimibhar pushed a commit to ridebeam/dbt-bigquery-old that referenced this pull request Mar 2, 2023
@colin-rogers-dbt

Pulling changes into new PR: dbt-labs/dbt-core#7115

hasyimibhar pushed a commit to ridebeam/dbt-bigquery that referenced this pull request Mar 3, 2023
@gthomas-strike

I'd like to test out the feature of setting a network/subnetwork for the dataproc connection.

Is there an example of how to set this in a job, or in dbt Cloud config?

Labels
cla:yes, feature:python-models, triage:ready-for-review

Successfully merging this pull request may close these issues:

[CT-1336] [Feature] allow setting networkUri and subnetworkUri for Dataproc Serverless batches