Feature/configure serverless dataproc batch #381
Conversation
This adds a `dataproc_batch` key for specifying the Dataproc Batch configuration. At runtime this is used to populate the google.cloud.dataproc_v1.types.Batch object before it is submitted to the Dataproc service. To avoid having to add explicit support for every option offered by the service, and having to chase after a moving target as Google's API evolves, this key accepts arbitrary yaml, which is mapped to the Batch object on a best effort basis. Signed-off-by: Torkjel Hongve <th@kinver.io>
- Make dataproc_batch key optional. - Unit tests - Move configuration of the `google.cloud.dataproc_v1.Batch` object to a separate function. Signed-off-by: Torkjel Hongve <th@kinver.io>
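To make the "arbitrary yaml mapped onto the Batch object" idea concrete, here is a minimal sketch of how such a mapping can be done with the protobuf library. The helper name and the proto-plus `._pb` accessor are illustrative assumptions, not necessarily this PR's exact code:

```python
from google.cloud.dataproc_v1 import Batch
from google.protobuf.json_format import ParseDict


def configure_batch_from_config(raw_config: dict, target: Batch) -> Batch:
    # ParseDict maps an arbitrary dict onto the underlying protobuf message
    # on a best-effort basis; unknown or mistyped fields raise
    # json_format.ParseError. Proto-plus wrappers expose the raw protobuf
    # message via `_pb`.
    ParseDict(raw_config, target._pb)
    return target


batch = Batch()
configure_batch_from_config(
    {
        "labels": {"role": "dev"},
        "runtime_config": {"properties": {"spark.driver.memory": "1g"}},
    },
    batch,
)
```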
thanks @torkjel! we may not be able to fully review this PR this week given a company-wide event, but we'll take a look and work to get this merged!
@torkjel Nice work! I'm stoked for this to get merged.
I removed the custom translation from
Hey, is there anything more I can do to move this forward? |
@torkjel Thank you so much for adding this and also adding unit tests!! Really sorry that it took us a while to get back to you!
This really makes configuring serverless jobs much more flexible! Thanks!!
The changes look great to me, with some small issues I ran into. Let me know if you can't reproduce them.
One question regarding this configuration: do you think it might be useful to configure it per model, with the profile providing the defaults?
    # Apply configuration from dataproc_batch key, possibly overriding defaults.
    if self.credential.dataproc_batch:
        try:
            self._configure_batch_from_config(self.credential.dataproc_batch, batch)
This gives me

    Failed to parse runtime_config field: Failed to parse properties field: expected string or bytes-like object..

when I try to add

    labels:
      role: "dev"
    runtime_config:
      properties:
        spark.executor.instances: 3
        spark.driver.memory: "1g"

to my bigquery profile.
It probably changed behaviour somewhat and broke my example when I replaced my custom parsing code with `ParseDict` from the protobuf library. I suspect it requires all values to be strings. I'll run some tests later tonight.
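For reference, if `ParseDict` does require string values here (runtime properties are a `map<string, string>` in the Batch proto), quoting the numeric value should make the snippet above parse. An untested, illustrative variant:

```yaml
dataproc_batch:
  labels:
    role: "dev"
  runtime_config:
    properties:
      spark.executor.instances: "3"
      spark.driver.memory: "1g"
```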
BTW @torkjel, do you mind adding a changelog to this PR? Instructions here. @colin-rogers-dbt @McKnight-42 do you know anything about why the code checks are failing?
@ChenyuLInx seeing this in the logs
didn't we just do some protobuf stuff? might need to update their branch with main possibly? or possibly tied to using from
@McKnight-42 I think the dev requirement is being added here. Should this be added in some other file?
@ChenyuLInx going to try and rerun tests, looks like it might just be an install problem? wondering if GitHub messed up, want to rule it out
installing this branch locally shows error messages
so we may need to pin the protobuf version
Is there anything that can be done to move this forward?
We need to hold off on this until we have working functional tests.
Pulling changes into new PR: dbt-labs/dbt-core#7115
I'd like to test out the feature of setting a network/subnetwork for the dataproc connection. Is there an example of how to set this in a job, or in dbt Cloud config?
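No example was given in the thread; purely as an illustration, since `dataproc_batch` passes through to the Batch proto, a subnetwork would presumably be set under `environment_config.execution_config` (all resource names below are placeholders):

```yaml
dataproc_batch:
  environment_config:
    execution_config:
      # placeholder project/region/subnet names
      subnetwork_uri: "projects/my-project/regions/us-central1/subnetworks/my-subnet"
```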
resolves #350
and supersedes #372
Description
This adds support for configuring all input values of a GCP serverless dataproc batch job. It does so by adding a `dataproc_batch` configuration key which can hold arbitrary configuration. It's the user's responsibility to ensure this corresponds to the structure of the https://cloud.google.com/dataproc-serverless/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.Batch object.

By not validating this configuration in dbt, one automatically gains full coverage of current features and support for new features as they are made available. One example of this is the ExecutionConfig.idle_ttl property, which is in the dataproc api but not yet in the python client libs.

If the user attempts to set an illegal configuration, this will typically cause an exception when populating the `Batch` object, if the configuration structure does not match the `Batch` class.

One risk with this approach is that the code which reconciles the yaml configuration and the `google.cloud.dataproc.v1.Batch` object might not correctly handle future additions if they differ significantly in structure or types used. E.g. arrays of objects would not be handled as the code currently stands. (Replaced custom dict-to-protobuf-message parsing with `google.protobuf.json_format.ParseDict`.)

Example profile using this feature:
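The example profile itself was not captured in this excerpt; the following is a reconstructed sketch only. Profile fields such as `gcs_bucket`, `dataproc_region`, and `submission_method` follow dbt-bigquery's serverless Python-model configuration, and all names are placeholders:

```yaml
my-bigquery-profile:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth
      project: my-gcp-project   # placeholder
      dataset: my_dataset       # placeholder
      gcs_bucket: my-dbt-bucket # placeholder; used for serverless python models
      dataproc_region: us-central1
      submission_method: serverless
      dataproc_batch:
        labels:
          role: dev
        runtime_config:
          properties:
            # map<string, string>: values must be quoted strings
            spark.executor.instances: "3"
            spark.driver.memory: "1g"
```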
Checklist

- Run `changie new` to create a changelog entry