Added proto definitions for Data Preparations #1788
Conversation
protos/core.proto
Outdated
message TableReference {
  string project = 1;
  string dataset = 2;
  string table = 3;
}
It seems strange to duplicate the table target like this. Target is already defined:
Line 47 in 97123fa
message Target {
It looks similar, it just uses BigQuery specific terminology. Data Preparation is based on a BigQuery Source/Destination, thus we defined the protos with BQ specific terminology.
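For illustration, a rough side-by-side of the two shapes. The Target field names and numbering below are assumptions about the generic message rather than a copy of the linked definition; TableReference is taken from this PR:

// Illustrative comparison only - Target's fields/numbering here are an assumption.

// Generic warehouse terminology used by Dataform actions:
message Target {
  string database = 1;  // BigQuery: project
  string schema = 2;    // BigQuery: dataset
  string name = 3;      // BigQuery: table
}

// BigQuery-specific terminology used by Data Preparations (this PR):
message TableReference {
  string project = 1;
  string dataset = 2;
  string table = 3;
}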
Definitely confusing for future readers! (fixed by separate proto suggestion)
protos/core.proto
Outdated
message DataPreparationDefinition {
  repeated DataPreparationNode nodes = 1;
  DataPreparationGenerated generated = 2;
}
These values need to be populated from somewhere - I'd imagine that the config proto needs more options that roughly correspond to the data in here?
Yes, the Data Preparation YAML file is formatted according to this proto. During processing, we will parse the data preparation YAML and build this proto object from the parse result.
Just to clear up any confusion - parsing YAML files as objects is the job of the config.proto, not the core.proto.
protos/core.proto
Outdated
message DataPreparationDefinition {
  repeated DataPreparationNode nodes = 1;
  DataPreparationGenerated generated = 2;
}
I don't really understand why this is structured in this way - with nested nodes.
There are many downsides to having a node that contains other nested nodes:
- What prevents it from becoming a cyclic dag?
- How would one represent this in a UI?
It would be much better to keep each node as separate actions instead.
The DAG may contain branches or nodes that join 2 other nodes. The DAG is structured in a way that a node always points into a Source:
- A BigQuery Table
- 1 previous node
- 2 previous nodes, which are combined using a SQL JOIN operation
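To make that shape concrete, here is a minimal proto sketch of the nested structure being described. Every message and field name below (SketchedNode, NodeSource, JoinSource) is an illustrative assumption, not the PR's actual definition, and it assumes the TableReference message shown earlier in this thread:

// Illustrative sketch only - names and field numbers are assumptions.
message SketchedNode {
  string name = 1;
  NodeSource source = 2;
}

message NodeSource {
  oneof source {
    TableReference table = 1;   // reads directly from a BigQuery table
    SketchedNode previous = 2;  // reads from 1 previous (nested) node
    JoinSource join = 3;        // combines 2 previous nodes with a SQL JOIN
  }
}

message JoinSource {
  SketchedNode left = 1;
  SketchedNode right = 2;
  string join_condition = 3;    // the SQL JOIN predicate
}

Because the nesting is by value, a cycle cannot be expressed at all - every chain of sources has to bottom out at a TableReference - which is one way to read the cyclic-DAG concern raised above.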
IIUC this is essentially a copy of the public proto. Ideally we want to keep this in a healthy DAG structure, as described in my comment about avoiding nodes that contain nested nodes.
Ideally we would destructure it to preserve the compiled graph structure, then restructure it at execution time. For example, the execution proto could contain essentially what has been put here currently.
(note this is the execution proto for the CLI - internally we have our own protos, but they look very similar).
dataform/protos/execution.proto
Line 32 in 97123fa
message ExecutionAction {
I'm interested in your thoughts on this!
From chat outside the PR - agree that keeping it as it is, as a nested DAG, is the best way forwards!
string name = 1;

// Targets of actions that this action is dependent on.
repeated Target dependency_targets = 2;
You can just call it dependencies, as the type is already Target - no need to repeat it in the field name.
dependencies is a better name, but with the other action types dependencyTargets is used to keep backwards compatibility with old versions - it would be best to stay consistent with them.
More context:
- SQLX action blocks have a config option dependencies: https://github.com/dataform-co/dataform/blob/97123faf432781480bee6f888a4881d934aff349/core/actions/table.ts#L218C6-L218C18
- This SQLX config option is of type Resolvable, which is either a Target (as a JS object) or a string:
  dataform/core/actions/table.ts, Line 579 in 97123fa
  public dependencies(value: Resolvable | Resolvable[]) {
  Proto-based configs, however, are strictly the object type.
- In unifying the SQLX and proto-based definitions, we map from the string type to the object type:
  dataform/core/actions/assertion.ts, Line 202 in 97123fa
  if (unverifiedConfig.dependencies) {
In general I'm happy with this now, but one main change: I think DataPreparationDefinition and its child definitions should be moved to their own proto file (rough sketch after this list):
- It removes confusion about why we have multiple targets that look identical (agreed that it does make sense for them to be different protos), and why these nodes are different to Dataform's "action" concept.
- You can copybara the internal proto out (or in), so that keeping them in sync is easier.
- It seems to me that you'll always want the data prep config and what's sent in the compiled graph to be identical.
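As a rough illustration of the suggested split - the file name, package, and the placeholder bodies for DataPreparationNode and DataPreparationGenerated below are assumptions; only TableReference and DataPreparationDefinition are copied from the snippets above:

// protos/data_preparation.proto (hypothetical file name)
syntax = "proto3";

package dataform;

message TableReference {
  string project = 1;
  string dataset = 2;
  string table = 3;
}

// Placeholder body - the real fields live in this PR.
message DataPreparationNode {
  string name = 1;
}

// Placeholder body - the real fields live in this PR.
message DataPreparationGenerated {
}

message DataPreparationDefinition {
  repeated DataPreparationNode nodes = 1;
  DataPreparationGenerated generated = 2;
}

core.proto (and the compiled-graph protos) could then import this file, and the same file could be copybara'd in or out to keep it in sync with the internal definition.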
* Added proto definitions for Data Preparations
* Moved Data preparation protos into a separate file
No description provided.