Added proto definitions for Data Preparations #1788

Merged
merged 2 commits into from
Jul 18, 2024

Collaborator

@fernst fernst commented Jul 17, 2024

No description provided.

@fernst fernst requested review from chtyim and Ekrekr July 17, 2024 18:07
Comment on lines 352 to 356
message TableReference {
  string project = 1;
  string dataset = 2;
  string table = 3;
}
Contributor

It seems strange to duplicate the table target like this. Target is already defined:

message Target {

Collaborator Author

It looks similar, but it uses BigQuery-specific terminology. Data Preparation is based on a BigQuery source/destination, so we defined the protos with BigQuery-specific terminology.

Contributor

@Ekrekr Ekrekr Jul 18, 2024

Definitely confusing for future readers! (fixed by separate proto suggestion)

Comment on lines 268 to 271
message DataPreparationDefinition {
  repeated DataPreparationNode nodes = 1;
  DataPreparationGenerated generated = 2;
}
Contributor

These values need to be populated from somewhere. I'd imagine that the config proto needs more options that roughly correspond to the data in here?

Collaborator Author

Yes, the Data Preparation YAML file is formatted according to this proto. During processing, we will parse the data preparation YAML and build this proto object from the parse result.
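
As a rough illustration of that flow, here is a minimal Python sketch that builds a definition object from an already-parsed YAML document. It assumes the YAML parser has produced a plain dict. Only `nodes` and `generated` come from the proto quoted above; the `id` field on a node and the `build_definition` helper are hypothetical, and the real processing code may look quite different.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataPreparationNode:
    # Hypothetical field: the real node message is not shown in this thread.
    id: str


@dataclass
class DataPreparationDefinition:
    # Mirrors `repeated DataPreparationNode nodes = 1;`
    nodes: List[DataPreparationNode] = field(default_factory=list)
    # Mirrors `DataPreparationGenerated generated = 2;`, kept opaque here.
    generated: Optional[Dict[str, Any]] = None


def build_definition(parsed: Dict[str, Any]) -> DataPreparationDefinition:
    """Build a definition from the dict a YAML parser would return."""
    # Reject keys the proto does not define, so typos surface early.
    unknown = set(parsed) - {"nodes", "generated"}
    if unknown:
        raise ValueError(f"unknown top-level keys: {sorted(unknown)}")
    nodes = [DataPreparationNode(id=n["id"]) for n in parsed.get("nodes", [])]
    return DataPreparationDefinition(nodes=nodes, generated=parsed.get("generated"))
```

Rejecting unknown keys is a design choice in this sketch, not something the thread specifies.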

Contributor

@Ekrekr Ekrekr Jul 18, 2024

Just to clear up any confusion - parsing YAML files as objects is the job of the config.proto, not the core.proto

Comment on lines 268 to 271
message DataPreparationDefinition {
  repeated DataPreparationNode nodes = 1;
  DataPreparationGenerated generated = 2;
}
Contributor

I don't really understand why this is structured in this way - with nested nodes.

There are many downsides to having a node that contains other nested nodes:

  • What prevents the graph from becoming cyclic?
  • How would one represent this in a UI?

It would be much better to keep each node as separate actions instead.
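
To make the first question concrete, one way to guard against cycles is to validate the graph when it is compiled. The following is a minimal sketch, assuming each node is identified by a string and lists the ids of the nodes it reads from. That mapping and the `find_cycle` name are illustrative and not part of the proposed proto.

```python
from typing import Dict, List, Optional


def find_cycle(sources: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of node ids, or None if the graph is acyclic.

    `sources` maps a node id to the ids of the nodes it reads from
    (a hypothetical flattened view of the node graph).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {node: WHITE for node in sources}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GREY
        path.append(node)
        for dep in sources.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # Back edge: the cycle is the tail of the current path.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = BLACK
        return None

    for node in sources:
        if state[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None
```

Ids that appear only as dependencies (for example, source tables) are treated as leaves with no upstream nodes.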

Collaborator Author

The DAG may contain branches, or nodes that join 2 other nodes. It is structured so that each node always points to a source, which is one of:

  • A BigQuery Table
  • 1 previous node
  • 2 previous nodes, which are combined using a SQL JOIN operation
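
The three source kinds listed above can be sketched as a small tagged union. This is illustrative Python only: the class names `TableSource`, `NodeSource`, and `JoinSource` and the `upstream_nodes` helper are assumptions, and the actual proto may model this differently (for example with a `oneof`).

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class TableSource:
    # A BigQuery table, using the project/dataset/table terms from TableReference.
    project: str
    dataset: str
    table: str


@dataclass(frozen=True)
class NodeSource:
    # Exactly one previous node, referenced by a hypothetical node id.
    node_id: str


@dataclass(frozen=True)
class JoinSource:
    # Two previous nodes combined with a SQL JOIN.
    left_node_id: str
    right_node_id: str


Source = Union[TableSource, NodeSource, JoinSource]


def upstream_nodes(source: Source) -> List[str]:
    """Return the ids of the previous nodes a source depends on."""
    if isinstance(source, TableSource):
        return []  # a table is a leaf: no upstream nodes
    if isinstance(source, NodeSource):
        return [source.node_id]
    return [source.left_node_id, source.right_node_id]
```

Because every source resolves to zero, one, or two upstream nodes, branches and joins are both expressible while each node still has a single source.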

Contributor

IIUC this is essentially a copy of the public proto. Ideally we want to keep this in a healthy DAG structure, as described in my comment about avoiding nodes that contain nested nodes.

Ideally we would destructure it to preserve the compiled graph structure, then restructure it at execution time. For example, the execution proto could contain essentially what has been put here currently.

(Note: this is the execution proto for the CLI. Internally we have separate protos, but they look very similar.)

message ExecutionAction {

I'm interested in your thoughts on this!
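
The destructure-then-restructure idea in this comment can be sketched roughly as follows. The nested shape used here (`{"name": ..., "sources": [...]}`) and the flat `dependencies` list are hypothetical stand-ins for the proposed protos, and the sketch assumes node names are unique.

```python
from typing import Any, Dict, List


def destructure(node: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Flatten a nested node into flat actions that reference dependencies by name."""
    flat: Dict[str, Dict[str, Any]] = {}

    def walk(current: Dict[str, Any]) -> None:
        children = current.get("sources", [])
        for child in children:
            walk(child)
        # Children are recorded first, so the result is in dependency order.
        flat.setdefault(
            current["name"],
            {"name": current["name"], "dependencies": [c["name"] for c in children]},
        )

    walk(node)
    return list(flat.values())


def restructure(actions: List[Dict[str, Any]], root: str) -> Dict[str, Any]:
    """Rebuild the nested form from flat actions, e.g. at execution time."""
    by_name = {action["name"]: action for action in actions}

    def build(name: str) -> Dict[str, Any]:
        deps = by_name[name]["dependencies"]
        return {"name": name, "sources": [build(dep) for dep in deps]}

    return build(root)
```

In this sketch the flat list plays the role of the compiled graph, and the nested form is only rebuilt when an execution layer needs it.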

Contributor

From chat outside the PR - agree that keeping it as it is, as a nested DAG, is the best way forwards!

string name = 1;

// Targets of actions that this action is dependent on.
repeated Target dependency_targets = 2;

You can just call it dependencies as the type is already Target, no need to repeat it in the field name.

Contributor

dependencies is a better name, but with the other action types dependencyTargets is used to keep backwards compatibility with old versions - it would be best to stay consistent with them.

More context:

Contributor

@Ekrekr Ekrekr left a comment

In general I'm happy with this now, but I'd ask for one main change: I think DataPreparationDefinition and its child definitions should be moved to their own proto file:

  • It removes confusion about why we have multiple targets that look identical (agreed that it does make sense for them to be different protos), and why these nodes are different to Dataform's "action" concept.
  • You can copybara the internal proto out (or in), so that keeping them in sync is easier.
  • It seems to me that you'll always want the data prep config and what's sent in the compiled graph to be identical.


@fernst fernst merged commit 6635abc into main Jul 18, 2024
4 checks passed
@fernst fernst deleted the data-preparation-proto branch July 18, 2024 19:51
bmagyarkuti pushed a commit to bmagyarkuti/dataform that referenced this pull request Jul 23, 2024
* Added proto definitions for Data Preparations

* Moved Data preparation protos into a separate file
3 participants