Vendor Generated Protobuf Code (#3947) #3950

Merged

Conversation

tustvold
Contributor

Which issue does this PR close?

Closes #3538
Closes #3947

Rationale for this change

We don't want to require downstreams to install PROTOC, but we also don't want to allow the generated code to get out of sync. This is the approach we use with arrow-rs, and it appears to work fairly well. Specifically we don't bundle the .proto files in the released crate, and only generate the code if the .proto files exist.
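The "only generate the code if the .proto files exist" rule could be sketched as a guard in `build.rs`. This is a minimal illustration using only the standard library, not the PR's actual build script: `should_regenerate` and `plan_build` are hypothetical names, and the real code generation call is left as a comment rather than guessed at.

```rust
use std::path::Path;

/// Returns true when the `.proto` source is present, i.e. when building from
/// a repository checkout rather than from a published crate that omits it.
/// (Hypothetical helper, for illustration only.)
fn should_regenerate(proto_file: &Path) -> bool {
    proto_file.is_file()
}

/// What a `build.rs` entry point might decide based on that guard.
/// Returns a short label for the chosen action so the logic is easy to test.
fn plan_build(proto_file: &Path) -> &'static str {
    if should_regenerate(proto_file) {
        // Assumption: the real build script would invoke its code generator
        // here and overwrite the checked-in generated sources.
        "regenerate"
    } else {
        // Published crate: fall back to the vendored, checked-in code, so
        // downstream users never need protoc installed.
        "use-vendored"
    }
}
```

In a real build script, `main` would call `plan_build` with the path to the proto file and act on the result.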

What changes are included in this PR?

Are there any user-facing changes?

datafusion-proto now always has a build dependency on pbjson-build. This is a fairly lightweight dependency and makes the logic significantly easier to follow.

@tustvold
Contributor Author

@avantgardnerio What do you think about this?

@@ -129,3 +129,5 @@ Cargo.lock
.history
parquet-testing/*
*rat.txt
datafusion/proto/src/generated/pbjson.rs
Contributor Author

There is precedent in arrow for excluding generated files - https://github.com/apache/arrow-rs/blob/master/dev/release/rat_exclude_files.txt#L20

@isidentical
Contributor

@tustvold does it make sense to also include a GH action to ensure that PRs that actually touch the source also generate the latest version of the declarations? Something like the following:
https://github.com/apache/arrow-datafusion/blob/7559c4425e6f32655c6d09e8ed17c9c51896472b/.github/workflows/rust.yml#L65-L67

@avantgardnerio
Contributor

What do you think about this?

My main concern is about making forward progress for the project, and I think not having our dependencies install protoc is essential for that. Both our PRs (see 3948) accomplish that goal.

My secondary concern is for general project maintainability, and for that I think there are issues either way:

checked in pros:

  1. it's less complex from a technical standpoint
  2. it's easier to ensure docs.rs works

checked in cons:

  1. different versions of protoc will generate different output
  2. we'll need to document how to run protoc and keep it up to date
  3. formatters will create git conflicts
  4. the proto files and the checked in code can get out of sync
  5. users might edit the generated code accidentally

generated pros:

  1. the "download the binary and unzip" should be transparent to most users
  2. anyone is free to use their own compiler
  3. redundant derived data is not stored in source control
  4. it is simpler from an operational perspective

generated cons:

  1. it is more complex technically
  2. it requires maintenance (as does checking in)

I must highlight though that I am concerned this does not resolve #3538 as that appears to be working as per my comment there.

@andygrove
Member

I am ok with either approach, but I think I would prefer to check in the generated sources rather than have build.rs download protoc. It seems like this could be a security risk and also adds complexity. I don't want to try and support a user who hits some issue with protoc not working on their system, though I am possibly just trying to avoid extra work here. 😅

> formatters will create git conflicts

We can configure rustfmt to ignore the generated files (https://rust-lang.github.io/rustfmt/?version=v1.5.1&search=#ignore). @tustvold would you be ok with adding that to this PR?

> the proto files and the checked-in code can get out of sync

We should have CI prevent that from happening. Would be good to include that in this PR if possible.

> users might edit the generated code accidentally

This is bound to happen, but the files are in a generated directory, so I think people will learn pretty quickly not to do this.
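One way to express the formatter exception discussed above directly in source is the stable `#[rustfmt::skip]` attribute. Note that this is a different mechanism from the `ignore` configuration option linked in the comment, and it is not necessarily what the project uses. The module name and its contents below are hypothetical stand-ins for generated code.

```rust
// rustfmt leaves items carrying this attribute exactly as written, so
// regenerated output does not produce formatting-only diffs.
#[rustfmt::skip]
mod generated {
    // Deliberately unformatted, as machine-generated code often is.
    pub fn field_count() -> u32 { 3 }
}
```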

@avantgardnerio
Contributor

I'd approve this PR if:

  1. generated files have a formatter exception
  2. there's clear doc on how to run protoc in a readme somewhere

@andygrove
Member

Can we provide a Dockerfile with protoc for generating the code?

Contributor

@avantgardnerio avantgardnerio left a comment

It seems like everyone is leaning towards this solution. As stated above, I think the main concern is that dependents don't have to install protoc, and this PR achieves that goal, so I'm approving and closing mine.

@tustvold
Contributor Author

tustvold commented Oct 25, 2022

> generated files have a formatter exception

Already in place

> there's clear doc on how to run protoc in a readme somewhere

This is still automatically run as part of build.rs, I will add some docs though

> different versions of protoc will generate different output

Protoc is only used to generate a descriptor set, i.e. to parse the proto files. It does not generate the Rust code; that is done by prost-build. The output should therefore be relatively stable.

> dependents don't have to install protoc

I suspect dependencies by git sha may still need protoc installed - as this is effectively a source dependency I think this is fine

> We should have CI prevent that from happening

I will add a CI check here and to arrow-rs
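The idea behind such a CI check could be sketched as follows: regenerate the code, then fail if it differs from what is checked in. This is only an illustration with a hypothetical helper name; a real CI job might more simply rebuild and then run `git diff --exit-code` on the generated directory.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Compares freshly generated code against the checked-in file.
/// Returns Ok(true) when the contents are identical, Ok(false) when the
/// checked-in file is stale, and an error if the file cannot be read.
/// (Hypothetical helper, for illustration only.)
fn generated_code_in_sync(checked_in: &Path, regenerated: &str) -> io::Result<bool> {
    let on_disk = fs::read_to_string(checked_in)?;
    Ok(on_disk == regenerated)
}
```

A CI step would call this for each generated file and exit with a non-zero status on `Ok(false)`.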

@@ -81,16 +81,6 @@ jobs:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - name: Install protobuf compiler
Contributor Author

This no longer appears to be necessary: the version in the system package manager is recent enough (only 2 years out of date) 😆

```
$ brew install protobuf
```

You will want to verify the version installed is `3.12` or greater, which introduced support for explicit [field presence](https://github.com/protocolbuffers/protobuf/blob/v3.12.0/docs/field_presence.md). Older versions may fail to compile.
Contributor Author

@tustvold tustvold Oct 25, 2022

DataFusion doesn't actually use this feature, for good reason imo, but this is the only major change to core protoc functionality that may cause compatibility issues for us. More importantly, 3.12 is the version currently used in CI.
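The "3.12 or greater" requirement could be checked programmatically along these lines. This sketch assumes `protoc --version` prints text shaped like `libprotoc 3.12.4`; both function names are hypothetical, and obtaining the output (for example by running the command) is left to the caller.

```rust
/// Parses text shaped like "libprotoc 3.12.4" into a (major, minor) pair.
/// Returns None if the text does not have that shape.
fn parse_protoc_version(output: &str) -> Option<(u32, u32)> {
    let version = output.trim().strip_prefix("libprotoc")?.trim();
    let mut parts = version.split('.');
    let major = parts.next()?.parse().ok()?;
    let minor = parts.next()?.parse().ok()?;
    Some((major, minor))
}

/// True when the reported version is at least 3.12. Tuples compare
/// lexicographically, so the major number is compared before the minor.
fn is_supported_protoc(output: &str) -> bool {
    matches!(parse_protoc_version(output), Some(v) if v >= (3, 12))
}
```

Unparseable output is treated as unsupported here, which is the conservative choice for a pre-build check.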

Member

@andygrove andygrove left a comment

LGTM. Thanks @tustvold and @avantgardnerio

@tustvold tustvold merged commit 11c5255 into apache:master Oct 27, 2022
@ursabot

ursabot commented Oct 27, 2022

Benchmark runs are scheduled for baseline = fa5cd7f and contender = 11c5255. 11c5255 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Dandandan pushed a commit to yuuch/arrow-datafusion that referenced this pull request Nov 5, 2022
* Vendor generated protobuf code (apache#3947)

* RAT

* Fix build without json

* Review feedback

* Doc tweak

* Fix Arch install instructions
Development

Successfully merging this pull request may close these issues.

Don't make dependants install protoc
docs.rs cannot build datafusion-proto crate
5 participants