Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use released version of arrow #80

Closed
tustvold opened this issue Apr 26, 2021 · 7 comments
Closed

Use released version of arrow #80

tustvold opened this issue Apr 26, 2021 · 7 comments
Labels
enhancement New feature or request

Comments

@tustvold
Copy link
Contributor

Currently the Datafusion crates pin a version of arrow by git revision. Unfortunately this makes it complicated to use Datafusion with other libraries that also depend on arrow, making it is necessary to update all the version pins in lock-step across potentially multiple repositories. #39 helps somewhat with this, but doesn't help with crates that Datafusion isn't dependent on that themselves depend on arrow, or if you want to configure different feature flags from what Datafusion sets.

Unfortunately there is an outstanding bug /feature omission with cargo that makes patch not work for overriding git version pins, and the workarounds on the ticket no longer seem to work.

If Datafusion instead depended instead on a released version of arrow, Cargo's semver compatibility rules would avoid a lot of this pain for most users, and would allow users to use the patch syntax to override the arrow version in use for all their dependencies within their workspace, if they want to use an unreleased arrow version.

I'm not sure how feasible any of this is, but thought I would raise a ticket to maybe kickstart a discussion on this

@tustvold tustvold added the enhancement New feature or request label Apr 26, 2021
@andygrove
Copy link
Member

@tustvold Does #39 help with this?

@tustvold
Copy link
Contributor Author

@tustvold Does #39 help with this?

It helps, but doesn't solve the underlying issue. If you depend on another crate that in turn depends on arrow and isn't exposed by DataFusion, e.g. arrow-flight, or you want to set different features from what DataFusion sets, you end up having to replicate the exact version pins from DataFusion into all other crates

@jorgecarleitao
Copy link
Member

I think that the general problem is that we pin arrow (and many others) in datafusion; datafusion is a library and it should thus avoid pinning dependencies.

Instead, it should bracket them, via e.g. ^3.0.0, so that consumers of the library may use a different version of any of its dependencies, for as long as they are compatible, and have cargo find a valid dependency version between what the consumer wants and what datafusion requires.

As it stands, consumers must use the exact same version of arrow that datafusion uses or cargo will pick two different arrow versions. This happens because Cargo cannot guarantee that the two different arrow versions (what datafusion demands and what the consumer wants) are ABI compatible. Consumers can't pass structs from a version of arrow (that they use) to another version of arrow (that datafusion uses).

Note that in this context a different feature set corresponds to a different version, as cargo has no way of knowing whether a feature will retain ABI compatibility.

So, I think the ask here is:

  1. datafusion writes something like ^4.0.0 on its cargo.toml
  2. we release arrow more often and strictly according to semver

Is this the idea, @tustvold ?

I am not sure whether we could get away with the multiple paths approach, e.g.

arrow = { git = "https://github.com/arrow-rs/arrow", version = "^3.0.0" }

@tustvold
Copy link
Contributor Author

datafusion writes something like ^4.0.0 on its cargo.toml

Yes, this would be my ideal for the reasons you articulated.

we release arrow more often and strictly according to semver

This would definitely be something I'd be supportive of, but possibly somewhat tangential to making DataFusion as a library easier to consume. Cargo has a good story for overriding dependencies within a workspace, including indirect dependencies, provided those versions aren't pinned within the libraries. Therefore if DataFusion were to move to a released version of arrow it wouldn't preclude users from opting-in to newer, potentially unreleased versions of arrow.

However, DataFusion itself would not be able to opt-in to unreleased arrow functionality, and so if there are frequent DataFusion changes coupled with arrow changes, then yes a more frequent arrow release cycle would possibly be a pre-condition of moving to using a released version of arrow.

I am not sure whether we could get away with the multiple paths approach, e.g.

I've not come across this approach, I'd worry that it might be vulnerable to rust-lang/cargo#5478 which would prevent users from opting into newer versions of arrow within their workspaces, which imo would be unfortunate

@jorgecarleitao
Copy link
Member

but possibly somewhat tangential to making DataFusion as a library easier to consume

I agree.

My point is that the reason we use pinned hashes of arrow is so that we do not have to wait for a new release. So, I think that to stop pinning in DataFusion, we need to release arrow more frequently. But I agree that from the consumers' point of view, it is not needed, as you can just point to a hash in arrow-rs ^_^

@jimexist
Copy link
Member

jimexist commented Jul 1, 2021

@alamb this can be closed now

@alamb
Copy link
Contributor

alamb commented Jul 1, 2021

Indeed -- thanks @jimexist -- this issue was closed in #393 I think

@alamb alamb closed this as completed Jul 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants