Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: clean up dependencies and feature flags #1014

Merged
merged 14 commits into from
Dec 30, 2022
Merged

Conversation

roeap
Copy link
Collaborator

@roeap roeap commented Dec 13, 2022

Description

This PR updates datafusion and related dependencies to their latest versions. Since datafusion now has improved support for loading partition columns with non string types, we update our scan methods to take advantage of that.

While working on dependencies, I took the opportunity to do some housekeeping.

  • do not use chrono with default features
  • make aws-profile from object_store optional. The upstream create explicitly discourages its usage, and it brings quite a few new dependencies, as it pulls in some aws sdk.
  • rename datafusion-ext feature to datafusion. The ext suffix is still from a time where there were less options to define features. I kept the ols feature around as an alias.

Related Issue(s)

closes #914

Documentation

@roeap roeap changed the title feat: load tables with non-string patition types feat: load tables with non-string partition types Dec 13, 2022
Copy link
Member

@rtyler rtyler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-ran the failed CI job, which was an unrelated cloud timeout 🤷

rust/Cargo.toml Outdated
Comment on lines 110 to 112
# AWS_PROFILE is experimental in object_store, and comes with a hefty load of dependencies.
# must be used in conjunction with s3 or s3-rustls feature
aws-profile = ["object_store/aws_profile"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind putting this functionality behind a feature flag, but this sort of change would be easier for us to discuss as a community if it were submitted separately.

Since @fvaleye just landed this functionality, I would like him to take a look at this pull request before merging too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this would have been better in a separate PR - here I was working in WSL and struggling with some memory issues, which is why this change made my live a bit easier there and then 😆.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just realized, that the current setup is likely slightly broken anyhow. IF we include the feature in the object-store crate by default, this will enable aws / s3 support in the object store crate as well, however we are not enabling either s3 or s3-rustls by default on our end, in which case delta.-rs will not be able to make use of these features.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following the previous comment, we could set this aws-profile feature with S3 features since it's only related to the AWS S3 backend storage.


- `s3` - enable the S3 storage backend to work with Delta Tables in AWS S3.
- `s3-rustls` - enable the S3 storage backend but rely on [rustls](https://github.com/ctz/rustls) rather than OpenSSL (`native-tls`).
- `glue` - enable the Glue data catalog to work with Delta Tables with AWS Glue.
- `azure` - enable the Azure storage backend to work with Delta Tables in Azure Data Lake Storage Gen2 accounts.
- `gcs` - enable the Google storage backend to work with Delta Tables in Google Cloud Storage.
- `datafusion-ext` - enable the `datafusion::datasource::TableProvider` trait implementation for Delta Tables, allowing them to be queried using [DataFusion](https://github.com/apache/arrow-datafusion).
- `datafusion` - enable the `datafusion::datasource::TableProvider` trait implementation for Delta Tables, allowing them to be queried using [DataFusion](https://github.com/apache/arrow-datafusion).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

rust/README.md Outdated Show resolved Hide resolved
rust/src/lib.rs Outdated Show resolved Hide resolved
rust/src/lib.rs Outdated Show resolved Hide resolved
))]
compile_error!("Feature aws-profile must be used together with s3 or s3-rustls feature");

#[cfg(all(feature = "s3", feature = "s3-rustls"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 157 to 167
let expected = vec![
"+----+----+----+",
"| c3 | c1 | c2 |",
"+----+----+----+",
"| 5 | 4 | c |",
"| 6 | 5 | b |",
"| 4 | 6 | a |",
"+----+----+----+",
];

assert_batches_sorted_eq!(&expected, &batches);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to understand this code, this macro in datafusion is a very surprising test helper to me!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I like these helpers quite a bit, especially once we dive into more complex operations, I believe this makes it quite straight forward to reason about the tests.

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
@roeap roeap mentioned this pull request Dec 15, 2022
@roeap
Copy link
Collaborator Author

roeap commented Dec 16, 2022

@rtyler @fvaleye - after #1020 has been merged, this PR now only clean up the dependencies and does some grooming on the feature flags.Could you take another look?

@roeap roeap changed the title feat: load tables with non-string partition types feat: clean up dependencies and feature flags Dec 16, 2022
andrei-ionescu
andrei-ionescu previously approved these changes Dec 16, 2022
@@ -77,14 +77,15 @@ glibc_version = { path = "../glibc_version", version = "0.1" }

[features]
default = ["arrow", "parquet"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also add aws-profile in the default feature list? profile based auth is a very common practice in production environment.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add it as default to s3* features, since those are the only ones where it is useful?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other cleanup piece we might want: we don't have a CI job that runs cargo check --no-default-features and on the main branch it doesn't succeed. Do you want to look at addressing that in this PR? Or should we create a separate issue for that?

@roeap
Copy link
Collaborator Author

roeap commented Dec 30, 2022

Good point - happy to add this in this PR.

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great cleanup, thank you!

@roeap roeap merged commit d444cdf into delta-io:main Dec 30, 2022
@roeap roeap deleted the dependencies branch December 30, 2022 23:03
chitralverma pushed a commit to chitralverma/delta-rs that referenced this pull request Mar 17, 2023
# Description

~~This PR updates datafusion and related dependencies to their latest
versions. Since datafusion now has improved support for loading
partition columns with non string types, we update our scan methods to
take advantage of that.~~

While working on dependencies, I took the opportunity to do some
housekeeping.

- do not use chrono with default features
- make `aws-profile` from object_store optional. The upstream create
explicitly discourages its usage, and it brings quite a few new
dependencies, as it pulls in some aws sdk.
- rename `datafusion-ext` feature to `datafusion`. The ext suffix is
still from a time where there were less options to define features. I
kept the ols feature around as an alias.

# Related Issue(s)

closes delta-io#914

# Documentation

<!---
Share links to useful documentation
--->

Co-authored-by: R. Tyler Croy <rtyler@brokenco.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use Chrono without default features
6 participants