
Roadmap 2022 H2 (discussion) #1307

Open
dennyglee opened this issue Aug 2, 2022 · 32 comments
Labels
enhancement New feature or request

Comments

@dennyglee
Contributor

dennyglee commented Aug 2, 2022

This is a working issue for folks to provide feedback on the prioritization of the Delta Lake priorities spanning July to December 2022. With the release of Delta Lake 2.0, we wanted to take the opportunity to discuss other vital features for prioritization with the community based on the feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), the Roadmap 2022H2 (discussion), and more.

Note: tasks that are crossed out have been completed.

To review the Delta Rust roadmap only, please refer to https://go.delta.io/rust-roadmap for more information.

Priority 0

We will focus on these issues and continue to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description | Status |
| --- | --- | --- | --- | --- |
| 256 | Flink | Flink Source | Build Flink source to read Delta tables in batch and streaming jobs | In Progress |
| 238 | Flink | Flink SQL + Table API + Catalog Support | After Flink Sink and Source, build support for Flink Catalog, SQL, and Table API | In Progress |
| 411, 410 | Flink | Productionize support for all cloud object stores | Make sure that Flink Sink can write robustly to S3, GCS, and ADLS2 with full transactional guarantees | In Progress |
| 610 | Rust | Integrate with a common object-store abstraction from the arrow / Rust ecosystem | This will allow us to provide a more convenient and performant API on the Rust and Python side | In Progress |
| 575 | Rust | Support V2 writer protocol | Have the PyArrow-based writer function (write_deltalake) support writer protocol V2 and the object stores S3, GCS, and ADLS2 | In Progress |
| 761 | Rust | Expand write support for cloud object stores | Write to object stores S3, GCS, and ADLS2 from multiple clusters with full transactional guarantees | In Progress |
| | Rust | DAT integration | Delta Acceptance Tests running in CI | In Progress |
| | Rust | Rust documentation | First pass at Rust docs | In Progress |
| | Rust | Rust blogging | Blog post for the Rust community | In Progress |
| 632 | Rust | Commit protocol | Fully protocol-compliant optimistic commit protocol | In Progress |
| 851 | Rust | Rust writer | Refactor the Rust writer API to be flexible for others wishing to build upon delta-rs | In Progress |
| 1257 | Spark | Release Delta 2.1 on Apache Spark 3.3 | Ensure the latest version of Delta Lake works with the latest version of Apache Spark™ | Released in Delta 2.1 |
| 1485 | Spark | Support reading tables with Deletion Vectors | Allow reads on tables that use deletion vectors to mark rows in parquet files as removed | Released in Delta 2.3 |
| 1408 | Spark | Support Table Features protocol | Upgrade the protocol to use Table Features to specify the features needed to read/write a table | Released in Delta 2.3 |
| 1242 | Spark | Support time travel SQL syntax | Delta currently supports time travel via the Python and Scala APIs; extend support to the SQL syntax VERSION AS OF and TIMESTAMP AS OF in SELECT statements | Released in Delta 2.1 |
| | Standalone | Extend Delta Standalone for higher protocol versions | Extend Delta Standalone to support logs using higher protocol versions and advanced features like constraints, generated columns, column mapping, etc. | In Progress |
| | Standalone | Expand support for data skipping in Delta Standalone | Extend the current data skipping to skip files using column stats and more expressions | In Progress |
| | Website | Updated Delta Lake documentation | Move the Delta Lake documentation to the website GitHub repo to allow easier community collaboration | In Progress |
| | Website | Consolidate all connector documentation | Consolidate the docs of all connectors in the website GitHub repo | In Progress |
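The optimistic commit protocol item (#632) reduces to put-if-absent semantics on versioned files in the `_delta_log` directory: a writer tries to atomically create the next numbered log file and, on conflict, retries at a later version. A minimal local-filesystem sketch, with hypothetical helper names (not the delta-rs API; object stores need a conditional-put or rename-if-absent equivalent):

```python
import json
import os


class CommitConflict(Exception):
    """Another writer already claimed this log version."""


def try_commit(log_dir: str, version: int, actions: list) -> None:
    # O_CREAT | O_EXCL gives put-if-absent semantics on a local
    # filesystem: creation fails if the version file already exists.
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise CommitConflict(version)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")


def commit_with_retry(log_dir: str, version: int, actions: list,
                      max_retries: int = 10) -> int:
    # On conflict, bump the version and retry; returns the version
    # that actually landed.
    for attempt in range(max_retries):
        try:
            try_commit(log_dir, version + attempt, actions)
            return version + attempt
        except CommitConflict:
            continue
    raise RuntimeError("too many commit conflicts")
```

On conflict, a real implementation would also re-read the winning commits and verify the transaction still applies (conflict resolution) before retrying, rather than blindly bumping the version.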

Priority 1

We should be able to deliver parts (or all) of each issue over the next six months.

| Issue | Category | Task | Description | Status |
| --- | --- | --- | --- | --- |
| 4 | Core | Delta Acceptance Testing (DAT) | With various languages interacting with the Delta protocol (e.g., Delta Standalone, Delta Spark, Delta Rust, Trino, etc.), we propose to have the same reference tables and library of reference tests to ensure all Delta APIs remain in compliance | In Progress |
| 1347 | Core | Support Bloom filters | Improve query performance by utilizing bloom filters. The approach is TBD due to recent updates to Apache Parquet to support bloom filters | Not Started |
| 1387 | Core | Enable Delta clone | Clone a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source; shallow clones do not | Shallow clone released in Delta 2.3 |
| | Delta connectors | GoLang Delta connector | Support GoLang reading a Delta Lake table natively | Not Started |
| | Delta connectors | Improve partition filtering in the Power BI client | Improved partition filtering using built-in UI filters in Power BI | Not Started |
| | Delta connectors | Pulsar Source connector | Support Apache Pulsar reading a Delta Lake table natively | Not Started |
| | Flink | Column stats generation in Flink Sink | Make the Flink Delta sink generate column stats | Not Started |
| | Presto/Trino | Support higher protocol versions in Presto and Trino | Use Standalone to support higher protocol versions | Not Started |
| | Rust | Delta Rust API updates | Update APIs and support more high-level operations on top of Delta; this includes better conflict resolution | NA |
| | Rust | Better support for large logs | Better support for handling large Delta logs/snapshots | NA |
| | Sharing Connectors | GoLang Delta Sharing client | Support a GoLang client for Delta Sharing | NA |
| | Sharing Connectors | R Delta Sharing client | Support an R client for Delta Sharing | NA |
| 1072 | Spark | Support for Identity columns | Create an identity column that will be automatically assigned a unique and statistically increasing (or decreasing if the step is negative) value | Not Started |
| | Spark | Support querying Change Data Feed (CDF) using SQL queries | To support querying CDF using SQL queries in Apache Spark, we need to allow custom TVFs to be resolved using injected rules | Released in Delta 2.3 |
| 1156 | Spark | Support Auto Compaction | Provide auto compaction functionality to simplify compaction tasks | In Progress |
| 1198 | Spark | Support Optimize Writes | Optimize Spark to Delta Lake writes | In Progress |
| 1462 | Spark | Enable converting from Iceberg to Delta | Enable converting parquet-backed Iceberg tables to Delta tables without rewriting parquet files | Released in Delta 2.3 |
| 1464 | Spark | Shallow clone Iceberg tables | Enable shallow cloning parquet-backed Iceberg tables following the Delta protocols without the need to copy all of the data | Released in Delta 2.3 |
| 1349 | Spark | Improve semantics of column mapping and Change Data Feed | Improve the semantics of how column renames/drops (aka column mapping) interact with CDF and streaming | Released in Delta 2.3 |
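For the Bloom filter item (#1347), the payoff is skipping whole files on point lookups: a per-file filter answers "definitely not here" with no false negatives and a tunable false-positive rate. A toy sketch, assuming one filter per data file (illustrative only; Parquet's native filters are split-block Bloom filters):

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bitset."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value: str):
        # Derive k independent bit positions from seeded hashes.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        # False means "definitely absent"; True means "maybe present".
        return all(self.bits & (1 << pos) for pos in self._positions(value))


def files_to_scan(filters: dict, value: str) -> list:
    """Given {file path: BloomFilter}, keep only files that may match."""
    return [path for path, bf in filters.items() if bf.might_contain(value)]
```

A point lookup then scans only the files whose filter reports a possible hit, instead of every file whose min/max range happens to cover the value.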

Priority 2

Nice to have

| Issue | Category | Task | Description | Status |
| --- | --- | --- | --- | --- |
| | Sharing | Share individual partitions | Support sharing individual partitions in Delta Sharing | NA |
| | Sharing Connectors | Rust Delta Sharing client | Support a Rust client for Delta Sharing | NA |
| | Sharing Connectors | Starburst/Trino Delta Sharing connector | Support a Starburst/Trino client for Delta Sharing | NA |
| | Sharing Connectors | Airflow Delta Sharing connector | Support sharing data from an Airflow sensor | NA |
| | Rust | Process | Release improvements | NA |
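Sharing individual partitions (the first Priority 2 item) would amount, on the sharing server, to filtering the table's file list by partition values before returning pre-signed URLs. A schematic sketch with assumed data shapes (a `partitionValues` map per file, mirroring the Delta log's add actions):

```python
def files_for_shared_partitions(files: list, shared: dict) -> list:
    """Keep only files whose partition values match every shared key.

    files:  list of dicts shaped like Delta add actions,
            each with a "partitionValues" mapping.
    shared: the partition values granted by the share,
            e.g. {"date": "2022-08-01"}.
    """
    return [
        f for f in files
        if all(f["partitionValues"].get(k) == v for k, v in shared.items())
    ]
```

Because Delta already records partition values per file in the log, the recipient never learns about files outside the shared partitions.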

History

  • 2022-08-01: Initial creation
  • 2022-08-02: Delta Sharing updates
  • 2022-08-08: Include Identity columns in the roadmap
  • 2022-09-13: Update issues and add auto compaction, optimize writes, and bloom filters to the roadmap
  • 2022-09-19: Update to include Delta Clone
  • 2022-09-22: Including working Delta Rust roadmap document
  • 2022-10-26: Included updated Delta Rust roadmap in GitHub link
  • 2022-10-27: Included converting and shallow cloning Iceberg to Delta
@dennyglee dennyglee added the enhancement New feature or request label Aug 2, 2022
@dennyglee
Contributor Author

Note: we will be adding to and updating this issue over the next few weeks, but I'm a little behind schedule, so I thought I would get the roadmap discussion started ASAP. Thanks!

@edfreeman

Hi folks. Can't see it explicitly mentioned so thought I'd ask - will identity columns support (i.e. writer version 6) be added in this H2 wave? That's a big feature we're keen to be able to use outside of Databricks, and it didn't quite make it into 2.0 by the looks of things.

@dennyglee
Contributor Author

> Hi folks. Can't see it explicitly mentioned so thought I'd ask - will identity columns support (i.e. writer version 6) be added in this H2 wave? That's a big feature we're keen to be able to use outside of Databricks, and it didn't quite make it into 2.0 by the looks of things.

Thanks for the call out @edfreeman - identity columns has been added :)

@sezruby
Contributor

sezruby commented Aug 18, 2022

Hi @dennyglee, what about Auto Compaction and Optimize Write? I don't think the PRs are getting much attention for review/merge. Could you add them to the roadmap?

@dennyglee
Contributor Author

dennyglee commented Aug 18, 2022 via email

@keen85

keen85 commented Aug 19, 2022

Hi @dennyglee, what about support for displaying DDL of delta tables (SHOW CREATE TABLE)
#1032
#1255

@dennyglee
Contributor Author

> Hi @dennyglee, what about support for displaying DDL of delta tables (SHOW CREATE TABLE) #1032 #1255

Good call out @keen85 - let me check with @zpappa on this!

@zpappa

zpappa commented Aug 19, 2022

> Hi @dennyglee, what about support for displaying DDL of delta tables (SHOW CREATE TABLE) #1032 #1255
>
> Good call out @keen85 - let me check with @zpappa on this!

I have some minor style and test updates for this PR to be considered done; I can finish them today and we can try to pull them into 2.1.

@dudzicp

dudzicp commented Aug 25, 2022

How about the Delta caching that is present on Databricks? Is there a plan for such a feature?

@tdas
Contributor

tdas commented Aug 25, 2022

"Delta caching" is actually a Databricks Runtime engine feature, not part of the format. Caching data on a processing engine's executor/worker nodes is something that can really only be done well by the engine itself, not by a data format. It's unfortunate and confusing that we marketed it under the "Delta" brand name, even though it's really not part of the "Delta Lake" storage format. So, in short, it's not really possible to open source that as part of Delta Lake.

@djouallah

djouallah commented Aug 26, 2022

I have been experimenting with Delta Lake on Google Cloud with DuckDB, and it is very promising, but without a local SSD cache it will never be fast enough. Maybe we need a cache for Delta independent of the Databricks implementation: Delta knows which files need to be scanned, so keeping a local copy after the first call would be really useful, at least for the standalone reader.
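The caching pattern described above is engine-side and can be sketched independently of any Delta library: because Delta data files are immutable, a cache keyed by object-store path never goes stale. A minimal sketch (the `fetch` callable standing in for a real object-store client is an assumption):

```python
import hashlib
import os


class LocalFileCache:
    """Caches remote objects on local disk, keyed by their store path."""

    def __init__(self, cache_dir: str, fetch):
        self.cache_dir = cache_dir
        self.fetch = fetch  # callable: remote path -> bytes
        self.hits = 0
        self.misses = 0
        os.makedirs(cache_dir, exist_ok=True)

    def _local_path(self, remote_path: str) -> str:
        # Hash the remote path so any URI becomes a safe file name.
        name = hashlib.sha256(remote_path.encode()).hexdigest()
        return os.path.join(self.cache_dir, name)

    def get(self, remote_path: str) -> bytes:
        local = self._local_path(remote_path)
        if os.path.exists(local):
            self.hits += 1
        else:
            self.misses += 1
            data = self.fetch(remote_path)
            tmp = local + ".tmp"
            with open(tmp, "wb") as f:
                f.write(data)
            os.replace(tmp, local)  # atomic publish into the cache
        with open(local, "rb") as f:
            return f.read()
```

Immutability is what makes this safe: unlike caching mutable files, there is no invalidation problem, only an eviction policy to choose.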

@khwj

khwj commented Aug 28, 2022

Hi @dennyglee, what about supporting ANALYZE TABLE #581 and Bloom filter indexes #1347?

@MaksGS09

Hi!
How about BigQuery integration?

@sezruby
Contributor

sezruby commented Sep 1, 2022

Hi @dennyglee any update?

@dennyglee
Contributor Author

Sorry about that @sezruby - yes, we will be adding these to the roadmap very shortly. Thanks for your patience (I’ve been out the last two weeks)

@dennyglee
Contributor Author

Some quick updates:

HTH!

@p2bauer

p2bauer commented Sep 18, 2022

Hi @dennyglee! I know it was on the 2022 H1 GitHub page, but I haven't seen any mention of the clone functionality being moved into the open source library. Is there any update around that? I poked around the current source code but didn't really see it anywhere.

@dennyglee
Contributor Author

Thanks @p2bauer - great call out. I've added this to the roadmap and created issue #1387 to track this. HTH!

@SanthoshPrasad

Hi @dennyglee , Is there any update on supporting higher protocol versions in Presto and Trino?

@dennyglee
Contributor Author

Great question @SanthoshPrasad - we've been working with the PrestoDB and Trino communities on this, and we should have updates on progress over the next couple of months. One of the ways we're doing this is through our DAT (Delta Acceptance Testing) effort, so we can more cleanly document and clarify which APIs are on which protocol version. If you're interested in learning more, please join us in the #dat channel in Delta Users Slack. HTH!

@dennyglee
Contributor Author

Suggest we add Airbyte Destination S3: add delta lake/delta table support to the roadmap as it's already part of the Delta Rust Roadmap - WDYT?

@melin

melin commented Oct 27, 2022

Support JDBC catalog
#1459

@keen85

keen85 commented Oct 30, 2022

I'd like to suggest adding "Register VACUUM in delta log" to the roadmap

@dudzicp

dudzicp commented Nov 9, 2022

I know that at each commit, min/max values are calculated for each parquet file and are present in the delta log JSON, but how about adding more granularity to the existing data skipping mechanism by using parquet page skipping?
Relevant links:

Would this be doable?
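The proposal above is the same interval-overlap test applied at two granularities: file-level min/max from the Delta log first, then page (or row-group) statistics from the Parquet footer for the files that survive. A schematic sketch with made-up stats dictionaries:

```python
def overlaps(stats: dict, lo, hi) -> bool:
    """A range predicate [lo, hi] can only match if it overlaps [min, max]."""
    return stats["min"] <= hi and stats["max"] >= lo


def prune(files: dict, lo, hi) -> dict:
    """Two-level pruning: skip whole files, then skip pages within them.

    files: {path: {"stats": file-level min/max from the delta log,
                   "pages": [page-level min/max from the parquet footer]}}
    Returns {path: [indexes of pages that must be read]}.
    """
    plan = {}
    for path, meta in files.items():
        if not overlaps(meta["stats"], lo, hi):
            continue  # whole file skipped via delta log stats
        # Finer granularity: keep only the pages that can match.
        plan[path] = [i for i, page in enumerate(meta["pages"])
                      if overlaps(page, lo, hi)]
    return plan
```

The file-level stats live in the Delta log; the page-level stats would come from Parquet's column index, so this second level is a reader-side feature rather than a log change.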

@dennyglee
Contributor Author

dennyglee commented Nov 15, 2022

> I know that at each commit, min/max values are calculated for each parquet file and are present in the delta log JSON, but how about adding more granularity to the existing data skipping mechanism by using parquet page skipping? Relevant links:
>
> Would this be doable?

@dudzicp Oh, could you please create a separate issue for this and we can discuss the specifics there? Thanks!

@dudzicp

dudzicp commented Feb 10, 2023

How about bucketing?

@benbauer89

Hey, I am interested in more details regarding https://delta.io/sharing/. It states that Presto and Trino support is coming soon, but I could not really find any details or timelines. Please note that I am asking about Delta Sharing in the context of Unity Catalog in particular, and not necessarily about the Delta and Trino/Presto integration.

@dennyglee
Contributor Author

Ahh, for Delta Sharing features within the context of UC, please ping Databricks community. Thanks!

@robertkossendey

Hey @dennyglee any updates on the Roadmap? :) I was creating some issues in the mack project (e.g. Python support for table property update) but wanted to make sure that the delta-spark team is not working on the things that I came up with already.

@dennyglee
Contributor Author

Thanks for your patience @robertkossendey - we're working on this but are admittedly way behind schedule due to all of the various asks, eh?! That said, please continue working on the mack project activities, as those are ones we're pretty sure make more sense for mack to address; at the very least, if we plan to merge them into delta-spark, it'll be further out on the roadmap. Thanks for the ping, eh?!

@robertkossendey

Good to know, thank you for the update @dennyglee!

@felipepessoto
Contributor

@dennyglee @allisonport-db do you have any updates on auto compact and optimize write?
