
Roadmap 2021 H2 (discussion) #748

Closed · dennyglee opened this issue Aug 10, 2021 · 28 comments
Labels: enhancement (New feature or request)

@dennyglee (Contributor) commented Aug 10, 2021

This is the proposed Delta Lake 2021 H2 roadmap discussion thread. Below are the initial proposed items for the roadmap to be completed by December 2021. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!

| Issue | Description | Target CY2021 |
| --- | --- | --- |
| #731 | Improve the Delta protocol to support changes such as column drop and rename. | Q3 |
| #732 | Support Spark's column drop and rename commands. | Q3 |
| #101 | Streaming enhancements to the standalone reader to support Pulsar and Flink. | Q3 |
| #85 | Delta Standalone Writer: allow other connectors such as Flink, Kafka, and Pulsar to write to Delta. | Q4 |
| #733 | Support Apache Spark 3.2. | Q4 |
| #110 | Delta Source for Apache Flink: build a Flink/Delta source (i.e., Flink reads from Delta Lake), potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | CY2022 Q1 |
| #111 | Delta Sink for Apache Flink: build a Flink/Delta sink (i.e., Flink writes to Delta Lake), potentially leveraging the Delta Standalone Writer. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | Q4 |
| #82 | Delta Source for Trino: build a Trino/Delta reader, potentially leveraging the Delta Standalone Reader. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we have bi-weekly meetings on Thursdays. | Q3 |
| #338 | Delta Rust API: formally verify the S3 multi-writer design using stateright. | Q4 |
| #339 | Delta Rust API: low-level API for creating new Delta tables. | Q3 |
| #545 | Nessie / Delta integration: build tighter integration between Nessie and Delta to allow Nessie's Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #nessie channel; we have bi-weekly meetings on Tuesdays. | Q4 |
| — | lakeFS / Delta integration: build tighter integration between lakeFS and Delta to allow lakeFS's Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #lakefs channel; bi-weekly meetings will start soon. | Q4 |
| #112 | Delta Source for Apache Pulsar: build a Pulsar/Delta reader, potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack #connector-pulsar channel. | Q3 |
| #94 | Power BI connector: fix the issue with data sources that do not support streaming of binary files. | Q3 |
| #103 | Power BI connector: add inline documentation to the PQ function. | Q3 |
| #104 | Power BI connector: add support for TIMESTAMP AS OF (see the sketch below the table). | Q4 |
| #36, #116 | Update the existing Hive 2 connector (à la the Delta Standalone Reader) to support Hive 3. | Q3 |
| #746 | Restructure the delta.io website: update the delta.io website to allow for community blogs, include top community contributors, update the how-to-contribute guide, and move the code base to GitHub. | Q3 |
| #747 | Delta Guide: update the Delta documentation to include a Delta guide. | Q4 |
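
For reference on row #104, TIMESTAMP AS OF refers to Delta's time travel; a minimal Scala sketch of the equivalent DataFrame read (the table path is hypothetical):

```scala
// Time travel with the Delta DataFrame reader: load the table as it was at
// a past timestamp. `spark` is an existing SparkSession (e.g., spark-shell),
// and the table path is hypothetical.
val dfAsOf = spark.read
  .format("delta")
  .option("timestampAsOf", "2021-08-10") // snapshot at (or before) this time
  .load("/data/events")
```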

If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.

@dennyglee dennyglee pinned this issue Aug 10, 2021
@dennyglee dennyglee modified the milestones: 1.1.0, Future Roadmap Aug 14, 2021
@melin commented Aug 14, 2021

Could the open-source version consider supporting OPTIMIZE ZORDER BY?
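
For context, the Databricks SQL syntax being requested looks like the following (a sketch; table and column names are hypothetical, and the command was not yet available in open-source Delta):

```scala
// Compacts small files and co-locates rows along the given columns so that
// per-file min/max stats allow better data skipping. Databricks-only at the
// time of this thread; `spark` is an existing SparkSession.
spark.sql("OPTIMIZE events ZORDER BY (eventType, eventDate)")
```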

@dennyglee (Contributor, Author) commented

@melin This is a great idea and definitely something we're considering!

@melin commented Aug 28, 2021

Please support the `SHOW PARTITIONS tablename` SQL command.
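
For context, the requested command as it already exists in Spark SQL for Hive-style tables (a sketch; the table name is hypothetical):

```scala
// Lists the partitions of a partitioned table. This already worked for
// Hive-style tables in Spark SQL; the request is to support it for Delta
// tables too. `spark` is an existing SparkSession.
spark.sql("SHOW PARTITIONS events").show(false)
```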

@chengat1314 commented

@dennyglee, can we consider adding Delta writer support in Trino (Presto)? The main use case for Delta writer support in Trino (Presto) is CTAS (CREATE TABLE AS SELECT) queries; CTAS contributes more than 30% of our Trino (Presto) workload.

@nicknezis commented

I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

@dennyglee (Contributor, Author) commented

> I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

Absolutely! Please ping me via the Delta Users Slack channel and let's find a time to chat about this, eh?! Glad to help see if we can leverage the existing work for Apache Heron.

@YannByron (Contributor) commented

Will a Merge-On-Read mode be considered?

@YannByron (Contributor) commented

Will an index mechanism be considered? For columns specified by the user, an index could be built to accelerate query/update/delete operations.

@YannByron (Contributor) commented

Would it be possible to use Maven to manage the project instead of sbt? ^.^

@ericbellet commented

Hi guys, I have a question related to the roadmap for CDF (Change Data Feed). When will it be published as open source? Thanks in advance.
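
For context, the Databricks Change Data Feed read API that this asks to be open-sourced looks roughly like the following (a sketch; option names per the Databricks docs of the time, and the table path is hypothetical):

```scala
// Read the change feed (inserts/updates/deletes plus change-type metadata)
// between two table versions. Requires the table property
// delta.enableChangeDataFeed = true; `spark` is an existing SparkSession.
val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", "0")
  .option("endingVersion", "10")
  .load("/data/events") // hypothetical table path
```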

@gauravbrills commented

Are there any plans to open-source FSCK? It's otherwise a pain to repair large tables if you accidentally delete something in S3.
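
For context, the Databricks command in question, shown as a hedged sketch (the table name is hypothetical):

```scala
// FSCK REPAIR TABLE removes Delta log entries that point at files which no
// longer exist in the underlying storage (e.g., accidentally deleted from
// S3). Databricks-only at the time; `spark` is an existing SparkSession.
spark.sql("FSCK REPAIR TABLE events DRY RUN").show(false) // preview only
spark.sql("FSCK REPAIR TABLE events")                     // actually repair
```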

@dennyglee (Contributor, Author) commented

Hi @gauravbrills, we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. That said, since there is certainly value to this, perhaps we can chat on the Delta Users Slack about the deletion scenario you are running into. Thanks!

@gauravbrills commented

> Hi @gauravbrills, we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. That said, since there is certainly value to this, perhaps we can chat on the Delta Users Slack about the deletion scenario you are running into. Thanks!

Sure, thanks, I will check there. For now I just deleted that partition and reloaded it.

@ashokblend commented Oct 21, 2021

@dennyglee, can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap?
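
For context, a minimal, purely illustrative sketch of what per-file stats enable: Delta records min/max values per data file in the transaction log, and a query can prune any file whose range cannot match the predicate. The names and types below are hypothetical, not Delta's internals:

```scala
// Illustrative data skipping: a file can be skipped when the queried range
// [lo, hi] cannot overlap the file's recorded [min, max] statistics.
case class FileStats(path: String, min: Long, max: Long)

def canSkip(f: FileStats, lo: Long, hi: Long): Boolean =
  f.max < lo || f.min > hi

val files = Seq(
  FileStats("part-0.parquet", 1, 100),
  FileStats("part-1.parquet", 500, 900)
)

// Query for values in [150, 600]: part-0 is skipped, part-1 must be read.
val toRead = files.filterNot(f => canSkip(f, 150, 600))
```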

@dennyglee (Contributor, Author) commented

> @dennyglee, can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap?

Great call-out @ashokblend - we will consider it, though we cannot commit to this yet as we still need to prioritize and capacity-plan. Consider upvoting @ashokblend's comment so we can better ascertain your asks, eh?!

@cosmincatalin commented

With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

@dennyglee (Contributor, Author) commented

> With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

Hi @cosmincatalin - yes, per #733 we are actively working on this and will update these threads as soon as we determine the timeline for Delta 1.1. HTH!

@felipecoxa commented

Do you have any plans/dates for supporting OPTIMIZE ZORDER BY in the open-source version?

@dennyglee (Contributor, Author) commented

Hey @felipecoxa - yes, this is something we're definitely considering. Due to the amount of work this would entail, we're still determining the timeline on when we could work on this.

@pdonath commented Nov 16, 2021

Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

@dennyglee (Contributor, Author) commented

> Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. That said, could you please provide some context on your scenario so we can get feedback from the community for prioritization purposes? If you prefer, you can also Slack me directly on the Delta Users Slack. HTH!

@pdonath commented Nov 18, 2021

>> Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

> Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. That said, could you please provide some context on your scenario so we can get feedback from the community for prioritization purposes? If you prefer, you can also Slack me directly on the Delta Users Slack. HTH!

Thank you @dennyglee for the answer. I use Spark Structured Streaming with Delta Lake for aggregating events over time. In my case, the duration of a single Spark micro-batch should be more or less between 15 seconds and 2 minutes. After a few months of operation, the Delta Lake checkpoints are a bottleneck. The size of a single checkpoint is ~120 MB (after file compaction). When I look into the details of a micro-batch, I see:

- 20-40 seconds to process my data (including reading and writing)
- 30 seconds to read the latest checkpoint
- 40 seconds to write a new checkpoint (every 10 micro-batches)

I decreased the Parquet row group size, which made reading the latest checkpoint faster (it can be better parallelized); it now takes ~10 seconds (20% of the whole micro-batch is not perfect, but better). However, I'm not able to do anything to improve writing a new checkpoint. Writing multi-part checkpoints would probably help, especially if I could somehow control the number of parts.
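
For reference, the file layout being requested is already defined by the Delta protocol; a minimal Scala sketch of the naming scheme (version zero-padded to 20 digits, part numbers to 10):

```scala
// Multi-part checkpoint file names as defined by the Delta protocol:
//   <version>.checkpoint.<part>.<numParts>.parquet
// e.g. 00000000000000000120.checkpoint.0000000001.0000000003.parquet
def checkpointPartFile(version: Long, part: Int, numParts: Int): String =
  f"$version%020d.checkpoint.$part%010d.$numParts%010d.parquet"

// The _last_checkpoint file at the root of the log records the checkpoint
// version, its size, and (for multi-part checkpoints) the number of parts.
```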

@praateekmahajan commented

Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.

@dennyglee (Contributor, Author) commented

> Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.

Thanks @praateekmahajan - could you please create an issue in this GitHub repo so that we can discuss this more fully? Thanks!

@sa255304 commented Jan 8, 2022

@dennyglee: Could you guys consider adding Spark's dynamic partition overwrite functionality to Delta Lake as well?

It is a very important feature when running backfills for batch jobs; replaceWhere always requires a condition (see the sketch below).
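
For context, a minimal sketch contrasting the two approaches, assuming an existing DataFrame `df` with a `date` column; paths and names are hypothetical:

```scala
// Spark's dynamic partition overwrite: only the partitions present in `df`
// are replaced; no explicit condition is needed.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write
  .format("parquet") // works for file sources; the ask is Delta support
  .mode("overwrite")
  .partitionBy("date")
  .save("/data/events")

// Delta's replaceWhere: the range to replace must be spelled out explicitly.
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2022-01-01' AND date < '2022-01-08'")
  .save("/data/events")
```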

@dennyglee (Contributor, Author) commented

Thanks @sa255304 - we will be publishing the proposed 2022H1 roadmap by the end of the month and will definitely take your request into account. That said, I'm curious - would the Delta Lake 1.1 arbitrary replaceWhere help with your scenario?

@sa255304 commented Jan 9, 2022

@dennyglee: Thanks for considering the request. No, arbitrary replaceWhere doesn't help either, as it still requires me to write a condition.

@dennyglee (Contributor, Author) commented

Closing this issue as we can begin discussions in #920 - thanks!
