Roadmap 2021 H2 (discussion) #748
Comments
Would you consider supporting OPTIMIZE ZORDER BY in the open-source version? |
@melin This is a great idea and definitely something we're considering! |
Please support the SHOW PARTITIONS tablename SQL command. |
@dennyglee |
I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work). |
Absolutely! Please ping me via the Delta Users Slack channel and let's find a time to chat on this, eh?! Glad to help see if we can leverage existing work for Apache Heron. |
Merge-On-Read Mode? |
Will an index mechanism be considered? For columns specified by the user, build an index to accelerate query/update/delete operations. |
Is it possible to use Maven to manage the project instead of sbt? ^.^ |
Hi guys, I have a question related to the roadmap of CDF. When will it be published as open source? Thanks in advance. |
Are there any plans to open-source FSCK? Otherwise it's a pain to repair large tables if you accidentally delete something in S3. |
Hi @gauravbrills, we had not considered open-sourcing FSCK due to the limited number of requests for this functionality. That said, there is certainly value in it, so perhaps we can chat on the Delta Users Slack about the deletion scenario you are running into. Thanks! |
Sure, thanks, I will check there. For now I just deleted that partition and reloaded it. |
@dennyglee can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap? |
Great callout @ashokblend - we will consider it, though we cannot commit to this yet as we need to prioritize and capacity plan. Consider voting on @ashokblend's comment so we can better ascertain your asks, eh?! |
With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading? |
Hi @cosmincatalin - yes, per #733 we are actively working on this and will update these threads as soon as we determine the timeline for Delta 1.1. HTH! |
Do you have any plans or a date for supporting OPTIMIZE ZORDER BY in the open-source version? |
Hey @felipecoxa - yes, this is something we're definitely considering. Due to the amount of work this would entail, we're still determining the timeline on when we could work on this. |
Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as I see, open-source
Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. Saying this, could you please provide the context on your scenario so we can get feedback from the community for prioritization purposes. If you prefer, you can also slack me directly on the Delta Users slack. HTH! |
Thank you @dennyglee for the answer. I use Spark Structured Streaming with Delta Lake to aggregate events in time. In my case, a single Spark micro-batch should take roughly between 15 seconds and 2 minutes. After a few months of running, Delta Lake checkpoints have become a bottleneck. The size of a single checkpoint is ~120 MB (after file compaction). When I look into the micro-batch details, I can see:
I decreased the Parquet row group size, which makes reading the latest checkpoint faster (it can be better parallelized). Now it takes ~10 seconds (20% of the whole micro-batch is not perfect, but better). However, I'm not able to do anything to improve writing a new checkpoint. Writing multi-part checkpoints would probably help, especially if I could somehow control the number of parts. |
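For readers unfamiliar with multi-part checkpoints, the following is a minimal sketch of the file naming the Delta protocol describes, where the checkpoint for one table version is split across several Parquet files. The helper name `multipart_checkpoint_names` is hypothetical, and the padding widths are stated as assumptions in the code. It only builds file names; it does not write checkpoints and is not Delta Lake's own implementation.

```python
from typing import List


def multipart_checkpoint_names(version: int, num_parts: int) -> List[str]:
    """Build the file names for a multi-part checkpoint of one table version.

    Assumed naming scheme (from the Delta protocol): the version is
    zero-padded to 20 digits, and the 1-based part index and the total
    part count are each zero-padded to 10 digits.
    """
    if version < 0 or num_parts < 1:
        raise ValueError("version must be >= 0 and num_parts must be >= 1")
    return [
        f"{version:020d}.checkpoint.{part:010d}.{num_parts:010d}.parquet"
        for part in range(1, num_parts + 1)
    ]


if __name__ == "__main__":
    # Hypothetical example: version 10 split into 3 parts.
    for name in multipart_checkpoint_names(version=10, num_parts=3):
        print(name)
```

Because each part is a separate file, parts could in principle be written and read in parallel, which is why controlling the number of parts matters for the scenario above.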
Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context. |
Thanks @praateekmahajan - could you please create an issue in this GitHub repo so that we can discuss this more fully? Thanks! |
@dennyglee: Could you consider adding Spark's dynamic partition overwrite functionality to Delta Lake as well? It is a very important feature when running backfills for batch jobs. replaceWhere always requires a condition. |
Thanks @sa255304 - we will be publishing the proposed 2022H1 roadmap by the end of the month. We will definitely take your request into account. Saying this, I'm curious - would the Delta Lake 1.1 arbitrary
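To make the difference in this request concrete, here is a toy in-memory sketch in plain Python. It is not Spark or Delta Lake code, and the function names `replace_where` and `dynamic_partition_overwrite` are hypothetical. It assumes a simplified model in which a table is a mapping from partition value to rows, and in which a replaceWhere-style write rejects rows that fall outside its condition.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, str]
Table = Dict[str, List[Row]]  # partition value -> rows in that partition


def group_by_partition(rows: List[Row], key: str) -> Table:
    """Group rows by the value of their partition column."""
    grouped: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return dict(grouped)


def replace_where(table: Table, new_rows: List[Row], key: str,
                  predicate: Callable[[str], bool]) -> Table:
    """Replace every partition matching an explicit, caller-supplied condition."""
    incoming = group_by_partition(new_rows, key)
    outside = [p for p in incoming if not predicate(p)]
    if outside:
        raise ValueError(f"rows outside the replaceWhere condition: {outside}")
    # Drop all partitions the condition selects, then add the new data.
    result = {p: rows for p, rows in table.items() if not predicate(p)}
    result.update(incoming)
    return result


def dynamic_partition_overwrite(table: Table, new_rows: List[Row], key: str) -> Table:
    """Replace only the partitions that appear in the incoming rows."""
    result = dict(table)
    result.update(group_by_partition(new_rows, key))
    return result


if __name__ == "__main__":
    table = {
        "2021-01-01": [{"date": "2021-01-01", "value": "a"}],
        "2021-01-02": [{"date": "2021-01-02", "value": "old"}],
    }
    backfill = [{"date": "2021-01-02", "value": "new"}]
    print(replace_where(table, backfill, "date", lambda d: d == "2021-01-02"))
    print(dynamic_partition_overwrite(table, backfill, "date"))
```

In this model both calls produce the same table for a single-partition backfill; the difference is that the first needs the caller to spell out which partitions are affected, while the second infers them from the data being written.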
@dennyglee : Thanks for considering the request. No arbitrary |
Closing this issue as we can begin discussions in #920 - thanks! |
This is the proposed Delta Lake 2021 H2 roadmap discussion thread. Below are the initial proposed items for the roadmap to be completed by December 2021. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!
If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.