
Roadmap 2021 H2 (discussion) #748

Closed · dennyglee opened this issue Aug 10, 2021 · 28 comments
Labels: enhancement (New feature or request)

@dennyglee (Contributor) commented Aug 10, 2021

This is the proposed Delta Lake 2021 H2 roadmap discussion thread. Below are the initial proposed items for the roadmap to be completed by December 2021. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!

| Issue | Description | Target CY2021 |
| --- | --- | --- |
| #731 | Improve the Delta protocol to support changes such as column drop and rename. | Q3 |
| #732 | Support Spark's column drop and rename commands. | Q3 |
| #101 | Streaming enhancements to the standalone reader to support Pulsar and Flink. | Q3 |
| #85 | Delta Standalone Writer: allow other connectors such as Flink, Kafka, and Pulsar to write to Delta. | Q4 |
| #733 | Support Apache Spark 3.2. | Q4 |
| #110 | Delta Source for Apache Flink: build a Flink/Delta source (i.e., Flink reads from Delta Lake), potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | CY2022 Q1 |
| #111 | Delta Sink for Apache Flink: build a Flink/Delta sink (i.e., Flink writes to Delta Lake), potentially leveraging the Delta Standalone Writer. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays. | Q4 |
| #82 | Delta Source for Trino: build a Trino/Delta reader, potentially leveraging the Delta Standalone Reader. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we have bi-weekly meetings on Thursdays. | Q3 |
| #338 | Delta Rust API: formally verify the S3 multi-writer design using stateright. | Q4 |
| #339 | Delta Rust API: low-level API for creating new Delta tables. | Q3 |
| #545 | Nessie / Delta integration: build tighter integration between Nessie and Delta to allow Nessie's Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #nessie channel; we have bi-weekly meetings on Tuesdays. | Q4 |
| — | lakeFS / Delta integration: build tighter integration between lakeFS and Delta to allow lakeFS's Git-like experience for data lakes to work with Delta Lake. This is a community effort and all are welcome! Join us via the Delta Users Slack #lakefs channel; bi-weekly meetings will start soon. | Q4 |
| #112 | Delta Source for Apache Pulsar: build a Pulsar/Delta reader, potentially leveraging the Delta Standalone Reader. Join us via the Delta Users Slack #connector-pulsar channel. | Q3 |
| #94 | Power BI connector: fix the issue with data sources that do not support streaming of binary files. | Q3 |
| #103 | Power BI connector: add inline documentation to the PQ function. | Q3 |
| #104 | Power BI connector: add support for TIMESTAMP AS OF (see the sketch below the table). | Q4 |
| #36, #116 | Update the existing Hive 2 connector (à la the Delta Standalone Reader) to support Hive 3. | Q3 |
| #746 | Restructure the delta.io website: update the delta.io website to allow for community blogs, include top community contributors, update the how-to-contribute guide, and move the code base to GitHub. | Q3 |
| #747 | Delta Guide: update the Delta documentation to include a Delta guide. | Q4 |
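
For reference on row #104, TIMESTAMP AS OF refers to Delta's time travel; a minimal Scala sketch of the equivalent DataFrame read (the table path is hypothetical):

```scala
// Time travel with the Delta DataFrame reader: load the table as it was at
// a past timestamp. `spark` is an existing SparkSession (e.g., spark-shell),
// and the table path is hypothetical.
val dfAsOf = spark.read
  .format("delta")
  .option("timestampAsOf", "2021-08-10") // snapshot at (or before) this time
  .load("/data/events")
```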

If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.

@dennyglee dennyglee pinned this issue Aug 10, 2021
@dennyglee dennyglee modified the milestones: 1.1.0, Future Roadmap Aug 14, 2021
@melin commented Aug 14, 2021

Could the open-source version consider supporting OPTIMIZE ZORDER BY?
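
For context, the Databricks SQL syntax being requested looks like the following (a sketch; table and column names are hypothetical, and the command was not yet available in open-source Delta):

```scala
// Compacts small files and co-locates rows along the given columns so that
// per-file min/max stats allow better data skipping. Databricks-only at the
// time of this thread; `spark` is an existing SparkSession.
spark.sql("OPTIMIZE events ZORDER BY (eventType, eventDate)")
```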

@dennyglee (Contributor, Author) commented

@melin This is a great idea and definitely something we're considering!

@melin commented Aug 28, 2021

Please support the `SHOW PARTITIONS tablename` SQL command.
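
For context, the requested command as it already exists in Spark SQL for Hive-style tables (a sketch; the table name is hypothetical):

```scala
// Lists the partitions of a partitioned table. This already worked for
// Hive-style tables in Spark SQL; the request is to support it for Delta
// tables too. `spark` is an existing SparkSession.
spark.sql("SHOW PARTITIONS events").show(false)
```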

@chengat1314 commented

@dennyglee, can we consider adding Delta writer support in Trino (Presto)? The main use case for Delta writer support in Trino (Presto) is CTAS (CREATE TABLE AS SELECT) queries; CTAS contributes more than 30% of our Trino (Presto) workload.

@nicknezis commented

I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

@dennyglee (Contributor, Author) commented

> I would love to add a Delta Source and Sink for Apache Heron (perhaps we can collaborate with the equivalent Flink work).

Absolutely! Please ping me via the Delta Users Slack channel and let's find a time to chat about this, eh?! Glad to help see if we can leverage the existing work for Apache Heron.

@YannByron (Contributor) commented

Will a Merge-On-Read mode be considered?

@YannByron (Contributor) commented

Will an index mechanism be considered? For columns specified by the user, an index could be built to accelerate query/update/delete operations.

@YannByron (Contributor) commented

Would it be possible to use Maven to manage the project instead of sbt? ^.^

@ericbellet commented

Hi guys, I have a question related to the roadmap for CDF (Change Data Feed). When will it be published as open source? Thanks in advance.
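
For context, the Databricks Change Data Feed read API that this asks to be open-sourced looks roughly like the following (a sketch; option names per the Databricks docs of the time, and the table path is hypothetical):

```scala
// Read the change feed (inserts/updates/deletes plus change-type metadata)
// between two table versions. Requires the table property
// delta.enableChangeDataFeed = true; `spark` is an existing SparkSession.
val changes = spark.read
  .format("delta")
  .option("readChangeFeed", "true")
  .option("startingVersion", "0")
  .option("endingVersion", "10")
  .load("/data/events") // hypothetical table path
```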

@gauravbrills commented

Are there any plans to open-source FSCK? It's otherwise a pain to repair large tables if you accidentally delete something in S3.
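
For context, the Databricks command in question, shown as a hedged sketch (the table name is hypothetical):

```scala
// FSCK REPAIR TABLE removes Delta log entries that point at files which no
// longer exist in the underlying storage (e.g., accidentally deleted from
// S3). Databricks-only at the time; `spark` is an existing SparkSession.
spark.sql("FSCK REPAIR TABLE events DRY RUN").show(false) // preview only
spark.sql("FSCK REPAIR TABLE events")                     // actually repair
```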

@dennyglee (Contributor, Author) commented

Hi @gauravbrills, we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. That said, since there is certainly value to this, perhaps we can chat on the Delta Users Slack about the deletion scenario you are running into. Thanks!

@gauravbrills commented

> Hi @gauravbrills, we had not considered open-sourcing FSCK due to the limited asks for this particular functionality. That said, since there is certainly value to this, perhaps we can chat on the Delta Users Slack about the deletion scenario you are running into. Thanks!

Sure, thanks, I will check there. For now I just deleted that partition and reloaded it.

@ashokblend commented Oct 21, 2021

@dennyglee, can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap?
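
For context, a minimal, purely illustrative sketch of what per-file stats enable: Delta records min/max values per data file in the transaction log, and a query can prune any file whose range cannot match the predicate. The names and types below are hypothetical, not Delta's internals:

```scala
// Illustrative data skipping: a file can be skipped when the queried range
// [lo, hi] cannot overlap the file's recorded [min, max] statistics.
case class FileStats(path: String, min: Long, max: Long)

def canSkip(f: FileStats, lo: Long, hi: Long): Boolean =
  f.max < lo || f.min > hi

val files = Seq(
  FileStats("part-0.parquet", 1, 100),
  FileStats("part-1.parquet", 500, 900)
)

// Query for values in [150, 600]: part-0 is skipped, part-1 must be read.
val toRead = files.filterNot(f => canSkip(f, 150, 600))
```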

@dennyglee (Contributor, Author) commented

> @dennyglee, can we consider stats collection for Delta Lake files, for data skipping, as part of this roadmap?

Great call-out @ashokblend - we will consider it, though we cannot commit to this yet as we still need to prioritize and capacity-plan. Consider upvoting @ashokblend's comment so we can better ascertain your asks, eh?!

@cosmincatalin commented

With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

@dennyglee (Contributor, Author) commented

> With Spark 3.2 available in Databricks, shouldn't support for it be considered sooner as it hinders upgrading?

Hi @cosmincatalin - yes, per #733 we are actively working on this and will update these threads as soon as we determine the timeline for Delta 1.1. HTH!

@felipecoxa commented

Do you have any plans/dates for supporting OPTIMIZE ZORDER BY in the open-source version?

@dennyglee (Contributor, Author) commented

Hey @felipecoxa - yes, this is something we're definitely considering. Due to the amount of work this would entail, we're still determining the timeline on when we could work on this.

@pdonath commented Nov 16, 2021

Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

@dennyglee (Contributor, Author) commented

> Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. That said, could you please provide some context on your scenario so we can get feedback from the community for prioritization purposes? If you prefer, you can also Slack me directly on the Delta Users Slack. HTH!

@pdonath commented Nov 18, 2021

>> Do you have plans to implement multi-part checkpoint writing in the open-source Delta Lake? It is defined in the protocol and works in the Databricks version, but as far as I can see, the open-source org.apache.spark.sql.delta.Checkpoints always writes a checkpoint to a single file.

> Great question @pdonath - we don't have any plans to address multi-part checkpoint writing yet. That said, could you please provide some context on your scenario so we can get feedback from the community for prioritization purposes? If you prefer, you can also Slack me directly on the Delta Users Slack. HTH!

Thank you @dennyglee for the answer. I use Spark Structured Streaming with Delta Lake for aggregating events over time. In my case, the duration of a single Spark micro-batch should be more or less between 15 seconds and 2 minutes. After a few months of operation, the Delta Lake checkpoints are a bottleneck. The size of a single checkpoint is ~120 MB (after file compaction). When I look into the details of a micro-batch, I see:

- 20-40 seconds to process my data (including reading and writing)
- 30 seconds to read the latest checkpoint
- 40 seconds to write a new checkpoint (every 10 micro-batches)

I decreased the Parquet row group size, which made reading the latest checkpoint faster (it can be better parallelized); it now takes ~10 seconds (20% of the whole micro-batch is not perfect, but better). However, I'm not able to do anything to improve writing a new checkpoint. Writing multi-part checkpoints would probably help, especially if I could somehow control the number of parts.
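
For reference, the file layout being requested is already defined by the Delta protocol; a minimal Scala sketch of the naming scheme (version zero-padded to 20 digits, part numbers to 10):

```scala
// Multi-part checkpoint file names as defined by the Delta protocol:
//   <version>.checkpoint.<part>.<numParts>.parquet
// e.g. 00000000000000000120.checkpoint.0000000001.0000000003.parquet
def checkpointPartFile(version: Long, part: Int, numParts: Int): String =
  f"$version%020d.checkpoint.$part%010d.$numParts%010d.parquet"

// The _last_checkpoint file at the root of the log records the checkpoint
// version, its size, and (for multi-part checkpoints) the number of parts.
```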

@praateekmahajan commented

Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.

@dennyglee (Contributor, Author) commented

> Ability to perform custom Add and/or Delete commits directly in the Delta Log. This Slack thread on Delta Slack Community has some more context.

Thanks @praateekmahajan - could you please create an issue in this GitHub repo so that we can discuss this more fully? Thanks!

@sa255304 commented Jan 8, 2022

@dennyglee: Could you guys consider adding Spark's dynamic partition overwrite functionality to Delta Lake as well?

It is a very important feature when running backfills for batch jobs; replaceWhere always requires a condition (see the sketch below).
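
For context, a minimal sketch contrasting the two approaches, assuming an existing DataFrame `df` with a `date` column; paths and names are hypothetical:

```scala
// Spark's dynamic partition overwrite: only the partitions present in `df`
// are replaced; no explicit condition is needed.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write
  .format("parquet") // works for file sources; the ask is Delta support
  .mode("overwrite")
  .partitionBy("date")
  .save("/data/events")

// Delta's replaceWhere: the range to replace must be spelled out explicitly.
df.write
  .format("delta")
  .mode("overwrite")
  .option("replaceWhere", "date >= '2022-01-01' AND date < '2022-01-08'")
  .save("/data/events")
```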

@dennyglee (Contributor, Author) commented

Thanks @sa255304 - we will be publishing the proposed 2022H1 roadmap by the end of the month and will definitely take your request into account. That said, I'm curious - would the Delta Lake 1.1 arbitrary replaceWhere help with your scenario?

@sa255304 commented Jan 9, 2022

@dennyglee: Thanks for considering the request. No, arbitrary replaceWhere doesn't help either, as it still requires me to write a condition.

@dennyglee (Contributor, Author) commented

Closing this issue as we can begin discussions in #920 - thanks!
