
Update README.md #5

Closed

databricks-david-lewis
Contributor

Minor typo

…ting rewrite of large partition into multiple jobs

This PR aims to alleviate some pain points of using Z-Order in production, specifically around dealing with large amounts of data.

See these tickets for background motivation:
https://databricks.atlassian.net/browse/SC-13048
https://databricks.atlassian.net/browse/SC-15987

## What changes were proposed in this pull request?
This PR adds one new configuration option for controlling the target size of Z-Cubes.
It also changes the default Delta max file size to be 20% larger than the default min file size.

This PR reimagines `OptimizeTask` as a trait with two implementations, one for Z-Order and one for Compaction. The `OptimizeTableCommand.execute` method now takes a sequence of `OptimizeTask`s and executes them 15 at a time. Previously, an `OptimizeTask` was a sequence of files that was interpreted differently depending on whether we were performing a compaction or a Z-Order optimize. In the Z-Order case each task represented an unbounded amount of data, generally around 100 GB, while in the Compaction case each task was 1 GB or less. In both cases we would attempt to perform 400 of these tasks in one Spark job, and run 15 of these jobs at a time.
The new strategy expands `OptimizeTask` to encompass one Delta transaction, allowing the size of each batch to be configured separately for Z-Order and Compaction.
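
The batching idea described above can be sketched roughly as follows. This is an illustration in Python rather than the project's Scala, and it is not the actual implementation: `DataFile`, `CompactionTask`, `ZOrderTask`, `plan_tasks`, and this `execute` are hypothetical stand-ins, and the batch size is a parameter rather than a real default. Only the concurrency limit of 15 comes from the description.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence, Type

GB = 1024 ** 3


@dataclass(frozen=True)
class DataFile:
    path: str
    size: int  # bytes


class OptimizeTask(ABC):
    """One unit of work, intended to map to one Delta transaction."""

    def __init__(self, files: Sequence[DataFile]):
        self.files = list(files)

    @property
    def total_size(self) -> int:
        return sum(f.size for f in self.files)

    @abstractmethod
    def run(self) -> str:
        ...


class CompactionTask(OptimizeTask):
    def run(self) -> str:
        # Placeholder for rewriting small files into larger ones.
        return f"compacted {len(self.files)} files"


class ZOrderTask(OptimizeTask):
    def run(self) -> str:
        # Placeholder for clustering the files' rows along a Z-order curve.
        return f"z-ordered {len(self.files)} files"


def plan_tasks(
    files: Sequence[DataFile],
    max_batch_bytes: int,
    task_cls: Type[OptimizeTask],
) -> List[OptimizeTask]:
    """Greedily group files so each task stays within max_batch_bytes.

    A single file larger than the limit still gets a task of its own.
    """
    tasks: List[OptimizeTask] = []
    batch: List[DataFile] = []
    batch_size = 0
    for f in files:
        if batch and batch_size + f.size > max_batch_bytes:
            tasks.append(task_cls(batch))
            batch, batch_size = [], 0
        batch.append(f)
        batch_size += f.size
    if batch:
        tasks.append(task_cls(batch))
    return tasks


def execute(tasks: Sequence[OptimizeTask], max_concurrent: int = 15) -> List[str]:
    """Run tasks with at most max_concurrent in flight; results keep task order."""
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(lambda t: t.run(), tasks))
```

Because the batch limit is an argument to `plan_tasks`, a caller could pass one limit when building `ZOrderTask`s and a different one for `CompactionTask`s, which is the separation the description is after.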

## How was this patch tested?
Unit tests verify that a large number of files is broken up into multiple tasks.
A unit test verifies that we do not create Z-Cubes smaller than `optimize.zorder.mergeStrategy.minCubeSize.threshold` unless we have to.
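
The property checked by the second test can be illustrated with a small sketch. This is an assumption about the general shape of such logic, not the actual merge strategy: `group_into_cubes` and `min_cube_size` are hypothetical names, and the real threshold is the configuration value named above.

```python
from typing import List


def group_into_cubes(file_sizes: List[int], min_cube_size: int) -> List[List[int]]:
    """Group file sizes, in order, into cubes of at least min_cube_size.

    A trailing remainder that is too small is folded into the previous
    cube. An undersized cube is produced only when the whole input is
    smaller than the threshold, i.e. when there is no other choice.
    """
    cubes: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for size in file_sizes:
        current.append(size)
        current_size += size
        if current_size >= min_cube_size:
            cubes.append(current)
            current, current_size = [], 0
    if current:
        if cubes:
            cubes[-1].extend(current)  # avoid emitting a small trailing cube
        else:
            cubes.append(current)  # forced: total input is below the threshold
    return cubes
```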

Please review http://spark.apache.org/contributing.html before opening a pull request.

Authored-by: David Lewis <david.lewis@databricks.com>
Signed-off-by: Mukul Murthy <mukul.murthy@databricks.com>
GitOrigin-RevId: 8a39e20766696126aa7031ca2eb24d6b480914bb
## What changes were proposed in this pull request?

- Add NOTICE file contents from Spark, per the ALv2 requirement
- Symlink NOTICE so it gets built into the artifact
- Add the Spark LICENSE to LICENSE; possibly overkill, but complete

## How was this patch tested?

N/A

Closes #4951 from srowen/LicenseNotice.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 02c33f94b9b26e9dd42cc912205fc3bc45d98a1b
Switch the OSS Delta build to Coursier to speed up dependency resolution.

Closes #4952 from ahirreddy/oss-test-pr-trigger.

Authored-by: Ahir Reddy <ahirreddy@gmail.com>
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
GitOrigin-RevId: e8959d233ef49ac7f506592b6391f09b991ae2eb
…r HDFS

## What changes were proposed in this pull request?

- Implemented HDFSLogStoreImpl for OSS
  - Uses FileContext for all ops and FileContext.rename to ensure atomic writes (atomic for both overwrite = true and false)
- Added a lot of Scaladoc comments.
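
The write-then-rename pattern behind the atomic writes mentioned above can be sketched on a local filesystem. This is an analogy only, written in Python with the standard library; it does not use Hadoop's `FileContext`, and `atomic_write` is a hypothetical helper. Here `os.replace` stands in for a rename that may overwrite, and `os.link` for one that must fail if the target already exists.

```python
import os
import tempfile
from typing import Iterable


def atomic_write(path: str, lines: Iterable[str], overwrite: bool = False) -> None:
    """Write lines to a temp file in the target directory, then publish it
    in a single step so readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())
        if overwrite:
            os.replace(tmp, path)  # atomically replaces any existing file
        else:
            os.link(tmp, path)  # raises FileExistsError if path already exists
    finally:
        # After os.replace the temp name is gone; after os.link or a failure
        # it is still present and must be cleaned up.
        if os.path.exists(tmp):
            os.remove(tmp)
```

The temp file is created in the same directory as the target so that the final step never crosses a filesystem boundary, which is what keeps it a single atomic operation.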

## How was this patch tested?

Fixed the unit test to work with both HDFSLogStoreImpl and HDFSLogStore. Tested the OSS part by locally running `delta/build/copybara.py --sbt-args "testOnly *LogStore*"`

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 2377995a20aa2013a09fa14a400dd2e9b9a0292d
## What changes were proposed in this pull request?

Self-explanatory

## How was this patch tested?

No need

Closes #4964 from tdas/delta-lake-notic.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: c6f90a323f20c65e4e8bca033a760b7823fd08c8
## What changes were proposed in this pull request?

The Google mirror of Maven Central is more stable than Maven Central itself. Place it first to avoid hitting flaky Maven Central errors.

Closes #4966 from zsxwing/google-maven.

Authored-by: Shixiong Zhu <shixiong@databricks.com>
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
GitOrigin-RevId: 14deb32c3c6ca2fda6cf8a92b62fe41dead8ad65
## What changes were proposed in this pull request?

- Added a script to automatically update any existing license header to the new header
- For a few files, manually updated their DB and Apache licenses and excluded them from the script
- Added new license headers to other non-Scala files
- Replaced all instances of "Databricks Delta" with "Delta Lake"
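
A header-update script of the kind described in the first bullet might look roughly like the sketch below. It is not the script from this change: `NEW_HEADER` is placeholder text, `update_header` is a hypothetical name, and treating any leading block comment that mentions "Copyright" as the old header is an assumption.

```python
import re

# Placeholder; the real header text is not reproduced here.
NEW_HEADER = """/*
 * Copyright (new header text goes here)
 */
"""

# A block comment at the very start of the file (leading whitespace allowed)
# that contains the word "Copyright" before its closing "*/".
LEADING_LICENSE = re.compile(
    r"\A\s*/\*(?:(?!\*/).)*Copyright(?:(?!\*/).)*\*/[ \t]*\n?",
    re.DOTALL,
)


def update_header(source: str, new_header: str = NEW_HEADER) -> str:
    """Replace an existing leading license comment, or prepend the new one."""
    body = LEADING_LICENSE.sub("", source, count=1)
    return new_header + "\n" + body.lstrip("\n")
```

Running the function on its own output leaves the text unchanged, so such a script can be applied repeatedly without stacking headers.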

## How was this patch tested?

Existing DBR and OSS delta tests

Closes #4970 from tdas/SC-17340.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 0d7eb26b553957af3e732e68bc1f4c269d6823bf
## What changes were proposed in this pull request?

- Update build scripts to match the newly created Bintray repo.
- Remove unused SBT plugins.
- Fix minor issues in README.

Closes #4971 from zsxwing/delta-oss-cleanup.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Michael Armbrust <michael@databricks.com>
GitOrigin-RevId: e079220522f18c9bb57a0849656b3b504cc02ef6
## What changes were proposed in this pull request?

This PR open-sources some time travel tests.

## How was this patch tested?

Copybara'd to the open source repo and ran the tests

Closes #4909 from brkyvz/ttTests.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: liwensun <liwen.sun@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 853989251f00d43ac71932fc64e25904ff51ba6c
## What changes were proposed in this pull request?

- Upgrade to Spark 2.4.2-rc1.
- Always enable `spark.sql.legacy.sources.write.passPartitionByAsOptions` in `DeltaDataSource` constructor and remove `partitionByHack`.
- Fix filter push down.

## How was this patch tested?

New unit tests.

Closes #4955 from zsxwing/spark2.4.2-rc1.

Lead-authored-by: Shixiong Zhu <shixiong@databricks.com>
Co-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Michael Armbrust <michael@databricks.com>
GitOrigin-RevId: ac17c3e7cf5e19cebbdbbbea42025fb8bf877f84
GitOrigin-RevId: 123d4b89bb6a88d0c2d224d79ddc572aa1420512
GitOrigin-RevId: 6a748039c56bb56e951cdb6cfff9e2b9dd9ee6e3
* Reduce to ~2g

* Update build.sbt

* Update build.sbt

* Update sbt-launch-lib.bash
## What changes were proposed in this pull request?

Added intro
Fixed license

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: d76f6d04ca9f06fdfbea1dab34d44baacbf714c4
…pass tests

## What changes were proposed in this pull request?

Title explains it all

## How was this patch tested?

https://circleci.com/gh/delta-io/delta/6

Closes #4979 from brkyvz/fixSbtChanges.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 5a9be10747151e0ee44c1b02ed42cebb6770d8d9
## What changes were proposed in this pull request?

Update links to use the Delta OSS repo.

Closes #4977 from zsxwing/update-links.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Michael Armbrust <michael@databricks.com>
GitOrigin-RevId: 82976341f2f720e08dd6ed36d0509b7d59c2e4de
## What changes were proposed in this pull request?

Port streaming source tests.

Closes #4913 from jose-torres/oss4.

Lead-authored-by: Jose Torres <joseph.torres@databricks.com>
Co-authored-by: Ahir Reddy <ahirreddy@gmail.com>
Signed-off-by: Jose Torres <joseph.torres@databricks.com>
GitOrigin-RevId: ca7c785107aeafe2f0924749982bf229d6eaac1f
GitOrigin-RevId: cde50a92f7370d429583860788526335ba4a6605
Minor typo

databricks-cla-assistant commented Apr 23, 2019

CLA assistant check
Thank you for your submission, we really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
2 out of 9 committers have signed the CLA.

✅ databricks-david-lewis
✅ ahirreddy
❌ srowen
❌ tdas
❌ zsxwing
❌ brkyvz
❌ mukulmurthy
❌ jose-torres
❌ marmbrus

@tdas tdas closed this Apr 26, 2019
LantaoJin added a commit to LantaoJin/delta that referenced this pull request Mar 24, 2020
 [CARMEL-1869] Update Carmel Delta Lake to latest version (0.5.0)
rymurr pushed a commit to rymurr/delta that referenced this pull request Sep 30, 2020
…lta-io#5)

* pick workflows onto branch-0.6

* Parse file metadata as a separate task

* change version to distinguish this branch
LantaoJin added a commit to LantaoJin/delta that referenced this pull request Mar 12, 2021
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request Jul 6, 2022
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 23, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 23, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 23, 2024
# This is the 1st commit message:

flush

# This is the commit message delta-io#2:

flush

# This is the commit message delta-io#3:

First sane version without isRowDeleted

# This is the commit message delta-io#4:

Hack RowIndexMarkingFilters

# This is the commit message delta-io#5:

Add support for non-vectorized readers

# This is the commit message delta-io#6:

Metadata column fix

# This is the commit message delta-io#7:

Avoid non-deterministic UDF to filter deleted rows

# This is the commit message delta-io#8:

metadata with Expression ID

# This is the commit message delta-io#9:

Fix complex views issue

# This is the commit message delta-io#10:

Tests

# This is the commit message delta-io#11:

cleaning

# This is the commit message delta-io#12:

More tests and fixes

# This is the commit message delta-io#13:

Partial cleaning

# This is the commit message delta-io#14:

cleaning and improvements

# This is the commit message delta-io#15:

cleaning and improvements

# This is the commit message delta-io#16:

Clean RowIndexFilter
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 26, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 26, 2024
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request Apr 26, 2024
10 participants