Cleanup README #4
Closed
…ting rewrite of large partition into multiple jobs

This PR aims to alleviate some pain points of using ZOrder in production, specifically around dealing with large amounts of data. See these tickets for background motivation:

https://databricks.atlassian.net/browse/SC-13048
https://databricks.atlassian.net/browse/SC-15987

## What changes were proposed in this pull request?

This PR adds one new configuration option for controlling the target size of Z-Cubes. It changes the default for the delta max file size to be 20% larger than the default for min file size.

This PR reimagines `OptimizeTask` as a trait with two implementations, one for ZOrder and one for Compaction. The `OptimizeTableCommand.execute` method now takes a sequence of `OptimizeTask`s and executes them, 15 at a time.

Previously the OptimizeTask was a sequence of files that would be interpreted differently based on whether we are performing a compaction or z-order optimize. In the Z-Order case each task represented an unbounded amount of data, generally of size 100 G, while in the Compaction case each task was 1 gigabyte or less. In both cases we would attempt to perform 400 of these tasks in one spark job, and 15 of these jobs at a time.

This new strategy expands the OptimizeTask to encompass one delta transaction, allowing the size of each batch to be configured separately for z-order and compaction.

## How was this patch tested?

- Unit tests to verify that we break lots of files up into multiple tasks.
- Unit test to verify that we will not create Z-cubes of size less than `optimize.zorder.mergeStrategy.minCubeSize.threshold` unless we have to.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Authored-by: David Lewis <david.lewis@databricks.com>
Signed-off-by: Mukul Murthy <mukul.murthy@databricks.com>
GitOrigin-RevId: 8a39e20766696126aa7031ca2eb24d6b480914bb
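The task abstraction described in this commit message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the commit: `FileEntry`, `CompactionTask`, `ZOrderTask`, `describe`, and this `execute` signature are hypothetical names, and the batches are simply enumerated rather than run as Spark jobs. Only the idea of an `OptimizeTask` trait with two implementations, processed 15 at a time, comes from the description.

```scala
object OptimizeTaskSketch {
  // Hypothetical stand-in for a data file tracked by the table.
  final case class FileEntry(path: String, sizeBytes: Long)

  // Each task stands for the work of one transaction, as the description states.
  sealed trait OptimizeTask {
    def files: Seq[FileEntry]
    def describe: String
  }

  // Assumed implementation for the compaction case.
  final case class CompactionTask(files: Seq[FileEntry]) extends OptimizeTask {
    def describe: String = s"compact ${files.size} file(s)"
  }

  // Assumed implementation for the Z-Order case; `columns` is illustrative.
  final case class ZOrderTask(files: Seq[FileEntry], columns: Seq[String])
      extends OptimizeTask {
    def describe: String = {
      val cols = columns.mkString(", ")
      s"z-order ${files.size} file(s) by $cols"
    }
  }

  // Split the tasks into batches of at most `maxConcurrent` (15 by default)
  // and return a description of each batch instead of running real jobs.
  def execute(tasks: Seq[OptimizeTask], maxConcurrent: Int = 15): Seq[Seq[String]] =
    tasks.grouped(maxConcurrent).map(batch => batch.map(_.describe)).toList
}
```

With this sketch, 32 tasks would be split into batches of 15, 15, and 2; a real implementation would presumably run each batch concurrently and commit one transaction per task.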
## What changes were proposed in this pull request?

- Add a NOTICE file contents from Spark per ALv2 requirement
- Symlink NOTICE so it gets built into artifact
- Add Spark LICENSE to LICENSE as possible overkill but complete

## How was this patch tested?

N/A

Closes #4951 from srowen/LicenseNotice.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 02c33f94b9b26e9dd42cc912205fc3bc45d98a1b
Switch the OSS Delta build to Coursier to speed up dependency resolution.

Closes #4952 from ahirreddy/oss-test-pr-trigger.

Authored-by: Ahir Reddy <ahirreddy@gmail.com>
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
GitOrigin-RevId: e8959d233ef49ac7f506592b6391f09b991ae2eb
…r HDFS

## What changes were proposed in this pull request?

- Implemented HDFSLogStoreImpl for OSS
- Uses FileContext for all ops and FileContext.rename to ensure atomic writes (atomic for both overwrite = true and false)
- Added a lot of scala docs.

## How was this patch tested?

Fixed unit test to work with both HDFSLogStoreImpl and HDFSLogStore. Test the OSS part by locally running `delta/build/copybara.py --sbt-args "testOnly *LogStore*"`

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 2377995a20aa2013a09fa14a400dd2e9b9a0292d
## What changes were proposed in this pull request?

Self-explanatory

## How was this patch tested?

No need

Closes #4964 from tdas/delta-lake-notic.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: c6f90a323f20c65e4e8bca033a760b7823fd08c8
## What changes were proposed in this pull request?

Google Mirror of Maven Central is more stable than Maven Central. Place it first to avoid hitting flaky Maven Central errors.

Closes #4966 from zsxwing/google-maven.

Authored-by: Shixiong Zhu <shixiong@databricks.com>
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
GitOrigin-RevId: 14deb32c3c6ca2fda6cf8a92b62fe41dead8ad65
## What changes were proposed in this pull request?

- Added script to automatically update any existing license header to the new header
- For a few files, manually updated their DB and Apache license and excluded them from the script
- Added new license headers to other non scala files
- Replaced all instances of "Databricks Delta" with "Delta Lake"

## How was this patch tested?

Existing DBR and OSS delta tests

Closes #4970 from tdas/SC-17340.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 0d7eb26b553957af3e732e68bc1f4c269d6823bf
## What changes were proposed in this pull request?

- Update build scripts to match the newly created bintray repo.
- Remove unused SBT plugins.
- Fix minor issues in README.

Closes #4971 from zsxwing/delta-oss-cleanup.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Michael Armbrust <michael@databricks.com>
GitOrigin-RevId: e079220522f18c9bb57a0849656b3b504cc02ef6
## What changes were proposed in this pull request?

This PR open sources some time travel tests.

## How was this patch tested?

Copybara'd to open source repo and ran tests

Closes #4909 from brkyvz/ttTests.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: liwensun <liwen.sun@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 853989251f00d43ac71932fc64e25904ff51ba6c
## What changes were proposed in this pull request?

- Upgrade to Spark 2.4.2-rc1.
- Always enable `spark.sql.legacy.sources.write.passPartitionByAsOptions` in `DeltaDataSource` constructor and remove `partitionByHack`.
- Fix filter push down.

## How was this patch tested?

New unit tests.

Closes #4955 from zsxwing/spark2.4.2-rc1.

Lead-authored-by: Shixiong Zhu <shixiong@databricks.com>
Co-authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Michael Armbrust <michael@databricks.com>
GitOrigin-RevId: ac17c3e7cf5e19cebbdbbbea42025fb8bf877f84
GitOrigin-RevId: 123d4b89bb6a88d0c2d224d79ddc572aa1420512
GitOrigin-RevId: 6a748039c56bb56e951cdb6cfff9e2b9dd9ee6e3
* Reduce to ~2g
* Update build.sbt
* Update build.sbt
* Update sbt-launch-lib.bash
## What changes were proposed in this pull request?

Added intro
Fixed license

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: d76f6d04ca9f06fdfbea1dab34d44baacbf714c4
…pass tests

## What changes were proposed in this pull request?

Title explains it all

## How was this patch tested?

https://circleci.com/gh/delta-io/delta/6

Closes #4979 from brkyvz/fixSbtChanges.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
GitOrigin-RevId: 5a9be10747151e0ee44c1b02ed42cebb6770d8d9
Remove trailing spaces
LantaoJin added a commit to LantaoJin/delta that referenced this pull request on Mar 24, 2020
rymurr pushed a commit to rymurr/delta that referenced this pull request on Sep 30, 2020:
* Parse file metadata as a separate task
* change version to distinguish this branch
rymurr pushed a commit to rymurr/delta that referenced this pull request on Nov 3, 2020:
* Parse file metadata as a separate task
* change version to distinguish this branch
rymurr pushed a commit to rymurr/delta that referenced this pull request on Nov 3, 2020:
* Parse file metadata as a separate task
* change version to distinguish this branch
rymurr pushed a commit to rymurr/delta that referenced this pull request on Nov 3, 2020:
* Parse file metadata as a separate task
* change version to distinguish this branch
LantaoJin added a commit to LantaoJin/delta that referenced this pull request on Mar 12, 2021
rymurr pushed a commit to rymurr/delta that referenced this pull request on Aug 25, 2021:
* Parse file metadata as a separate task
* change version to distinguish this branch
rymurr pushed a commit to rymurr/delta that referenced this pull request on Aug 25, 2021:
* Parse file metadata as a separate task
* change version to distinguish this branch
* log store chooses where checkpoints go (delta-io#6)
* handle snapshot names (delta-io#9)

Signed-off-by: Ryan Murray rymurr@gmail.com
rymurr pushed a commit to rymurr/delta that referenced this pull request on Aug 25, 2021:
* Parse file metadata as a separate task
* change version to distinguish this branch
* log store chooses where checkpoints go (delta-io#6)
* handle snapshot names (delta-io#9)
jbguerraz pushed a commit to jbguerraz/delta that referenced this pull request on Jul 6, 2022:
update master
tdas pushed a commit to tdas/delta that referenced this pull request on May 31, 2023:
- Updated README
- Added LICENSE, CONTRIBUTING, NOTICE
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 23, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 23, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix # This is the commit message delta-io#7: Avoid non-deterministic UDF to filter deleted rows # This is the commit message delta-io#8: metadata with Expression ID # This is the commit message delta-io#9: Fix complex views issue # This is the commit message delta-io#10: Tests # This is the commit message delta-io#11: cleaning # This is the commit message delta-io#12: More tests and fixes
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 23, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix # This is the commit message delta-io#7: Avoid non-deterministic UDF to filter deleted rows # This is the commit message delta-io#8: metadata with Expression ID # This is the commit message delta-io#9: Fix complex views issue # This is the commit message delta-io#10: Tests # This is the commit message delta-io#11: cleaning # This is the commit message delta-io#12: More tests and fixes # This is the commit message delta-io#13: Partial cleaning # This is the commit message delta-io#14: cleaning and improvements # This is the commit message delta-io#15: cleaning and improvements # This is the commit message delta-io#16: Clean RowIndexFilter
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 26, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 26, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix # This is the commit message delta-io#7: Avoid non-deterministic UDF to filter deleted rows # This is the commit message delta-io#8: metadata with Expression ID # This is the commit message delta-io#9: Fix complex views issue # This is the commit message delta-io#10: Tests # This is the commit message delta-io#11: cleaning # This is the commit message delta-io#12: More tests and fixes
andreaschat-db added a commit to andreaschat-db/delta that referenced this pull request on Apr 26, 2024:
# This is the 1st commit message: flush # This is the commit message delta-io#2: flush # This is the commit message delta-io#3: First sane version without isRowDeleted # This is the commit message delta-io#4: Hack RowIndexMarkingFilters # This is the commit message delta-io#5: Add support for non-vectorized readers # This is the commit message delta-io#6: Metadata column fix # This is the commit message delta-io#7: Avoid non-deterministic UDF to filter deleted rows # This is the commit message delta-io#8: metadata with Expression ID # This is the commit message delta-io#9: Fix complex views issue # This is the commit message delta-io#10: Tests # This is the commit message delta-io#11: cleaning # This is the commit message delta-io#12: More tests and fixes # This is the commit message delta-io#13: Partial cleaning # This is the commit message delta-io#14: cleaning and improvements # This is the commit message delta-io#15: cleaning and improvements # This is the commit message delta-io#16: Clean RowIndexFilter
zhipengmao-db pushed a commit to zhipengmao-db/delta that referenced this pull request on Jul 23, 2024:
Update test_deltatable.py Update test_deltatable.py delta-io#4: Update