[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing #20087
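As context for the title, these are the two session options in question. A minimal sketch of how they would be used when writing to a Hive table (the table name and schema here are made up for illustration):

```sql
-- Session-level codecs that, per this PR, should take effect on Hive table writes
SET spark.sql.parquet.compression.codec=gzip;
SET spark.sql.orc.compression.codec=zlib;

-- Hypothetical table; with this fix, the Parquet files written below use gzip
CREATE TABLE demo_tbl (id INT) STORED AS PARQUET;
INSERT INTO demo_tbl VALUES (1);
```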
Closed

59 commits
- 9bbfe6e [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'Par… (fjh100456)
- 48cf108 [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'Par… (fjh100456)
- 5dbd3ed spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring … (fjh100456)
- 5124f1b spark.sql.parquet.compression.codec[SPARK-21786][SQL] When acquiring … (fjh100456)
- 6907a3e Make comression codec take effect in hive table writing. (fjh100456)
- 67e40d4 Modify test (fjh100456)
- e2526ca Separate the pr (fjh100456)
- 8ae86ee Add test case with the table containing mixed compression codec (fjh100456)
- 94ac716 Revert back (fjh100456)
- 43e041f Revert back (fjh100456)
- ee0c558 Add a new line at the of file (fjh100456)
- e9f705d Fix scala style (fjh100456)
- d3aa7a0 Fix scala style (fjh100456)
- 5244aaf [SPARK-22897][CORE] Expose stageAttemptId in TaskContext (advancedxy)
- b96a213 [SPARK-22938] Assert that SQLConf.get is accessed only on the driver. (juliuszsompolski)
- a05e85e [SPARK-22934][SQL] Make optional clauses order insensitive for CREATE… (gatorsmile)
- b962488 [SPARK-20236][SQL] dynamic partition overwrite (cloud-fan)
- 27c949d [SPARK-22932][SQL] Refactor AnalysisContext (gatorsmile)
- 79f7263 [SPARK-22896] Improvement in String interpolation (chetkhatri)
- a51212b [SPARK-20960][SQL] make ColumnVector public (cloud-fan)
- f51c8fd [SPARK-22944][SQL] improve FoldablePropagation (cloud-fan)
- 1860a43 [SPARK-22933][SPARKR] R Structured Streaming API for withWatermark, t… (felixcheung)
- a7cfd6b [SPARK-22950][SQL] Handle ChildFirstURLClassLoader's parent (yaooqinn)
- eb99b8a [SPARK-22945][SQL] add java UDF APIs in the functions object (cloud-fan)
- 1f5e354 [SPARK-22939][PYSPARK] Support Spark UDF in registerFunction (gatorsmile)
- bcfeef5 [SPARK-22771][SQL] Add a missing return statement in Concat.checkInpu… (maropu)
- cd92913 [SPARK-21475][CORE][2ND ATTEMPT] Change to use NIO's Files API for ex… (jerryshao)
- bc4bef4 [SPARK-22850][CORE] Ensure queued events are delivered to all event q…
- 2ab4012 [SPARK-22948][K8S] Move SparkPodInitContainer to correct package.
- 84707f0 [SPARK-22953][K8S] Avoids adding duplicated secret volumes when init-… (liyinan926)
- ea9da61 [SPARK-22960][K8S] Make build-push-docker-images.sh more dev-friendly.
- 158f7e6 [SPARK-22957] ApproxQuantile breaks if the number of rows exceeds MaxInt (juliuszsompolski)
- 145820b [SPARK-22825][SQL] Fix incorrect results of Casting Array to String (maropu)
- 5b524cc [SPARK-22949][ML] Apply CrossValidator approach to Driver/Distributed… (MrBago)
- f9dcdbc [SPARK-22757][K8S] Enable spark.jars and spark.files in KUBERNETES mode (liyinan926)
- fd4e304 [SPARK-22961][REGRESSION] Constant columns should generate QueryPlanC… (adrian-ionescu)
- 0a30e93 [SPARK-22940][SQL] HiveExternalCatalogVersionsSuite should succeed on… (bersprockets)
- d1f422c [SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator (jkbradley)
- 55afac4 [SPARK-22914][DEPLOY] Register history.ui.port (gerashegalov)
- bf85301 [SPARK-22937][SQL] SQL elt output binary for binary inputs (maropu)
- 3e3e938 [SPARK-22960][K8S] Revert use of ARG base_image in images (liyinan926)
- 7236914 [SPARK-22930][PYTHON][SQL] Improve the description of Vectorized UDFs… (icexelloss)
- e6449e8 [SPARK-22793][SQL] Memory leak in Spark Thrift Server
- 0377755 [SPARK-21786][SQL] When acquiring 'compressionCodecClassName' in 'Par… (fjh100456)
- b66700a [SPARK-22901][PYTHON][FOLLOWUP] Adds the doc for asNondeterministic f… (HyukjinKwon)
- f9e7b0c [HOTFIX] Fix style checking failure (gatorsmile)
- 285d342 [SPARK-22973][SQL] Fix incorrect results of Casting Map to String (maropu)
- bd1a80a Merge remote-tracking branch 'upstream/branch-2.3' (fjh100456)
- 584cdc2 Merge pull request #2 from apache/master (fjh100456)
- 5b150bc Fix test issue (fjh100456)
- 2337edd Merge pull request #1 from apache/master (fjh100456)
- 43e7eb5 Merge branch 'master' of https://github.com/fjh100456/spark (fjh100456)
- 4b89b44 consider the precedence of `hive.exec.compress.output` (fjh100456)
- 6cf32e0 Resume to private and add public function (fjh100456)
- 365c5bf Resume to private and add public function (fjh100456)
- 99271d6 Fix test issue (fjh100456)
- 2b9dfbe Fix test issue (fjh100456)
- 5b5e1df Fix style issue (fjh100456)
- 118f788 Fix style issue (fjh100456)
Although this is the existing behavior, could you investigate how Hive behaves when `Parquet.Compress` is set? https://issues.apache.org/jira/browse/HIVE-7858 Is it the same as ORC?
Sure, I'll look into it in the next few days.
For Parquet, using a Hive client, `parquet.compression` has a higher priority than `mapreduce.output.fileoutputformat.compress`, and table-level compression (set by TBLPROPERTIES) has the highest priority. `parquet.compression` set from the CLI also has a higher priority than `mapreduce.output.fileoutputformat.compress`.

After this PR, the priority is unchanged. If table-level compression is set, other compression settings do not take effect even when `mapreduce.output.fileoutputformat.compress` is set, which matches Hive. But `parquet.compression` set from the Spark CLI does not take effect unless `hive.exec.compress.output` is set to true. This may be because we do not read `parquet.compression` from the session, and I wonder whether that is necessary, since we have `spark.sql.parquet.compression.codec` instead.

For ORC, `hive.exec.compress.output` and `mapreduce.output.fileoutputformat.compress` really have no impact, but table-level compression (set by TBLPROPERTIES) always takes effect. `orc.compression` set from the Spark CLI does not take effect either, even with `hive.exec.compress.output` set to true, which differs from Parquet.

Another question: the comment says "it uses table properties to store compression information". Actually, by manual testing I found that ORC tables can also contain mixed compression codecs, and the data can still be read back correctly, so maybe I am not clear on what the comment means.

My Hive version for this test is 1.1.0. It is a little difficult for me to get a newer runnable Hive client.
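The Parquet precedence chain described above can be sketched in Scala. This is a hypothetical illustration, not Spark's actual implementation; the object and method names are made up, and the maps stand in for TBLPROPERTIES and the Spark session conf:

```scala
// Hypothetical sketch of the codec precedence described above:
// table-level TBLPROPERTIES win over the session-level Spark conf,
// which wins over a built-in default.
object CodecPrecedence {
  def resolveParquetCodec(
      tableProps: Map[String, String],   // e.g. parsed from TBLPROPERTIES
      sessionConf: Map[String, String],  // e.g. the Spark session conf
      default: String = "snappy"): String =
    tableProps.get("parquet.compression")
      .orElse(sessionConf.get("spark.sql.parquet.compression.codec"))
      .getOrElse(default)

  def main(args: Array[String]): Unit = {
    // Table-level setting beats the session conf.
    println(resolveParquetCodec(
      Map("parquet.compression" -> "GZIP"),
      Map("spark.sql.parquet.compression.codec" -> "snappy")))  // GZIP
    // No table-level setting: the session conf applies.
    println(resolveParquetCodec(
      Map.empty,
      Map("spark.sql.parquet.compression.codec" -> "lzo")))     // lzo
  }
}
```

Under this reading, `mapreduce.output.fileoutputformat.compress` sits below all three levels for Parquet, which is why it is omitted from the chain.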
The comment might not be correct now. We need to follow how the latest Hive works, if possible. The best way to try Hive (and the other RDBMSs) is using Docker. Maybe you can try Docker?
Ok, I'll try it.