
[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing #19218

Closed
wants to merge 31 commits

Conversation

fjh100456
Contributor

@fjh100456 fjh100456 commented Sep 13, 2017

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

What changes were proposed in this pull request?

Pass ‘spark.sql.parquet.compression.codec’ value to ‘parquet.compression’.
Pass ‘spark.sql.orc.compression.codec’ value to ‘orc.compress’.

How was this patch tested?

Add test.
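
For context, a minimal reproduction sketch of the reported behavior (the table name and values are illustrative, not from this PR):

    // Before this fix, the session-level codec below was not propagated into the
    // Hive write path, so it did not affect the files written for the table.
    spark.sql("CREATE TABLE tab (id INT) PARTITIONED BY (p INT) STORED AS PARQUET")
    spark.sql("SET spark.sql.parquet.compression.codec=gzip")
    spark.sql("INSERT INTO TABLE tab PARTITION (p = 1) SELECT 1")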

…ation doesn't take effect on tables with partition field(s)
@maropu
Member

maropu commented Sep 13, 2017

Could you add tests? For example, you could insert some data and then check whether the data is compressed by listing the files in a temp dir.

…ation doesn't take effect on tables with partition field(s)

Add test.
@fjh100456
Contributor Author

cc @maropu

I have added the test. However, none of my test cases run properly in my local environment, so I'm not sure whether the new test case will pass; I will keep an eye on it.

@maropu
Member

maropu commented Sep 15, 2017

@gatorsmile Is it worth fixing this? If so, could you trigger tests?

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81813 has finished for PR 19218 at commit 4e70fff.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@maropu @fjh100456 If the issue is real, we should definitely fix it. However, the PR description must be wrong: if this issue exists, it should apply to both partitioned and non-partitioned ORC/Parquet tables. cc @dongjoon-hyun I think you might be interested in the ORC side.

@fjh100456
Contributor Author

@gatorsmile

  1. Non-partitioned tables do not have this problem; 'spark.sql.parquet.compression.codec' takes effect normally, because their write path differs from that of a partitioned table.
  2. ORC has no equivalent 'spark.sql.*' configuration for this; only 'orc.compress' can be used, which is not a Spark configuration.

…ation doesn't take effect on tables with partition field(s)

Fix scala style.
@SparkQA

SparkQA commented Sep 15, 2017

Test build #81815 has finished for PR 19218 at commit 3f022f9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ation doesn't take effect on tables with partition field(s)

Fix scala style.
@SparkQA

SparkQA commented Sep 15, 2017

Test build #81816 has finished for PR 19218 at commit 6d77bf9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ation doesn't take effect on tables with partition field(s)

Fix test problem
@SparkQA

SparkQA commented Sep 15, 2017

Test build #81820 has finished for PR 19218 at commit 42aca3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you for pinging me, @gatorsmile.

@@ -101,6 +101,13 @@ case class InsertIntoHiveTable(
    val tmpLocation = getExternalTmpPath(sparkSession, hadoopConf, tableLocation)
    val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, false)

    tableDesc.getOutputFileFormatClassName match {
      case "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat" =>

@fjh100456
Contributor Author

@dongjoon-hyun Thank you very much, I'll fix it now.

…ation doesn't take effect on tables with partition field(s)

Fix the similar issue of orc compression
…ation doesn't take effect on tables with partition field(s)

Fix test problem
@SparkQA

SparkQA commented Sep 16, 2017

Test build #81839 has finished for PR 19218 at commit 5cbe999.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81840 has finished for PR 19218 at commit 732266c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

…ation doesn't take effect on tables with partition field(s)

Fix test problem
@fjh100456
Contributor Author

fjh100456 commented Sep 16, 2017

@dongjoon-hyun

I have run into a problem. There are two ways to specify the compression format:

  1. CREATE TABLE Test(id int) STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY');
  2. set orc.compress=ZLIB;

If a compression format was already specified when the table was created, and a different format is then specified by setting 'orc.compress', the latter takes effect.

So either the Spark side should have no default value, so that we can treat the unset case as 'undefined'; or we discard this change and document that 'spark.sql.parquet.compression.codec' does not take effect on partitioned tables and that 'spark.sql.orc.compression.codec' is not valid for Hive tables. Or perhaps you have a better solution.
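
To illustrate the behavior described above, a sketch (the table name is illustrative):

    // Table-level codec set at creation time, then overridden at the session level.
    spark.sql("CREATE TABLE Test (id INT) STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY')")
    spark.sql("SET orc.compress=ZLIB")
    // Per the observation above, subsequent inserts use ZLIB, i.e. the
    // session-level setting wins over the table property.
    spark.sql("INSERT INTO TABLE Test SELECT 1")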

@SparkQA

SparkQA commented Sep 16, 2017

Test build #81841 has finished for PR 19218 at commit c7ff62c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

gatorsmile commented Sep 17, 2017

I see. If you set spark.sql.hive.convertMetastoreParquet to false, you will also hit this issue for non-partitioned tables.

Please update your PR description and title.
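
A sketch of that scenario (the table name is illustrative, not from this PR):

    // With the metastore conversion disabled, writes go through the Hive SerDe
    // path, so the same codec-propagation issue applies to a non-partitioned table.
    spark.sql("SET spark.sql.hive.convertMetastoreParquet=false")
    spark.sql("SET spark.sql.parquet.compression.codec=gzip")
    spark.sql("CREATE TABLE plain_tab (id INT) STORED AS PARQUET")
    spark.sql("INSERT INTO TABLE plain_tab SELECT 1")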

@@ -101,6 +101,19 @@ case class InsertIntoHiveTable(
    val tmpLocation = getExternalTmpPath(sparkSession, hadoopConf, tableLocation)
    val fileSinkConf = new FileSinkDesc(tmpLocation.toString, tableDesc, false)

    tableDesc.getOutputFileFormatClassName match {
Member

Move the whole logic into saveAsHiveFile, which is shared by InsertIntoHiveDirCommand and InsertIntoHiveTable. Both need this logic.
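
For example, a rough sketch of such a shared hook inside saveAsHiveFile, following the shape that appears later in this PR (the exact helper name and placement are not final):

    // Resolve the format-specific compression from the table info and SQLConf,
    // then push it into the Hadoop conf used for the write.
    HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, sparkSession.sessionState.conf)
      .foreach { case (compressionKey, codec) => hadoopConf.set(compressionKey, codec) }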

@gatorsmile
Member

@fjh100456 We have a priority order among three different inputs; here you only consider one of the three. Please also add the extra checks. Hopefully @dongjoon-hyun can help answer your questions; he just finished the related work in #19055

@dongjoon-hyun
Member

Sorry, guys. I've been away from keyboard since last Friday night. I'll be back on next Tuesday (PST).

@fjh100456 fjh100456 changed the title [SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' configuration doesn't take effect on tables with partition field(s) [SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing Sep 18, 2017
@fjh100456
Contributor Author

I will change the code following @gatorsmile's suggestion. It has been a little busy these days; I will do it tomorrow.

…rk.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

Move the whole determination logics to HiveOptions
@fjh100456
Contributor Author

@gatorsmile @maropu Does it look better now? Is there any suggestion about the statistics issue?

@SparkQA Please start the tests, thanks.

@gatorsmile
Member

ok to test

@SparkQA

SparkQA commented Dec 20, 2017

Test build #85154 has finished for PR 19218 at commit d779ee6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

…rk.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

Fix scala style
@SparkQA

SparkQA commented Dec 20, 2017

Test build #85173 has finished for PR 19218 at commit 0cb7b7a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@fjh100456
Contributor Author

@gatorsmile Could you help to review it? Thanks very much!

        Option((ParquetOutputFormat.COMPRESSION, compressionCodec))
      case formatName if formatName.endsWith("orcoutputformat") =>
        val compressionCodec = new OrcOptions(tableInfo.getProperties.asScala.toMap,
          sqlConf).compressionCodec
Member

Also update OrcOptions's compressionCodec to compressionCodecClassName

Contributor Author

The compressionCodec is used in several places, do you mean I should fix them all?

Member

Yeah. Just to make it consistent

@@ -35,7 +39,7 @@ case class TestData(key: Int, value: String)
case class ThreeCloumntable(key: Int, value: String, key1: String)

class InsertSuite extends QueryTest with TestHiveSingleton with BeforeAndAfter
with SQLTestUtils {
with ParquetTest {
Member

This is the insert suite. We are unable to do this.

Could you create a separate suite in the current package org.apache.spark.sql.hive? The suite name can be CompressionCodecSuite

Member

Please also check whether the compression actually takes effect: compare whether the size is smaller than the original size without compression.

Contributor Author

Ok, I will do it.

Contributor Author

It seems a compressed table is not always smaller than an uncompressed one: the SNAPPY-compressed size may be bigger than the uncompressed size when the amount of data is small. So I'd like to check that the sizes are not equal when the compression codecs are different.
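
A rough sketch of such a check (the helper and directory names are hypothetical, not from this PR):

    // Hypothetical helper: total size of the files under a table's directory.
    def tableDirSize(dir: java.io.File): Long =
      dir.listFiles().filter(_.isFile).map(_.length()).sum

    // Write the same data under two different codecs and assert that the on-disk
    // sizes differ, rather than asserting compressed < uncompressed:
    // assert(tableDirSize(snappyDir) != tableDirSize(uncompressedDir))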

Member

Fine to me. Thanks!

    tableInfo.getOutputFileFormatClassName.toLowerCase match {
      case formatName if formatName.endsWith("parquetoutputformat") =>
        val compressionCodec = new ParquetOptions(tableInfo.getProperties.asScala.toMap,
          sqlConf).compressionCodecClassName
Member

We normally do not split the code like this. We like the following way:

    val tableProps = tableInfo.getProperties.asScala.toMap
    tableInfo.getOutputFileFormatClassName.toLowerCase match {
      case formatName if formatName.endsWith("parquetoutputformat") =>
        val compressionCodec = new ParquetOptions(tableProps, sqlConf).compressionCodecClassName
        Option((ParquetOutputFormat.COMPRESSION, compressionCodec))
...

Contributor Author

@fjh100456 fjh100456 Dec 22, 2017

Yes it looks better, I will change it.

@@ -19,7 +19,16 @@ package org.apache.spark.sql.hive.execution

import java.util.Locale

import scala.collection.JavaConverters._

import org.apache.hadoop.hive.ql.plan.{FileSinkDesc, TableDesc}
Member

FileSinkDesc is still needed?

Contributor Author

I will remove it.

    val parquetCompressionConf = parameters.get(ParquetOutputFormat.COMPRESSION)
    val codecName = parameters
      .get("compression")
      .orElse(parquetCompressionConf)
Member

Is this new? Did we support parquet.compression before this PR?

Contributor Author

Yes, it's new. I guess ParquetOptions was not used when writing Hive tables before, because it was not visible to the Hive module. I changed it to public.

Member

Could we keep the old behavior and add this later? We do not want to mix multiple issues in the same PR.

Contributor Author

If so, Parquet's table-level compression may be overwritten by this PR, which may not be what we want.
Shall I fix it first in another PR?

Member

Yeah, we can submit a separate PR for that issue. The behavior change needs to be documented in SparkSQL doc.

HiveOptions.getHiveWriteCompression(fileSinkConf.getTableInfo, sparkSession.sessionState.conf)
.foreach{ case (compression, codec) =>
hadoopConf.set(compression, codec)
}
Member

.foreach { case (compression, codec) => hadoopConf.set(compression, codec) }

@gatorsmile
Member

Could you also add another test scenario? For existing Hive tables (created by Hive), does Spark respect them? Do we use the existing compression configuration?

@fjh100456
Contributor Author

@gatorsmile
I have tested this manually. When table-level compression is not configured, it always takes the session-level compression and ignores the compression of the existing files. That seems like a bug; however, table files with multiple compression codecs do not affect reading or writing.
Is it OK to add a test that checks reading and writing when there are multiple compression codecs among the existing table files?

@gatorsmile
Member

gatorsmile commented Dec 23, 2017

What are multiple compressions?

@fjh100456
Contributor Author

fjh100456 commented Dec 23, 2017

I mean several files with different compression codecs under the same table directory (a screenshot of the directory listing was attached here).

They were generated by changing the configuration before each insert, as in the following Scala code:

  spark.sql("set spark.sql.parquet.compression.codec=uncompressed")
  spark.sql("insert into table temp_parquet_b select * from datasource_table")
  spark.sql("set spark.sql.parquet.compression.codec=snappy")
  spark.sql("insert into table temp_parquet_b select * from datasource_table")
  spark.sql("set spark.sql.parquet.compression.codec=gzip")
  spark.sql("insert into table temp_parquet_b select * from datasource_table")

@SparkQA

SparkQA commented Dec 23, 2017

Test build #85334 has finished for PR 19218 at commit 78e0403.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 23, 2017

Test build #85335 has finished for PR 19218 at commit 7804f60.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ParquetOptions(

@SparkQA

SparkQA commented Dec 23, 2017

Test build #85340 has finished for PR 19218 at commit 52cdd75.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@fjh100456 Yeah. Please also add test cases with a table containing mixed compression codecs. Thanks!

I have some comments about your fix. See my commit:
d8fbdae

I will review your fix later
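
A sketch of what such a mixed-codec test case might check, reusing the insert pattern shown earlier (the table name is illustrative; withTable is the SQLTestUtils helper assumed to be available in the suite):

    // Write into the same table under several session-level codecs, then verify
    // the mixed-codec table can still be read back correctly.
    withTable("mixed_tab") {
      spark.sql("CREATE TABLE mixed_tab (id INT) STORED AS PARQUET")
      Seq("uncompressed", "snappy", "gzip").foreach { codec =>
        spark.sql(s"SET spark.sql.parquet.compression.codec=$codec")
        spark.sql("INSERT INTO TABLE mixed_tab SELECT 1")
      }
      assert(spark.table("mixed_tab").count() == 3)
    }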

class CompressionCodecSuite extends TestHiveSingleton with ParquetTest {
  import spark.implicits._

  private val maxRecordNum = 100000
Member

Could you reduce it to a smaller number? The test cases are very slow to run.

@fjh100456
Contributor Author

I have finished writing the test case with a table containing mixed compression codecs. But I made a mistake: the original branch was deleted accidentally, so I will close this PR and create another PR. Sorry.

@fjh100456
Contributor Author

Please go to #20087

@fjh100456 fjh100456 closed this Dec 27, 2017
asfgit pushed a commit that referenced this pull request Jan 20, 2018
…rk.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

[SPARK-21786][SQL] The 'spark.sql.parquet.compression.codec' and 'spark.sql.orc.compression.codec' configuration doesn't take effect on hive table writing

What changes were proposed in this pull request?

Pass ‘spark.sql.parquet.compression.codec’ value to ‘parquet.compression’.
Pass ‘spark.sql.orc.compression.codec’ value to ‘orc.compress’.

How was this patch tested?

Add test.

Note:
This is the same issue mentioned in #19218 . That branch was deleted mistakenly, so make a new pr instead.

gatorsmile maropu dongjoon-hyun discipleforteen

Author: fjh100456 <fu.jinhua6@zte.com.cn>
Author: Takeshi Yamamuro <yamamuro@apache.org>
Author: Wenchen Fan <wenchen@databricks.com>
Author: gatorsmile <gatorsmile@gmail.com>
Author: Yinan Li <liyinan926@gmail.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: Juliusz Sompolski <julek@databricks.com>
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: jerryshao <sshao@hortonworks.com>
Author: Li Jin <ice.xelloss@gmail.com>
Author: Gera Shegalov <gera@apache.org>
Author: chetkhatri <ckhatrimanjal@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Bago Amirbekian <bago@databricks.com>
Author: Xianjin YE <advancedxy@gmail.com>
Author: Bruce Robbins <bersprockets@gmail.com>
Author: zuotingbing <zuo.tingbing9@zte.com.cn>
Author: Kent Yao <yaooqinn@hotmail.com>
Author: hyukjinkwon <gurwls223@gmail.com>
Author: Adrian Ionescu <adrian@databricks.com>

Closes #20087 from fjh100456/HiveTableWriting.

(cherry picked from commit 00d1691)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
ghost pushed a commit to dbtsai/spark that referenced this pull request Jan 20, 2018