Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4152] [SQL] Avoid data change in CTAS while table already existed #3013

Closed
wants to merge 1 commit into from

Conversation

chenghao-intel
Copy link
Contributor

CREATE TABLE t1 (a String);
CREATE TABLE t1 AS SELECT key FROM src; – throw exception
CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22526 has started for PR 3013 at commit 6d18aa7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22526 has finished for PR 3013 at commit 6d18aa7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22526/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22538 has started for PR 3013 at commit c1ea850.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 30, 2014

Test build #22538 has finished for PR 3013 at commit c1ea850.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22538/
Test FAILed.

@marmbrus
Copy link
Contributor

I'm confused by the failure. Do you possibly have HIVE_DEV_HOME set to Hive 12 or something such that its generating the wrong golden files?

@scwf
Copy link
Contributor

scwf commented Oct 30, 2014

I think @chenghao-intel used hive 0.12 to generate the golden files, while Jenkins test with hive 0.13

@marmbrus
Copy link
Contributor

Yeah, sorry. We have switched everything to Hive 13 (though we should still pass the tests when running in Hive 12 mode, otherwise they should be added to the HiveShim blacklist).

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22619 has started for PR 3013 at commit 1acc914.

  • This patch does not merge cleanly.

@chenghao-intel
Copy link
Contributor Author

Thank you @scwf @marmbrus , I've updated the code for unit testing based on Hive 0.13, and it passed the test locally.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22622 has started for PR 3013 at commit ec72555.

  • This patch merges cleanly.

@chenghao-intel
Copy link
Contributor Author

I also noticed that the golden files changed when switching from Hive 0.12 to 0.13, probably due to the decimal type incompatible. see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-Decimals

@marmbrus
Copy link
Contributor

Yes, blacklist any tests the rely on fixed decimals. This will be fixed by #2983.

@chenghao-intel
Copy link
Contributor Author

I don't think we need to blacklist the tests here. I've added TODO in the code. it's about the string output format for decimal.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22622 has finished for PR 3013 at commit ec72555.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22622/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Oct 31, 2014

Test build #22619 has finished for PR 3013 at commit 1acc914.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22619/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22678 has started for PR 3013 at commit 4085c67.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 1, 2014

Test build #22678 has finished for PR 3013 at commit 4085c67.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22678/
Test PASSed.

"boolean" ^^^ BooleanType |
HiveShim.metastoreDecimal ^^^ DecimalType |
"boolean" ^^^ BooleanType | // TODO decimal Hive 0.12.0
"decimal\\((\\d+),(\\d+)\\)".r ^^^ DecimalType | // TODO decimal Hive 0.13.1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need these todos here? this is both ok for hive 12 and 13, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ".q" files contains the create table like "CREATE TABLE DECIMAL_4_1(key decimal(35,25), value int)", however, the Jenkins seems compile spark sql with Hive 0.12, hence I have to put the both patterns here.

@chenghao-intel
Copy link
Contributor Author

Thank you @scwf @marmbrus for reviewing the code. Actually I am a little confused with the hive versions in Spark SQL. Seems the golden files are based on Hive 0.13, but Jenkins compiles the code with Hive 0.12. This why I have to put some of the work around code with TODOs.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 2, 2014

Jenkins compiles with both versions just to make sure that we aren't breaking backwards compatibility (Hive 12 first). Ideally, we'll set up another job to run the test for Hive 12 in parallel or at least periodically, but for now running both would take too much time.

In terms of semantics I think it is too much overhead to try to faithfully mimic both versions since the primary goal here is metastore compatibility. Thus, the query tests are based on Hive13 and the golden answers are too. It is possible to run nearly of the tests with the Hive12 library too, though in places we act like 13 even though we are compiling with the 12 library. In the few cases where we can't run a given test with both versions of the library there is a special blacklist in the shim.

Full Hive 13 decimal support is now merged, so hopefully we can remove all the special cases from this PR.

@chenghao-intel
Copy link
Contributor Author

Thank you @marmbrus . I've updated the code just for the bug fixing. And will create another PRs for the Hive compatibility testing.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22785 has started for PR 3013 at commit 194113e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 3, 2014

Test build #22785 has finished for PR 3013 at commit 194113e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • abstract class GenericStrategy[PhysicalPlan <: TreeNode[PhysicalPlan]] extends Logging
    • trait RunnableCommand extends logical.Command
    • case class ExecutedCommand(cmd: RunnableCommand) extends SparkPlan
    • protected case class Keyword(str: String)
    • sys.error(s"Failed to load class for data source: $provider")
    • case class EqualTo(attribute: String, value: Any) extends Filter
    • case class GreaterThan(attribute: String, value: Any) extends Filter
    • case class GreaterThanOrEqual(attribute: String, value: Any) extends Filter
    • case class LessThan(attribute: String, value: Any) extends Filter
    • case class LessThanOrEqual(attribute: String, value: Any) extends Filter
    • trait RelationProvider
    • abstract class BaseRelation
    • abstract class TableScan extends BaseRelation
    • abstract class PrunedScan extends BaseRelation
    • abstract class PrunedFilteredScan extends BaseRelation

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22785/
Test PASSed.

@marmbrus
Copy link
Contributor

marmbrus commented Nov 3, 2014

Thanks! Merged to master.

@asfgit asfgit closed this in e83f13e Nov 3, 2014
asfgit pushed a commit that referenced this pull request Nov 3, 2014
CREATE TABLE t1 (a String);
CREATE TABLE t1 AS SELECT key FROM src; – throw exception
CREATE TABLE if not exists t1 AS SELECT key FROM src; – expect do nothing, currently it will overwrite the t1, which is incorrect.

Author: Cheng Hao <hao.cheng@intel.com>

Closes #3013 from chenghao-intel/ctas_unittest and squashes the following commits:

194113e [Cheng Hao] fix bug in CTAS when table already existed

(cherry picked from commit e83f13e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@chenghao-intel chenghao-intel deleted the ctas_unittest branch June 9, 2015 01:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants