
[SPARK-5068] [SQL] Fix bug query data when path doesn't exist for HiveContext #4356

Closed
wants to merge 6 commits

Conversation

chenghao-intel
Contributor

This is a follow-up to #3907 & #3891.

Hive actually supports non-existent paths (either table or partition paths) by yielding empty rows, but Spark SQL throws an exception.

Ideally we would check path existence during partition processing; however, the InputFormat always computes the file splits before that, so an exception is raised if the specified path doesn't exist.

This PR goes back to the solution of #3891 and checks partition/table path existence during Spark plan generation. Of course, we can move that logic into HadoopRDD if it supports non-existent paths in the future.
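The core idea of the PR is to prune missing paths at plan-generation time, before the InputFormat ever computes splits. This is not Spark's actual Scala implementation; it is a minimal Python sketch of that idea, with a hypothetical `prune_missing_paths` helper operating on local paths:

```python
import os

def prune_missing_paths(paths):
    """Drop table/partition paths that don't exist before the reader
    computes file splits, so a missing directory contributes no rows
    (Hive's behavior) instead of raising an exception.
    Hypothetical helper, not Spark's actual code."""
    return [p for p in paths if os.path.exists(p)]
```

In the real patch the check runs against the Hadoop `FileSystem` for each partition or table path while the Spark plan is being built.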

@jeanlyn, @marmbrus, @srowen

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26729 has started for PR 4356 at commit 1f033cd.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26729 has finished for PR 4356 at commit 1f033cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26729/

```scala
private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): Option[String] = {
  if (fs.exists(path)) {
    // ...
        filteredFiles.mkString(",")
      case None => path.toString
```
Contributor

I think we'd better get the FileSystem from the path, because under HDFS NameNode federation we may hit problems such as a "Wrong FS" exception if we use FileSystem.get(sc.hiveconf) to get fs.

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26747 has started for PR 4356 at commit d3a4d3c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 4, 2015

Test build #26747 has finished for PR 4356 at commit d3a4d3c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26747/

```scala
private def applyFilterIfNeeded(path: Path, filterOpt: Option[PathFilter]): Option[String] = {
  val fs = path.getFileSystem(sc.hiveconf)
  if (fs.exists(path)) {
    // ...
      case None => path.toString
```
Contributor
My concern is similar to what @marmbrus mentioned in #3981: it's pretty expensive to check each path serially for tables with lots of partitions, especially when the data reside on S3. Can we use listStatus or globStatus to retrieve all FileStatus objects under some path(s), and then do the filtering locally?
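The reviewer's suggestion is to replace one remote existence check per partition with one listing per parent directory, filtered locally. A minimal Python sketch of that batching idea, with a hypothetical `prune_missing_paths_batched` helper (Spark would call FileSystem.listStatus or globStatus against HDFS/S3 rather than os.listdir):

```python
import os
from collections import defaultdict

def prune_missing_paths_batched(paths):
    """List each parent directory once, then filter the partition
    paths against that listing locally, instead of issuing one
    existence check per path. Hypothetical helper, local paths only."""
    by_parent = defaultdict(list)
    for p in paths:
        by_parent[os.path.dirname(p.rstrip("/"))].append(p)

    existing = set()
    for parent, children in by_parent.items():
        if not os.path.isdir(parent):
            continue  # parent missing: none of its children can exist
        listed = {os.path.join(parent, name) for name in os.listdir(parent)}
        existing.update(c for c in children if c.rstrip("/") in listed)
    return [p for p in paths if p in existing]
```

For N partitions under one table directory this turns N round trips into one, which is the point of the remark about S3 latency.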

@marmbrus
Contributor

Can you reconcile this with #5059 and, if that looks good, close this issue? If we decide to go with the other one, it would be good to include your test cases if you think they are valuable.

@chenghao-intel
Contributor Author

Sorry for the delay, I am closing it.

asfgit pushed a commit that referenced this pull request Apr 12, 2015
…ontext

This PR follows up on PRs #3907, #3891 & #4356.
Following marmbrus's and liancheng's comments, I use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

[1]. Derive a pathPattern from each path and add it to pathPatternSet (e.g. hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*).
[2]. Retrieve all FileStatus objects, and cache them by updating existPathSet.
[3]. Do the filtering locally.
[4]. If we encounter a new pathPattern, repeat steps 1 and 2 (an external table may have more than one partition pathPattern).
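The four steps above can be sketched in miniature. This is not the merged Scala code from #5059; it is a hedged Python sketch over local paths, where `to_pattern` and `verify_partition_paths` are hypothetical names and `depth` stands for the number of partition columns:

```python
import glob

def to_pattern(path, depth):
    """Step [1]: replace the last `depth` path components with '*'
    (e.g. base/2016/08/12 -> base/*/*/* for depth=3)."""
    parts = path.rstrip("/").split("/")
    return "/".join(parts[:-depth] + ["*"] * depth)

def verify_partition_paths(paths, depth):
    """Steps [2]-[4]: expand each *new* pattern once, cache everything
    it matches in exist_path_set, then filter the input paths locally."""
    path_pattern_set = set()
    exist_path_set = set()
    for p in paths:
        pattern = to_pattern(p, depth)
        if pattern not in path_pattern_set:          # step [4]: only new patterns
            path_pattern_set.add(pattern)
            exist_path_set.update(glob.glob(pattern))  # step [2]: one listing
    return [p for p in paths if p.rstrip("/") in exist_path_set]  # step [3]
```

The pattern cache is what keeps an external table with several partition layouts from triggering repeated listings.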

chenghao-intel jeanlyn

Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf,fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style,add config flag,break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exists #2
41f60ce [lazymam500] Merge pull request #1 from apache/master

6 participants