[SPARK-22357][CORE] SparkContext.binaryFiles ignore minPartitions parameter #21638
Conversation
Test build #92309 has finished for PR 21638 at commit
@@ -45,7 +45,8 @@ private[spark] abstract class StreamFileInputFormat[T]
    * which is set through setMaxSplitSize
    */
   def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
-    val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
+    val defaultMaxSplitBytes = Math.max(
+      sc.getConf.get(config.FILES_MAX_PARTITION_BYTES), minPartitions)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
     val defaultParallelism = sc.defaultParallelism
hmm, shouldn't minPartitions be used like this?

val defaultParallelism = Math.max(sc.defaultParallelism, if (minPartitions == 0) 1 else minPartitions)
Could you describe the use case where you need to take minPartitions into account? By default, FILES_MAX_PARTITION_BYTES is 128MB. Let's say it is even set to 1000, and minPartitions equals 10,000. What is the reason to set the max split size in bytes to the minimum number of partitions? Why should a bigger number of partitions require a bigger split size? Could you add more details to the PR description, please?
It seems there is similar code there: spark/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala, lines 424 to 433 at commit e76b012.
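For readers without that file open, the referenced block computes the split size roughly like this (a paraphrase of that code path, not an exact quote; names such as fsRelation and selectedPartitions come from that file, and details may differ by Spark version):

val defaultMaxSplitBytes = fsRelation.sparkSession.sessionState.conf.filesMaxPartitionBytes
val openCostInBytes = fsRelation.sparkSession.sessionState.conf.filesOpenCostInBytes
val defaultParallelism = fsRelation.sparkSession.sparkContext.defaultParallelism
// total size of the selected files, each padded with the per-file open cost
val totalBytes = selectedPartitions.flatMap(_.files.map(_.getLen + openCostInBytes)).sum
val bytesPerCore = totalBytes / defaultParallelism
// cap the split size at the configured max, floor it at the open cost
val maxSplitBytes = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))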
Not sure yet, but let's leave that out of this PR.
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
-    val defaultParallelism = sc.defaultParallelism
+    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
Now it makes much more sense.
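For context, with this change applied the whole method in StreamFileInputFormat (core's PortableDataStream.scala) reads roughly as follows. This is a reconstruction from the hunks shown in this thread, so the file-listing lines in the middle are assumptions (asScala needs scala.collection.JavaConverters._):

def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
  val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
  val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
  val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
  val files = listStatus(context).asScala
  val totalBytes = files.filterNot(_.isDirectory).map(_.getLen + openCostInBytes).sum
  val bytesPerCore = totalBytes / defaultParallelism
  val maxSplitSize = Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
  // CombineFileInputFormat then packs files into splits of at most this size
  super.setMaxSplitSize(maxSplitSize)
}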
Test build #92350 has finished for PR 21638 at commit
@HyukjinKwon please review. thanks.
retest this please
Test build #93067 has finished for PR 21638 at commit
@@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
   def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
     val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
-    val defaultParallelism = sc.defaultParallelism
+    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
If sc.defaultParallelism < 2, and minPartitions is not set in BinaryFileRDD, then previously defaultParallelism would be the same as sc.defaultParallelism, and after the change it will be 2. Have you already considered this case, and do you feel it's the right behavior change to make?
You need to pass in minPartitions to use this method; what do you mean minPartitions is not set?
I mentioned BinaryFileRDD, not this method; you can check the code to see how it handles the default value.
BinaryFileRDD will set minPartitions, which will be either defaultMinPartitions or the value you set via the binaryFiles(path, minPartitions) method. Eventually, this minPartitions value is passed to the setMinPartitions() method.
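Concretely, the path the value takes is roughly this (signatures abridged from SparkContext and BinaryFileRDD, so treat bodies as sketches):

// In SparkContext: the default used when the caller omits minPartitions
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)

def binaryFiles(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[(String, PortableDataStream)] = { /* ... */ }

// In BinaryFileRDD.getPartitions, the value is forwarded to the input format:
inputFormat.setMinPartitions(sc, jobContext, minPartitions)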
Because this method is internal to Spark, why not just take out the parameter? Yes, it's superfluous now, but it's been this way for a while, and it seems perhaps better to avoid a behavior change. In fact you can pull a
Either way works for me, but since this is not a private method, people may use it in their own way. The minimal change would be best.
Except for
Yea, it's internal to Spark. Might be good to keep it but that concern should be secondary IMHO.
Test build #4290 has finished for PR 21638 at commit
Test build #95295 has finished for PR 21638 at commit
Merged to master |
@@ -47,7 +47,7 @@ private[spark] abstract class StreamFileInputFormat[T]
   def setMinPartitions(sc: SparkContext, context: JobContext, minPartitions: Int) {
     val defaultMaxSplitBytes = sc.getConf.get(config.FILES_MAX_PARTITION_BYTES)
     val openCostInBytes = sc.getConf.get(config.FILES_OPEN_COST_IN_BYTES)
-    val defaultParallelism = sc.defaultParallelism
+    val defaultParallelism = Math.max(sc.defaultParallelism, minPartitions)
We should have a test case; otherwise, we could hit the same issue again.
BTW, it is easy to add such a test case. We can even test the behavior at the boundary cases. cc @srowen @HyukjinKwon @MaxGekk @jiangxb1987
I think it's hard to test, technically, because setMinPartitions is only a hint. In the case of binaryFiles we know it will put a hard limit on the number of partitions, but that isn't true of other implementations. We can still make a simple test for all of these; it just may be asserting behavior that could change in the future in Hadoop, though I strongly doubt it would.
I agree it is hard to test. I would appreciate it if anyone could give me some hints on how to do this (how to verify it and where to put my test cases).
Would you mind following up with a test that just asserts that asking for, say, 20 partitions results in 20 partitions? This is technically too specific as a test, but is probably fine for now.
From the code, you can see the calculation is just an intermediate result, and this method does not return any value. Checking the split size does not make sense for this test case because it depends on multiple variables and this is just one of them.
sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
  .set(config.FILES_OPEN_COST_IN_BYTES.key, "0")
  .set("spark.default.parallelism", "1"))
println(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions)
println(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions)
It is not hard to verify whether the parameter minPartitions takes effect. Currently, the description of this parameter is not clear. We need to document clearly which factors impact the actual number of partitions; otherwise, users will not understand how to use it.
@bomeng Could you submit a follow-up PR to add a test case?
Yea, let's add a regression test.
Here is the test code; not sure whether it is right or not:
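The snippet itself did not make it into this thread; judging from the configuration quoted earlier, it presumably looked something like this hypothetical reconstruction (dirpath1 is assumed to be a directory of small test files):

sc = new SparkContext(new SparkConf().setAppName("test").setMaster("local")
  .set(config.FILES_OPEN_COST_IN_BYTES.key, "0")
  .set("spark.default.parallelism", "1"))
// dirpath1: a directory of small binary files created by the test setup
println(sc.binaryFiles(dirpath1, minPartitions = 1).getNumPartitions)
println(sc.binaryFiles(dirpath1, minPartitions = 50).getNumPartitions)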
Ideally the last test should have 50 partitions? Is it because we really need the test data to be at least 50 bytes? Ideally a multiple of 50, I guess.
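That reading matches the arithmetic in setMinPartitions under the configuration above (openCostInBytes = 0, spark.default.parallelism = 1); a worked sketch:

// With minPartitions = 50 and the conf above:
//   defaultParallelism = Math.max(1, 50) = 50
//   bytesPerCore       = totalBytes / 50                  (integer division)
//   maxSplitSize       = Math.min(128MB, Math.max(0, totalBytes / 50))
// If totalBytes < 50 bytes, integer division yields maxSplitSize = 0, and
// unless totalBytes divides evenly by 50 the files cannot pack into exactly
// 50 equal splits, so the observed partition count can deviate from 50.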
…itions parameter

## What changes were proposed in this pull request?

This adds a test following #21638

## How was this patch tested?

Existing tests and new test.

Closes #22356 from srowen/SPARK-22357.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
(cherry picked from commit 4e3365b)
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?
Fix the issue that minPartitions was not used in the method. This is a simple fix and I am not trying to make it complicated. The purpose is to still allow the user to control defaultParallelism through the value of minPartitions, as well as via sc.defaultParallelism.
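For illustration, the intended usage is simply this (the path here is hypothetical):

// Ask for at least 50 partitions when reading a directory of small files
val streams = sc.binaryFiles("/data/blobs", minPartitions = 50)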
How was this patch tested?
I have not provided an additional test since the fix is very straightforward.