[PYSPARK] Fix a typo in "fold" function in rdd.py #5587
Conversation
Can one of the admins verify this patch?
I think that's right, in that it would be more consistent: this is how the Scala function that's ultimately called behaves, with elements passed as the second argument. The operation is commutative, so in principle the order shouldn't matter, but it matters here since the operation is only supposed to modify the first arg. Why does this affect your ability to make a differently typed result? CC @JoshRosen as the initial author.
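For context, a minimal sketch of the per-partition loop in rdd.py's `fold` at the time (paraphrased and illustrative, not the exact source): the element, not the accumulator, ends up as the first argument to `op`, which is the opposite of Scala's `RDD.fold`.

```python
# Paraphrased sketch of PySpark's per-partition fold loop at the time
# (illustrative only, not the exact rdd.py source).
def _fold_partition(iterator, zeroValue, op):
    acc = zeroValue
    for obj in iterator:
        acc = op(obj, acc)   # element first, accumulator second
    yield acc                # Scala's RDD.fold effectively calls op(acc, obj)
```

The per-partition results are then folded together again with `op` on the driver, so `op` also has to be able to combine two accumulators.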
I just followed the example of NAStatCounter in the book "Advanced Analytics with Spark". NAStatCounter is supposed to compute stats over a dataset that contains missing values. Instead of using Scala as the book does, I reimplemented it in Python. My code follows:

```python
from pyspark.statcounter import StatCounter


class NAStatCounter(object):
    """Numeric stats plus a count of missing values."""

    def __init__(self):
        self.stats = StatCounter()
        self.missing = 0

    def add(self, x):
        if not x:
            # Empty / falsy values count as missing.
            self.missing += 1
        elif isinstance(x, NAStatCounter):
            # Merging another counter (combining partial results).
            self.stats.mergeStats(x.stats)
            self.missing += x.missing
        else:
            # Folding in a single raw value.
            self.stats.merge(float(str(x)))
        return self

    def __str__(self):
        return 'stats: ' + str(self.stats) + ' missing: ' + str(self.missing)

    def __repr__(self):
        return self.__str__()
```

Here I make up a dummy dataset and then do the stats:

```python
rdd = sc.parallelize(['1,,2', ',3,1', '1,2,'])
na_stat = rdd.map(lambda x: x.split(','))
z = [NAStatCounter() for i in xrange(3)]  # one counter per column
op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(x, y))
result = na_stat.fold(z, op)
```

Then I get an error like "'str' object has no attribute 'add'", because the "fold" implementation calls op('1', NAStatCounter()), so in the lambda above it becomes '1'.add(NAStatCounter()). However, it is expected to be NAStatCounter().add('1'). As you mentioned, only the first argument may be modified, and I would expect that to be the provided "zeroValue", with the element as the second argument that must not be changed. Intuitively, users write the "zeroValue" as "x" and the element as "y" in the lambda. To work around this, I have to change op to op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(y, x)). That is counter-intuitive and inconsistent with its Scala counterpart, as shown in the example in "Advanced Analytics with Spark".
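The same mismatch shows up with a much smaller, single-counter example (a hypothetical snippet, assuming a PySpark shell where `sc` and the NAStatCounter class above are available):

```python
# Hypothetical single-counter illustration of the argument order,
# assuming `sc` and the NAStatCounter class defined above.
rdd = sc.parallelize(['1', '2', ''])

# The intuitive, Scala-style op (accumulator first) fails, because the
# per-partition loop calls op(element, accumulator):
#   rdd.fold(NAStatCounter(), lambda acc, x: acc.add(x))
#   -> AttributeError: 'str' object has no attribute 'add'

# Swapping the arguments works with the current behavior:
result = rdd.fold(NAStatCounter(), lambda x, acc: acc.add(x))
print(result)  # stats over 1.0 and 2.0, missing: 1
```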
I agree that swapping the order of the function arguments makes more sense, but at this point I wonder whether it's safe to change this. The old code has been around since Spark 0.7 or something and it's possible that someone could be relying on the existing confusing behavior. Maybe we should just point out this inconsistency in the docs.
This feels like it should be tracked as a JIRA, even if one targeted at version 2+ only to actually change.
Thanks for replying. I guess it might be better to leave it unchanged and point it out in the docs. I will try to make that change in the documentation.
According to the discussion in apache#5587, it's necessary to point out that the lambda function passed to "fold" needs to take its arguments in the opposite order.
Force-pushed from ae08714 to 555731d.
This will make the "fold" function in rdd.py consistent with "fold" in RDD.scala and with the other "aggregate" functions, where the accumulator ("acc") goes first. Otherwise, users have to write a lambda like "lambda x, y: op(y, x)" if they want to use "zeroValue" to change the result type.
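As an illustration (a hypothetical example with made-up names, assuming the current element-first behavior), using "zeroValue" to build a differently typed result looks like this:

```python
# Hypothetical example of the flipped-lambda workaround described above,
# assuming the current behavior where fold's op receives (element, accumulator).
def merge(acc, x):
    """Accumulator-first op that builds a set of distinct values."""
    if isinstance(x, set):
        acc |= x       # combining two partial results (driver-side step)
    else:
        acc.add(x)     # folding in a single element
    return acc

rdd = sc.parallelize([1, 2, 2, 3], 2)

# Scala-style call that users expect to be able to write:
#   rdd.fold(set(), merge)
# With the element-first order in rdd.py, the arguments must be flipped:
distinct = rdd.fold(set(), lambda x, y: merge(y, x))
print(distinct)  # {1, 2, 3}
```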