Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PYSPARK] Fix a typo in "fold" function in rdd.py #5587

Closed
wants to merge 0 commits into from

Conversation

aihex
Copy link

@aihex aihex commented Apr 20, 2015

This will make the “fold” function consistent with the "fold" in rdd.scala and other "aggregate" functions where “acc” goes first. Otherwise, users have to write a lambda function like “lambda x, y: op(y, x)” if they want to use “zeroValue” to change the result type.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@srowen
Copy link
Member

srowen commented Apr 20, 2015

I think that's right that it would be more consistent, in that this is how the Scala function that's ultimately called for RDD.fold looks:

  def foldLeft[B](z: B)(op: (B, A) => B): B = {
    var result = z
    this.seq foreach (x => result = op(result, x))
    result
  }

Elements are passed as the second arguments. The operation is commutative but it matters here since the operation is only supposed to modify the first arg.

Why does this affect your ability to make a differently typed result? You need aggregate for that anyway I thought.

CC @JoshRosen as initial author

@aihex
Copy link
Author

aihex commented Apr 20, 2015

@JoshRosen

I just follow the example of NAStatCounter in the book of "Advanced Analysis with Spark". NAStatCounter is supposed to get stats of a dataset where some missing values exist.

Instead of using scala the same as the book, I use python to reimplement that. My code follows.

class NAStatCounter(object):

    def __init__(self):
        self.stats = StatCounter()
        self.missing = 0

    def add(self, x):
        if not x:
            self.missing += 1
        elif isinstance(x, NAStatCounter):
            self.stats.mergeStats(x.stats)
            self.missing += x.missing
        else:
            self.stats.merge(float(str(x)))
        return self

    def __str__(self):
        return 'stats: ' + str(self.stats) + ' missing: ' + str(self.missing)

    def __repr__(self):
        return self.__str__()

Here I make up a dummy dataset and the do the stats.

rdd = sc.parallelize(['1,,2', ',3,1', '1,2,'])
na_stat = rdd.map(lambda x: x.split(','))
z = [NAStatCounter for i in xrange(3)]
op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(x, y))
result = na_stat.fold(z, op)

Then I get the error like "'str' object has no attribute 'add'" because it has op('1', NAStatCounter()) in the "fold" implementation. In the specified lambda function, it becomes '1'.add(NAStatCounter()). However, it's expected to be NAStatCounter().add('1').

As you mentioned, only the first argument can be modified and I guess it should be the provided "zeroValue" and the element are the second argument which is not allowed to be changed.

Intuitively, users specify the "zeroValue" as "x" and elements as "y" in the lambda function. To correct that, I have to change op to

op = lambda x, y: map(lambda a: a[0].add(a[1]), zip(y, x))

That's kind of anti-intuitive and inconsistant with its counterpart in scala just as the example in "Advanced Analysis with Spark".

@JoshRosen
Copy link
Contributor

I agree that swapping the order of the function arguments makes more sense, but at this point I wonder whether it's safe to change this. The old code has been around since Spark 0.7 or something and it's possible that someone could be relying on the existing confusing behavior. Maybe we should just point out this inconsistency in the docs.

@srowen
Copy link
Member

srowen commented Apr 20, 2015

This feels like it should be tracked as a JIRA, even if one targeted at version 2+ only to actually change.

@aihex
Copy link
Author

aihex commented Apr 20, 2015

Thanks for replying. I guess it might be better to remain it unchanged and point it out at docs. I will be trying to do the modification at the doc.

aihex pushed a commit to aihex/spark that referenced this pull request Apr 20, 2015
According to the discussion in apache#5587, it’s necessary to point out the
lambda function in “fold” needs to take the opposite order.
@aihex aihex force-pushed the master branch 2 times, most recently from ae08714 to 555731d Compare April 21, 2015 20:58
@aihex aihex closed this Apr 22, 2015
roczei added a commit to roczei/spark that referenced this pull request Aug 14, 2024
What changes were proposed in this pull request?
This PR aims to upgrade okio from 1.15.0 to 1.17.6.

Why are the changes needed?
Okio 1.15.0 is vulnerable due to CVE-2023-3635, details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

Does this PR introduce any user-facing change?
No.

How was this patch tested?
Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?
No.
roczei added a commit to roczei/spark that referenced this pull request Aug 14, 2024
What changes were proposed in this pull request?

This PR aims to upgrade okio from 1.15.0 to 1.17.6.

Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635, details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the CIs.

Was this patch authored or co-authored using generative AI tooling?

No.
yaooqinn pushed a commit that referenced this pull request Aug 15, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 #5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 #5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
roczei added a commit to roczei/spark that referenced this pull request Aug 15, 2024
Backport apache#47758 to 3.5

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

No.

Pass the CIs.

No.

Closes apache#47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit c8cf394)
roczei added a commit to roczei/spark that referenced this pull request Aug 15, 2024
Backport apache#47758 to 3.4

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

No.

Pass the CIs.

No.

Closes apache#47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit c8cf394)
yaooqinn pushed a commit that referenced this pull request Aug 16, 2024
Backport #47758 to 3.5

### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 #5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 #5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227 https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47769 from roczei/roczei/SPARK-45590-branch-3.5.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
yaooqinn pushed a commit that referenced this pull request Aug 16, 2024
Backport #47758 to 3.4

### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 #5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 #5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227 https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47770 from roczei/SPARK-45590-branch-3.4.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Sep 24, 2024
Backport apache#47758 to 3.4

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227 https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

No.

Pass the CIs.

No.

Closes apache#47770 from roczei/SPARK-45590-branch-3.4.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade `okio` from 1.15.0 to 1.17.6.

### Why are the changes needed?

Okio 1.15.0 is vulnerable due to CVE-2023-3635,  details: https://nvd.nist.gov/vuln/detail/CVE-2023-3635

Previous attempts to fix this security issue:

Update okio to version 1.17.6 apache#5587: fabric8io/kubernetes-client#5587
Followup to Update okio to version 1.17.6 apache#5935: fabric8io/kubernetes-client#5935

Unfortunately it is still using 1.15.0:

https://github.com/apache/spark/blob/v4.0.0-preview1/dev/deps/spark-deps-hadoop-3-hive-2.3#L227
https://github.com/apache/spark/blob/v3.5.2/dev/deps/spark-deps-hadoop-3-hive-2.3#L210

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47758 from roczei/SPARK-45590.

Authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants