
[SPARK-22032][PySpark] Speed up StructType conversion #19249

Closed
wants to merge 6 commits

Conversation

maver1ck
Contributor

@maver1ck maver1ck commented Sep 15, 2017

What changes were proposed in this pull request?

StructType.fromInternal calls f.fromInternal(v) for every field.
We can use precalculated information about field types to limit the number of function calls (it is calculated once per StructType and reused in the per-record conversions).
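
The essence of the change, as a simplified sketch (the names match the diffs reviewed below; the conversion flags are computed once in StructType.__init__ and reused for every record):

    # Before: fromInternal is invoked for every field of every record,
    # even when a field needs no conversion at all.
    values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]

    # After: needConversion() is evaluated once per StructType in __init__ ...
    self._needConversion = [f.needConversion() for f in self]

    # ... and the per-record loop skips the call for fields that need none.
    values = [f.fromInternal(v) if n else v
              for f, v, n in zip(self.fields, obj, self._needConversion)]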

Benchmarks (Python profiler)

df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache()
df.count()
df.rdd.map(lambda x: x).count()
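
(The profiles below match the format of PySpark's built-in worker profiler. The PR does not say exactly how they were collected, but one way to obtain similar output is:)

    from pyspark.sql import SparkSession

    # spark.python.profile must be set before the SparkContext is created.
    spark = SparkSession.builder.config("spark.python.profile", "true").getOrCreate()
    df = spark.range(10).cache()
    df.rdd.map(lambda x: x).count()
    spark.sparkContext.show_profiles()  # dumps aggregated cProfile stats per RDD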

Before

310274584 function calls (300272456 primitive calls) in 1320.684 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000  253.417    0.000  486.991    0.000 types.py:619(<listcomp>)
 30000000  192.272    0.000 1009.986    0.000 types.py:612(fromInternal)
100000000  176.140    0.000  176.140    0.000 types.py:88(fromInternal)
 20000000  156.832    0.000  328.093    0.000 types.py:1471(_create_row)
    14000  107.206    0.008 1237.917    0.088 {built-in method loads}
 20000000   80.176    0.000 1090.162    0.000 types.py:1468(<lambda>)

After

210274584 function calls (200272456 primitive calls) in 1035.974 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 30000000  215.845    0.000  698.748    0.000 types.py:612(fromInternal)
 20000000  165.042    0.000  351.572    0.000 types.py:1471(_create_row)
    14000  116.834    0.008  946.791    0.068 {built-in method loads}
 20000000   87.326    0.000  786.073    0.000 types.py:1468(<lambda>)
 20000000   85.477    0.000  134.607    0.000 types.py:1519(__new__)
 10000000   65.777    0.000  126.712    0.000 types.py:619(<listcomp>)

The main differences are types.py:619(<listcomp>) and types.py:88(fromInternal), the latter of which disappears entirely in After.
There are 100 million fewer function calls, and performance is about 20% better.

Benchmark (worst-case scenario)

Test

df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache()
df.count()
df.rdd.map(lambda x: x).count()

Before

31166064 function calls (31163984 primitive calls) in 150.882 seconds

After

31166064 function calls (31163984 primitive calls) in 153.220 seconds

IMPORTANT:
The benchmark was done on top of #19246.
Without #19246 the performance improvement would be even greater.

How was this patch tested?

Existing tests.
Performance benchmark.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81827 has finished for PR 19249 at commit aa69a72.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81828 has finished for PR 19249 at commit e4d7f76.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81829 has finished for PR 19249 at commit 64afb16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Sep 16, 2017

Do you have some benchmarks and numbers with this?

@HyukjinKwon
Member

To be honest, it looks trivial enough that I wouldn't bother otherwise.

@maver1ck
Contributor Author

I was checking this with my production code.
It gives me about a 6-7% speedup and removes 408 million function calls :)

I'll try to create a benchmark for this.

@HyukjinKwon
Member

Did it save 6~7% of the total execution time?

@maver1ck
Contributor Author

maver1ck commented Sep 16, 2017

Yep, in real-world scenarios.

@HyukjinKwon
Member

Okay, let's go ahead then. Let's add some numbers in the PR description.

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Ah, I see. This can be recursive and per-record, and we avoid that here by pre-computing. That makes much sense.

Member

Let's describe this in more detail and add some numbers (and in your other PRs too).

@maver1ck
Contributor Author

maver1ck commented Sep 16, 2017

I added a benchmark for this code.
In the benchmark, the performance boost is even greater (more than 20%).

@gatorsmile
Member

Could you add [PySpark] to the title? cc @ueshin

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Could we run a benchmark for the worst case, when all columns need to be converted? I think here we basically pay an extra if and an extra element in the zip to prevent the function call. This looks okay practically, but I guess we should also identify the downside.

Also, let's add a comment here describing what we are doing, and add a link to this PR so others can refer to the benchmarks.

Contributor Author

@maver1ck maver1ck Sep 17, 2017

I checked the worst-case scenario.

Test

df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache()
df.count()
df.rdd.map(lambda x: x).count()

Before

31166064 function calls (31163984 primitive calls) in 150.882 seconds

After

31166064 function calls (31163984 primitive calls) in 153.220 seconds

So it's a little bit slower (about 2%). But I think that with real-world data this scenario is almost impossible.
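
(For intuition, the per-record overhead discussed above, the extra if and the extra element in the zip, can be measured in isolation with a quick, purely illustrative micro-benchmark that is independent of Spark:)

    import timeit

    fields = list(range(10))
    row = list(range(10))
    needs = [True] * 10       # worst case: every field needs conversion
    conv = lambda v: v + 1    # stand-in for fromInternal

    plain = lambda: [conv(v) for f, v in zip(fields, row)]
    guarded = lambda: [conv(v) if n else v
                       for f, v, n in zip(fields, row, needs)]

    print(timeit.timeit(plain, number=100000))
    print(timeit.timeit(guarded, number=100000))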

Member

Thanks for testing this out. Let's add a comment here to explain what we are doing and move the worst case benchmark into the PR description.

@maver1ck maver1ck changed the title [SPARK-22032] Speed up StructType.fromInternal [SPARK-22032][PySpark] Speed up StructType.fromInternal Sep 17, 2017
@@ -483,7 +483,8 @@ def __init__(self, fields=None):
         self.names = [f.name for f in fields]
         assert all(isinstance(f, StructField) for f in fields),\
             "fields should be a list of StructField"
-        self._needSerializeAnyField = any(f.needConversion() for f in self)
+        self._needConversion = [f.needConversion() for f in self]
+        self._needSerializeAnyField = any(self._needConversion)
Member

I'd rename this, for example to _needConversions (or something else if there is a better name), and leave a comment here explaining why we do this.


@HyukjinKwon HyukjinKwon left a comment
Member

Minimal change and practically significant improvement. LGTM. @ueshin, do you maybe have some comments on this?

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Can we use a similar trick on toInternal?

Member

Yeah, it looks like we could.

Contributor Author

Done.
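
(For context, a simplified sketch of what the analogous toInternal change looks like; this shows only the tuple/Row branch and reuses the precomputed self._needConversion list, mirroring the fromInternal diff above:)

    if self._needSerializeAnyField:
        # Only call toInternal for fields that actually need conversion.
        return tuple(f.toInternal(v) if n else v
                     for f, v, n in zip(self.fields, obj, self._needConversion))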

@maver1ck maver1ck changed the title [SPARK-22032][PySpark] Speed up StructType.fromInternal [SPARK-22032][PySpark] Speed up StructType conversion Sep 17, 2017
@SparkQA

SparkQA commented Sep 17, 2017

Test build #81852 has finished for PR 19249 at commit b1800ac.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 17, 2017

LGTM

@HyukjinKwon
Member

LGTM too, but @maver1ck, could you add some comments around the code and move the worst-case benchmarks into the PR description? I guess this wouldn't be too demanding.

@maver1ck
Contributor Author

Done.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81853 has finished for PR 19249 at commit 8708a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81854 has finished for PR 19249 at commit e9b7798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in f407302 Sep 17, 2017
@ueshin
Member

ueshin commented Sep 18, 2017

A late LGTM. Btw, can we use the same idea for MapType?

@HyukjinKwon
Member

Thanks for double checking @ueshin.

Yes, I noticed that too while reviewing. I decided to merge it as is because I am quite sure of this one: struct type is the root type, this case looks quite common, and it also appears to be the author's first contribution. Even though this change has a downside, in practice the improvement looked worth it.

I am also fine with doing the same for other types (though I am +0 on those).

@maver1ck
Contributor Author

@ueshin
I don't think this works for MapType, because every key/value of a MapType has the same type, so we either need conversion for all entries or for none.

@HyukjinKwon
Member

We could still split needConversion into separate checks for the key and the value, though, and skip the key-conversion or value-conversion call independently?
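
(A purely hypothetical sketch of that suggestion; this was not implemented in this PR, and the method body below is illustrative rather than Spark's actual MapType code:)

    def fromInternal(self, obj):
        if obj is None:
            return
        # Both checks are per-type, so they could be hoisted into __init__.
        key_needs = self.keyType.needConversion()
        value_needs = self.valueType.needConversion()
        return dict((self.keyType.fromInternal(k) if key_needs else k,
                     self.valueType.fromInternal(v) if value_needs else v)
                    for k, v in obj.items())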
