
[SPARK-22032][PySpark] Speed up StructType conversion #19249

Closed
wants to merge 6 commits

Conversation

maver1ck
Contributor

@maver1ck maver1ck commented Sep 15, 2017

What changes were proposed in this pull request?

StructType.fromInternal calls f.fromInternal(v) for every field.
We can use precalculated information about field types to limit the number of function calls (it is calculated once per StructType and reused in the per-record conversions).
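
The essence of the change, as a simplified sketch (the names match the diffs reviewed below; the conversion flags are computed once in StructType.__init__ and reused for every record):

    # Before: fromInternal is invoked for every field of every record,
    # even when a field needs no conversion at all.
    values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]

    # After: needConversion() is evaluated once per StructType in __init__ ...
    self._needConversion = [f.needConversion() for f in self]

    # ... and the per-record loop skips the call for fields that need none.
    values = [f.fromInternal(v) if n else v
              for f, v, n in zip(self.fields, obj, self._needConversion)]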

Benchmarks (Python profiler)

df = spark.range(10000000).selectExpr("id as id0", "id as id1", "id as id2", "id as id3", "id as id4", "id as id5", "id as id6", "id as id7", "id as id8", "id as id9", "struct(id) as s").cache()
df.count()
df.rdd.map(lambda x: x).count()
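
(The profiles below match the format of PySpark's built-in worker profiler. The PR does not say exactly how they were collected, but one way to obtain similar output is:)

    from pyspark.sql import SparkSession

    # spark.python.profile must be set before the SparkContext is created.
    spark = SparkSession.builder.config("spark.python.profile", "true").getOrCreate()
    df = spark.range(10).cache()
    df.rdd.map(lambda x: x).count()
    spark.sparkContext.show_profiles()  # dumps aggregated cProfile stats per RDD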

Before

310274584 function calls (300272456 primitive calls) in 1320.684 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 10000000  253.417    0.000  486.991    0.000 types.py:619(<listcomp>)
 30000000  192.272    0.000 1009.986    0.000 types.py:612(fromInternal)
100000000  176.140    0.000  176.140    0.000 types.py:88(fromInternal)
 20000000  156.832    0.000  328.093    0.000 types.py:1471(_create_row)
    14000  107.206    0.008 1237.917    0.088 {built-in method loads}
 20000000   80.176    0.000 1090.162    0.000 types.py:1468(<lambda>)

After

210274584 function calls (200272456 primitive calls) in 1035.974 seconds

Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 30000000  215.845    0.000  698.748    0.000 types.py:612(fromInternal)
 20000000  165.042    0.000  351.572    0.000 types.py:1471(_create_row)
    14000  116.834    0.008  946.791    0.068 {built-in method loads}
 20000000   87.326    0.000  786.073    0.000 types.py:1468(<lambda>)
 20000000   85.477    0.000  134.607    0.000 types.py:1519(__new__)
 10000000   65.777    0.000  126.712    0.000 types.py:619(<listcomp>)

The main differences are types.py:619(<listcomp>) and types.py:88(fromInternal), the latter of which disappears entirely in After.
There are 100 million fewer function calls, and performance is about 20% better.

Benchmark (worst-case scenario)

Test

df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache()
df.count()
df.rdd.map(lambda x: x).count()

Before

31166064 function calls (31163984 primitive calls) in 150.882 seconds

After

31166064 function calls (31163984 primitive calls) in 153.220 seconds

IMPORTANT:
The benchmark was done on top of #19246.
Without #19246 the performance improvement would be even greater.

How was this patch tested?

Existing tests.
Performance benchmark.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81827 has finished for PR 19249 at commit aa69a72.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81828 has finished for PR 19249 at commit e4d7f76.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 15, 2017

Test build #81829 has finished for PR 19249 at commit 64afb16.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

HyukjinKwon commented Sep 16, 2017

Do you have some benchmarks and numbers with this?

@HyukjinKwon
Member

To be honest, it looks trivial enough that I wouldn't bother otherwise.

@maver1ck
Contributor Author

I was checking this with my production code.
It gives me about a 6-7% speedup and removes 408 million function calls :)

I'll try to create a benchmark for this.

@HyukjinKwon
Member

Did it save 6~7% of the total execution time?

@maver1ck
Contributor Author

maver1ck commented Sep 16, 2017

Yep, in real-world scenarios.

@HyukjinKwon
Member

Okay, let's go ahead then. Let's add some numbers in the PR description.

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Ah, I see. This can be recursive and per-record, and we avoid that here by pre-computing. That makes much sense.

Member

Let's describe this in more detail and add some numbers (and in your other PRs too).

@maver1ck
Contributor Author

maver1ck commented Sep 16, 2017

I added a benchmark for this code.
In the benchmark, the performance boost is even greater (more than 20%).

@gatorsmile
Member

Could you add [PySpark] to the title? cc @ueshin

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Could we run a benchmark for the worst case, when all columns need to be converted? I think here we basically pay an extra if and an extra element in the zip to prevent the function call. This looks okay practically, but I guess we should also identify the downside.

Also, let's add a comment here describing what we are doing, and add a link to this PR so others can refer to the benchmarks.

Contributor Author

@maver1ck maver1ck Sep 17, 2017

I checked the worst-case scenario.

Test

df = spark.range(1000000).selectExpr("current_timestamp as id0", "current_timestamp as id1", "current_timestamp as id2", "current_timestamp as id3", "current_timestamp as id4", "current_timestamp as id5", "current_timestamp as id6", "current_timestamp as id7", "current_timestamp as id8", "current_timestamp as id9").cache()
df.count()
df.rdd.map(lambda x: x).count()

Before

31166064 function calls (31163984 primitive calls) in 150.882 seconds

After

31166064 function calls (31163984 primitive calls) in 153.220 seconds

So it's a little bit slower (about 2%). But I think that with real-world data this scenario is almost impossible.
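
(For intuition, the per-record overhead discussed above, the extra if and the extra element in the zip, can be measured in isolation with a quick, purely illustrative micro-benchmark that is independent of Spark:)

    import timeit

    fields = list(range(10))
    row = list(range(10))
    needs = [True] * 10       # worst case: every field needs conversion
    conv = lambda v: v + 1    # stand-in for fromInternal

    plain = lambda: [conv(v) for f, v in zip(fields, row)]
    guarded = lambda: [conv(v) if n else v
                       for f, v, n in zip(fields, row, needs)]

    print(timeit.timeit(plain, number=100000))
    print(timeit.timeit(guarded, number=100000))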

Member

Thanks for testing this out. Let's add a comment here to explain what we are doing and move the worst case benchmark into the PR description.

@maver1ck maver1ck changed the title [SPARK-22032] Speed up StructType.fromInternal [SPARK-22032][PySpark] Speed up StructType.fromInternal Sep 17, 2017
@@ -483,7 +483,8 @@ def __init__(self, fields=None):
         self.names = [f.name for f in fields]
         assert all(isinstance(f, StructField) for f in fields),\
             "fields should be a list of StructField"
-        self._needSerializeAnyField = any(f.needConversion() for f in self)
+        self._needConversion = [f.needConversion() for f in self]
+        self._needSerializeAnyField = any(self._needConversion)
Member

I'd rename this, for example to _needConversions (or something else if there is a better name), and leave a comment here explaining why we do this.


@HyukjinKwon HyukjinKwon left a comment
Member

Minimal change and practically significant improvement. LGTM. @ueshin, do you maybe have some comments on this?

@@ -619,7 +621,8 @@ def fromInternal(self, obj):
             # it's already converted by pickler
             return obj
         if self._needSerializeAnyField:
-            values = [f.fromInternal(v) for f, v in zip(self.fields, obj)]
+            values = [f.fromInternal(v) if n else v
+                      for f, v, n in zip(self.fields, obj, self._needConversion)]
Member

Can we use a similar trick on toInternal?

Member

Yeah, it looks like we could.

Contributor Author

Done.
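
(For context, a simplified sketch of what the analogous toInternal change looks like; this shows only the tuple/Row branch and reuses the precomputed self._needConversion list, mirroring the fromInternal diff above:)

    if self._needSerializeAnyField:
        # Only call toInternal for fields that actually need conversion.
        return tuple(f.toInternal(v) if n else v
                     for f, v, n in zip(self.fields, obj, self._needConversion))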

@maver1ck maver1ck changed the title [SPARK-22032][PySpark] Speed up StructType.fromInternal [SPARK-22032][PySpark] Speed up StructType conversion Sep 17, 2017
@SparkQA

SparkQA commented Sep 17, 2017

Test build #81852 has finished for PR 19249 at commit b1800ac.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Sep 17, 2017

LGTM

@HyukjinKwon
Member

LGTM too, but @maver1ck, could you add some comments around the code and move the worst-case benchmarks into the PR description? I guess this wouldn't be too demanding.

@maver1ck
Contributor Author

Done.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81853 has finished for PR 19249 at commit 8708a9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2017

Test build #81854 has finished for PR 19249 at commit e9b7798.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in f407302 Sep 17, 2017
@ueshin
Member

ueshin commented Sep 18, 2017

A late LGTM. Btw, can we use the same idea for MapType?

@HyukjinKwon
Member

Thanks for double checking @ueshin.

Yes, I noticed that too while reviewing. I decided to merge it as is because I am quite sure of this one: struct type is the root type, this case looks quite common, and it also appears to be the author's first contribution. Even though this change has a downside, in practice the improvement looked worth it.

I am also fine with doing the same for other types (though I am +0 on those).

@maver1ck
Contributor Author

@ueshin
I don't think this works for MapType, because every key/value of a MapType has the same type, so we either need conversion for all entries or for none.

@HyukjinKwon
Member

We could still split needConversion into separate checks for the key and the value, though, and skip the key-conversion or value-conversion call independently?
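
(A purely hypothetical sketch of that suggestion; this was not implemented in this PR, and the method body below is illustrative rather than Spark's actual MapType code:)

    def fromInternal(self, obj):
        if obj is None:
            return
        # Both checks are per-type, so they could be hoisted into __init__.
        key_needs = self.keyType.needConversion()
        value_needs = self.valueType.needConversion()
        return dict((self.keyType.fromInternal(k) if key_needs else k,
                     self.valueType.fromInternal(v) if value_needs else v)
                    for k, v in obj.items())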
