This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType #267

Open
arthurdk opened this issue Feb 2, 2018 · 4 comments

Comments

@arthurdk

arthurdk commented Feb 2, 2018

Hello,

We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.

We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:

df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- _0: string (nullable = true)
|    |    |-- _1: integer (nullable = true)
|    |    |-- _2: long (nullable = true)

In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.

At first I suspected our script and user-defined functions were the slow part. But then I updated the script to simply read and write the file:

spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")

And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
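For reference, here is a rough sketch of how such a round trip can be timed (the timed helper below is just illustrative and not part of spark-avro; it assumes an existing SparkSession named spark, the com.databricks.spark.avro._ import, and placeholder paths):

import com.databricks.spark.avro._

// Illustrative helper: run a block and print its wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("read + write avro") {
  spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
}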

To give you more info on the files we are loading: there are ~2,500,000 entries and the number of struct elements in the array can be quite high:

val df = spark.read.avro("/path/to/file/in")
import org.apache.spark.sql.functions._
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
|        avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108|        1|   143220|
+-----------------+---------+---------+
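That works out to roughly 2,500,000 × 133 ≈ 330 million nested struct elements per pass over the data. The total can also be checked directly (reusing df and the imports from above):

df.select(sum(size(col("field3"))).as("total_elements")).show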

Could you look into this? If you need any additional information feel free to ask!

@gengliangwang
Contributor

The read and write paths are indeed slower in the current release.
For the 2.0.1 version:

read path:   Avro => Row
write path: Row => Avro

while in 4.0:

read path:   Avro => Row => InternalRow
write path:  InternalRow => Row => Avro

The conversion between Row and InternalRow is slow.

The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

In the next release, this problem should be fixed as:

read path:   Avro => InternalRow
write path:  InternalRow => Avro
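To get a feel for where the time goes, here is a minimal sketch that exercises the Row => InternalRow conversion in isolation. It relies on Spark 2.2's internal org.apache.spark.sql.catalyst.encoders.RowEncoder API and the schema reported above; it only illustrates the per-element conversion cost and is not the exact code path spark-avro takes.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Schema from the issue: two strings plus an array of 3-field structs.
val elementType = StructType(Seq(
  StructField("_0", StringType),
  StructField("_1", IntegerType),
  StructField("_2", LongType)))
val schema = StructType(Seq(
  StructField("field1", StringType),
  StructField("field2", StringType),
  StructField("field3", ArrayType(elementType, containsNull = false))))

// One row holding ~133 nested structs, the reported average array size.
val row = Row("a", "b", Seq.tabulate(133)(i => Row(s"s$i", i, i.toLong)))

val encoder = RowEncoder(schema)
val start = System.nanoTime()
(1 to 100000).foreach(_ => encoder.toRow(row)) // external Row => InternalRow on every call
println(s"100k row conversions took ${(System.nanoTime() - start) / 1e9} s")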

@arthurdk
Author

arthurdk commented Feb 5, 2018

Many thanks for the explanation! Do you have an ETA for the next release?

@gengliangwang
Contributor

There is no ETA yet. I will comment on this issue once it is fixed.

@ryanivanka

May I ask if this issue is already fixed?

Our test Avro file shows more than a 10% performance degradation compared to Spark 1.6.
