This repository has been archived by the owner on Dec 20, 2018. It is now read-only.

Big performance issue when moving from 2.0.1 to 4.0.0 when loading column of type ArrayType #267

Open
arthurdk opened this issue Feb 2, 2018 · 4 comments

Comments

@arthurdk

arthurdk commented Feb 2, 2018

Hello,

We just upgraded our stack from Spark 1.6 to Spark 2.2, and with that we moved from com.databricks:spark-avro_2.10:2.0.1 to com.databricks:spark-avro_2.11:4.0.0.

We noticed a huge increase in the running time of one of our scripts. Here is the schema of the files we are loading from HDFS:

df.printSchema
root
|-- field1: string (nullable = true)
|-- field2: string (nullable = true)
|-- field3: array (nullable = true)
|    |-- element: struct (containsNull = false)
|    |    |-- _0: string (nullable = true)
|    |    |-- _1: integer (nullable = true)
|    |    |-- _2: long (nullable = true)

In Spark 1 our script runs in ~2 minutes vs ~40 minutes in Spark 2.

At first I suspected our script and user-defined functions were the slow part. But then I updated the script to simply read and write the file:

spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")

And we were still facing the same performance issue: in Spark 1 this runs in ~2 minutes and in Spark 2 this runs in ~40 minutes.
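For reference, here is a rough sketch of how such a round trip can be timed (the timed helper below is just illustrative and not part of spark-avro; it assumes an existing SparkSession named spark, the com.databricks.spark.avro._ import, and placeholder paths):

import com.databricks.spark.avro._

// Illustrative helper: run a block and print its wall-clock time.
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
  result
}

timed("read + write avro") {
  spark.read.avro("/path/to/file/in").write.avro("/path/to/file/out")
}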

To give you more info on the files we are loading: there are ~2,500,000 entries and the number of struct elements in the array can be quite high:

val df = spark.read.avro("/path/to/file/in")
import org.apache.spark.sql.functions._
df.select(size(col("field3")).as("size")).select(avg(col("size")), min(col("size")), max(col("size"))).show
+-----------------+---------+---------+
|        avg(size)|min(size)|max(size)|
+-----------------+---------+---------+
|133.0953942943108|        1|   143220|
+-----------------+---------+---------+
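That works out to roughly 2,500,000 × 133 ≈ 330 million nested struct elements per pass over the data. The total can also be checked directly (reusing df and the imports from above):

df.select(sum(size(col("field3"))).as("total_elements")).show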

Could you look into this? If you need any additional information feel free to ask!

@gengliangwang
Contributor

The read and write paths are indeed slower in the current release.
For the 2.0.1 version:

read path:   Avro => Row
write path: Row => Avro

while in 4.0:

read path:   Avro => Row => InternalRow
write path:  InternalRow => Row => Avro

The conversion between Row and InternalRow is slow.

The upside is that computation is faster: https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html

In the next release, this problem should be fixed as:

read path:   Avro => InternalRow
write path:  InternalRow => Avro
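To get a feel for where the time goes, here is a minimal sketch that exercises the Row => InternalRow conversion in isolation. It relies on Spark 2.2's internal org.apache.spark.sql.catalyst.encoders.RowEncoder API and the schema reported above; it only illustrates the per-element conversion cost and is not the exact code path spark-avro takes.

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

// Schema from the issue: two strings plus an array of 3-field structs.
val elementType = StructType(Seq(
  StructField("_0", StringType),
  StructField("_1", IntegerType),
  StructField("_2", LongType)))
val schema = StructType(Seq(
  StructField("field1", StringType),
  StructField("field2", StringType),
  StructField("field3", ArrayType(elementType, containsNull = false))))

// One row holding ~133 nested structs, the reported average array size.
val row = Row("a", "b", Seq.tabulate(133)(i => Row(s"s$i", i, i.toLong)))

val encoder = RowEncoder(schema)
val start = System.nanoTime()
(1 to 100000).foreach(_ => encoder.toRow(row)) // external Row => InternalRow on every call
println(s"100k row conversions took ${(System.nanoTime() - start) / 1e9} s")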

@arthurdk
Author

arthurdk commented Feb 5, 2018

Many thanks for the explanation! Do you have an ETA for the next release?

@gengliangwang
Contributor

There is no ETA yet. I will comment on this issue once it is fixed.

@ryanivanka

May I ask if this issue is already fixed?

Our test Avro file shows more than a 10% performance degradation compared to Spark 1.6.
